which represent an inaccurate or irrelevant response to a prompt. How do we determine if an LLM is hallucinating? We start by measuring text similarity. Let's take a look at how to do that now.

So let's get started with hallucinations and relevance. There are many different ways to determine whether an LLM is hallucinating, which means it's giving an answer that may seem okay at first glance but is really low quality, either due to irrelevance, meaning it isn't related to the question that was asked, or inaccuracy, meaning it contains factual or otherwise incorrect information. We'll explore this using a number of different metrics and different comparisons of the text.

We'll approach this task by looking at two types of comparisons: the prompt to the response, and the response to two other responses to the same prompt from the LLM. We can use a variety of metrics to do so. We'll look at four different metrics, which you can see here, each with different characteristics, and we'll talk about the details of each as we go along.

First, let's get started with setup. We'll import our helpers module that we've been using throughout the course, and then we'll import Evaluate. Evaluate is a Hugging Face library that includes a number of different evaluation metrics for machine learning. My own research is largely centered around evaluation metrics for machine learning, and there's something really painful about using and implementing them. Often they're not fully described in the papers or resources where they're first introduced, and they're rarely implemented exactly the same way across open source tools. That's why it can be really helpful to have packages like Evaluate gain popularity: when one package with a single implementation gains popularity, we start to find more of a consensus on the implementation details.

We'll get started by looking at prompt-response relevance using BLEU scores. BLEU scores have long been used in the natural language processing community, particularly for machine translation. It's a very interesting metric, but it does have some drawbacks. BLEU relies on exact matches of the same tokens across the two texts. It gives us a score from 0 to 1, but the score you get really depends on the dataset. For example, the original paper that introduced the metric saw BLEU scores between 0.05 and 0.26, while other instances have BLEU scores up to 0.8. It really depends on the dataset you're using, and scores are not easily comparable across datasets or tasks.

So how do we calculate a BLEU score? First, we need to capture some important information, so let's go ahead and load the BLEU score code for us to use. The evaluate package just takes load and then the name "bleu", and we'll save the result in a variable named bleu. Here, for one prompt, "approximately how many atoms are in the known universe", we get a response. Let's go ahead and call our BLEU function. We see a number of outputs here: the first is our BLEU score, next come a number of precision values, two of them in this case, and then some penalties and lengths. The BLEU score is the most important part and the one we'll be using for our metric, but we'll also see a little bit about how the precisions work. If you're curious about where those precision scores came from, they're all about comparing tokens across the two text references.
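To make those steps concrete, here's a minimal sketch of loading and calling BLEU with the `evaluate` library. The prompt and response strings are placeholders rather than rows from the course dataset, and `max_order=2` is an assumption chosen to match the two precision values mentioned above.

```python
# Minimal sketch: computing a BLEU score with Hugging Face's `evaluate` library.
import evaluate

bleu = evaluate.load("bleu")

# Placeholder prompt/response pair (not the actual course data).
prompt = "Approximately how many atoms are in the known universe?"
response = "There are approximately 10^80 atoms in the known universe."

result = bleu.compute(
    predictions=[response],   # candidate text(s)
    references=[[prompt]],    # list of reference lists, one per prediction
    max_order=2,              # assumption: compare unigrams and bigrams only
)

print(result["bleu"])         # the BLEU score itself
print(result["precisions"])   # one precision value per n-gram order
print(result["brevity_penalty"], result["length_ratio"])
```

The value we'll carry forward for our metric is `result["bleu"]`; the precisions and penalties are the intermediate pieces described above.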
So for unigrams, we're looking for a single token, and tokens are often words, although they don't have to be. Do we see that token present in both text examples? A bigram is one step up from that: we're not looking for individual words, but for pairs of words that appear together in both examples. So despite there being a lot of common language between these two texts, there's only one true bigram match between them. The BLEU score is calculated using these comparisons: we progressively measure unigrams, bigrams, trigrams, and higher-order n-grams, and we weight them in different ways to combine them into a single score.

Now that we've seen how to calculate a single BLEU score, let's go ahead and create a metric for it. We need to import a function from whylogs to do this. This function is a decorator, a function we can add to decorate a class or another function in our Python code. This decorator registers a function as a new metric to use in whylogs. Our function here is going to be bleu_score. The parameter name is arbitrary, but I like to use text to remind me of the data type that will be used. The output of this function needs to be a list of scores for the data we pass in. In the middle, we need to write a function body that calculates the BLEU score using the function we just used. In this case, text is a dictionary that includes both the prompt and the response.

Now that we've created a new metric, let's go ahead and visualize it using the helper functions we've used in the past. This time, the metric name we're passing in matches the metric name used in the decorator. Okay. What we can see here is that BLEU scores are heavily tailed: many of the scores are very low in our case, and a number of them get up close to 0.5. Now let's look at the examples that have the lowest BLEU scores; these are the ones that are more likely to be hallucinations. To make sure we're looking at the lowest, we set ascending to true. Okay, here are a number of examples, but remember that many of the BLEU scores were close to zero.

Now let's do a similar exercise with BERT scores. How does a BERT score work? Unlike the BLEU score, which focused on comparing the exact text of the tokens, BERT score uses embeddings to find a semantic match between words. We take our two text samples and calculate contextual embeddings for each of the individual words. Contextual embeddings are different from static embeddings because they give different embedding values depending on the context around the word. You can see the difference most easily for words like bank, which could mean a snow bank or the bank that you take your money to. The context around the word bank in the sentence can help determine the difference in the embedding value, whereas with static embeddings you get the same embedding for the word bank regardless of which usage you mean. Once we have the embeddings for each word, we find the pairwise cosine similarity between them: each word in our prompt is compared to each word in our response. So unlike BLEU scores, BERT scores use semantic matches for the text, and we also use a different algorithm for comparing. Instead of n-gram precisions, we take these maximum similarities and combine them, often with importance weighting, to calculate the BERT score. We load the BERT score module and then we can call it with a prompt and response.
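As a rough sketch, with placeholder strings rather than rows from our dataset and `lang="en"` standing in for whatever model type the course selects, that call might look like this:

```python
# Minimal sketch: computing a BERT score with Hugging Face's `evaluate` library.
import evaluate

bertscore = evaluate.load("bertscore")

# Placeholder prompt/response pair (not the actual course data).
prompt = "Approximately how many atoms are in the known universe?"
response = "There are roughly 10^80 atoms in the observable universe."

result = bertscore.compute(
    predictions=[response],
    references=[prompt],
    lang="en",  # assumption: lets the library pick a default English model
)

# Each of these is a list with one value per prediction/reference pair.
print(result["precision"], result["recall"], result["f1"])
```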
First, we'll do this with just one row of the data. I grabbed row two here for a particular model type. Okay, so our results are a precision, a recall value, and an F1 score. For those who are unfamiliar, the F1 score is the harmonic mean of precision and recall.

Let's go ahead and create a new metric for BERT scores. First, we'll add our decorator. Then we'll add our new BERT score function, and we'll make sure to return a list of the F1 scores as our metric. You might notice that the implementation here is quite different: the BERT score function takes in lists of predictions and lists of references in a different way than the BLEU score does. Let's visualize this new metric. You can see that the BERT score distribution looks quite different from the BLEU score distribution. This one looks much more like a bell curve, with the highest-frequency values in the middle.

So now let's look at some of the queries that give us low BERT scores. If we have a low BERT score, we're more concerned about the response being a hallucination, because the prompt is different from the response, at least according to this metric. We can also see a couple of flaws with using a score like BERT score for finding hallucinations. One that exists for a lot of these metrics shows up in row 48 here: we have a prompt with many words and a response with a single word. Even though the word moo is probably similar semantically to the word cow in some ways, the full prompt differs quite a bit from the response alone. In another example at the bottom, the prompt is very short, hello, and the response is "how can I assist you today?". This is a perfectly valid way of responding to hello, but because the topics of the prompt and response are different, it comes out as a low BERT score.

Now, let's check out the evaluation for our BERT score metric. We're going to use a little bit of code here, also from whylogs, to translate it into a form that we can threshold. udf_schema captures all of the metrics that we've created and registered as UDFs. Then we apply them to our data, creating a new pandas data frame that we'll name annotated_chats. This isn't always necessary for profiling your data, but it's helpful in our case because we want to threshold these scores for our evaluation. So this is our evaluate_examples helper function that we've used before, and now we want to filter our annotated_chats with a threshold of our choosing. I'm going to use the response.bert_score_to_prompt column, and because it's a little long, I'll push it onto the next line. Now what do we compare it with? Let's start with a threshold of 0.75, keeping rows with scores less than 0.75. Remember that a low BERT score means the prompt and response are not similar, so we're more concerned that the pair may represent a hallucination on the part of the LLM. That filtered set is what we pass in to our evaluate_examples helper function. The last thing I'll do here is pass in a scope: while originally we looked at all of the different types of issues, now we're really focusing on hallucinations. Okay, let's run it and see how well we did. Now let's try another run with a different threshold. You can go back and look at the visualization to find an interesting spot; I'm going to stick with 0.6.
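Pulling those pieces together, here's a rough sketch of how the BERT score metric and the thresholding step might look with whylogs' experimental UDF API. The metric name `response.bert_score_to_prompt` and the 0.75 cutoff follow the walkthrough above; the `chats.csv` filename, the `lang="en"` choice, and the commented-out `evaluate_examples` course helper are assumptions about the notebook environment.

```python
# Rough sketch: registering BERT score F1 as a whylogs metric, then thresholding it.
import evaluate
import pandas as pd
from whylogs.experimental.core.udf_schema import register_dataset_udf, udf_schema

bertscore = evaluate.load("bertscore")
chats = pd.read_csv("chats.csv")  # assumption: the course's prompt/response dataset

@register_dataset_udf(["prompt", "response"], "response.bert_score_to_prompt")
def bert_score(text):
    # `text` provides the selected columns; return one score per row.
    result = bertscore.compute(
        predictions=list(text["response"]),
        references=list(text["prompt"]),
        lang="en",
    )
    return result["f1"]

# Apply every registered UDF to the dataframe, adding one column per metric.
annotated_chats, _ = udf_schema().apply_udfs(chats)

# Rows with a low BERT score are the ones we flag as possible hallucinations.
suspect = annotated_chats[annotated_chats["response.bert_score_to_prompt"] < 0.75]
# evaluate_examples(suspect, scope="hallucination")  # course helper, assumed
```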
So now we'll move on from comparing the prompt and response to comparing multiple responses given by an LLM for the same prompt. One place this became popular is the SelfCheckGPT paper, which compares the response to multiple other responses using a number of metrics, including the ones we've just used, like BLEU score and BERT score, as well as others. To use this multiple-response paradigm, we need to download some new data. Let's call this dataset chats_extended; it's in our chats_extended CSV, so I'll run this. We'll see that chats_extended has multiple columns now. We still have a prompt and a response, but we also have two more responses, response 2 and response 3, plus one more column that we'll use for our fourth metric.

For this metric, we want to look at sentence embedding cosine distance. For the BERT score, we calculated word embeddings for each word in the prompt and response; now we want to graduate to sentence embeddings. We don't have to pass in just a single sentence, we can pass in multiple sentences, so we'll pass in our responses. To calculate sentence embeddings, we'll use a particular model. Let's import the sentence-transformers package to do so. Next we need to choose our model. We'll use sentence-transformers, which is open source and free; you can choose any model, and we'll pick one that's very popular with the package. To get a sentence embedding, all we need to do is call the model.encode method and pass in our sentence. We get back a long embedding. If we want to compare two embeddings, we'll need to calculate a cosine similarity between them. There are many ways to do this, but let's use a utility function from the sentence-transformers package.

Now, let's put in our decorator, where we're looking at the response and the two additional responses, response 2 and response 3. We'll create a metric called response.sentence_embedding_selfsimilarity. Our decorator needs a function, and we can name this function anything; the name won't be included in our metric. Inside of our function, we need to translate all of the text into sentence embeddings. We'll do this for the first response to get the response embeddings, and then two more times, once for response 2 and once for response 3. Now we can decide what we want to do here. We can capture pairwise cosine similarities between two texts, but when we have three we have to be thoughtful. What much of the literature does is compare the original response to each of the new responses: our original response is compared to response 2, and then to response 3. Finally, we can just return the average of the two.

Okay, now let's go ahead and run our function. Here we have all of our average self-similarity scores for the content of our chats_extended dataset. We see our response similarity metric has yet another distribution, different from the other two. This time it's left-tailed: many of the values fall between 0.7 and 1, with a couple of values less than that. This is encouraging. There aren't too many hallucinations out in real-world datasets, or in our dataset, so having just a few values on the left suggests that those small self-similarity scores might be true hallucinations. Now that we're comparing responses to other responses, the differences we capture are much more likely to be about the model; we'd always expect some differences between a prompt and a response.
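For reference, here's a minimal sketch of that self-similarity metric as a whylogs UDF. The model name `all-MiniLM-L6-v2` is an assumption (a popular default for sentence-transformers), and the metric name and column names `response`, `response2`, and `response3` are assumed to match the chats_extended CSV described above.

```python
# Minimal sketch: average sentence-embedding self-similarity across three responses.
from sentence_transformers import SentenceTransformer, util
from whylogs.experimental.core.udf_schema import register_dataset_udf

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: a popular small model

@register_dataset_udf(
    ["response", "response2", "response3"],
    "response.sentence_embedding_selfsimilarity",
)
def sentence_embedding_selfsimilarity(text):
    # Encode each column of responses into sentence embeddings.
    response_embeddings = model.encode(list(text["response"]))
    response2_embeddings = model.encode(list(text["response2"]))
    response3_embeddings = model.encode(list(text["response3"]))

    # Compare the original response to each alternative, row by row, then average.
    similarity2 = util.pairwise_cos_sim(response_embeddings, response2_embeddings)
    similarity3 = util.pairwise_cos_sim(response_embeddings, response3_embeddings)
    return ((similarity2 + similarity3) / 2).tolist()
```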
So while that prompt-to-response comparison is a good proxy, self-similarity across multiple responses is even better. Let's look at which examples have the lowest self-similarity. We'll use the same apply_udfs function to annotate our data frame with the self-similarity scores and the other scores that we've calculated, and take a look.

Our final metric under consideration is still response self-similarity, but this time we're going to use the LLM to evaluate itself. Instead of using a formula or model to calculate a score, we're going to send the three responses to an LLM. It can be either the LLM that made the original response, or a different LLM used solely for comparing the three responses. So instead of using sentence embeddings, we'll opt to send the three responses to a model to evaluate how similar they are; the model that does the similarity comparison doesn't have to be the same model that gave the three responses.

First, we'll see how to prompt the LLM for the similarity metric. We'll import OpenAI, then import our helper function, and add the OpenAI API key. Great. Now that we have the OpenAI key, let's look at a template of how we might call OpenAI. Here's the structure, and we want to replace this with the prompt that we can use for the comparison. Okay, so it's a pretty large prompt. The prompt we'll use asks about the first text passage, which is the first response: can the LLM rate the consistency of that text against the provided context, which is the other two responses? The use of the word consistency here is largely a choice. Another word might be similarity, but we find that consistency tends to be more about whether or not two sentences can logically be true at the same time. A closely related concept is entailment, which is about whether one sentence would logically entail another sentence being true.

You might notice that we have a couple of variables in our prompt, so let's take this prompt and put it into a function. I'm going to call this llm_self_similarity; it takes in the dataset, which should have the response, response 2, and response 3 columns, and an index. Okay, let's run this for one of our rows of data. Turns out I didn't return anything, so let's go ahead and add a return statement. Now we see the OpenAI object that comes out, and it gives us exactly what we need. We have this JSON object, and from it we want to collect the content, which is the output from the model. One thing you'll find when you prompt an LLM for very strictly formatted information like this is that you won't always get the exact format you wanted. Sometimes you might get a number, but it comes with a full explanation. There are many different tools you can use to filter out these explanations; we won't go into that here. Since I've already done the work of calculating these values for you, we won't call our LLM repeatedly. Instead, we'll use the column that's located in our chats_extended dataset.

Now that we know this works, I suggest that you alter the prompt and see if you can create a similar metric or a better one. The way we've done it here isn't exactly how it's done in practice. We're asking the LLM to give a value between 0 and 1 related to the consistency and similarity of the text, and one thing that's difficult is getting a calibrated response when we ask an LLM for a number like this.
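Before coming back to that calibration question, here's a rough sketch of what such a function could look like. The prompt wording, the model name, the column names, and the use of the newer `openai` client interface are all assumptions; the course's own template and helpers may differ.

```python
# Rough sketch: asking an LLM to rate the consistency of one response against two others.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_self_similarity(dataset, index, model="gpt-3.5-turbo"):
    # Hypothetical prompt template; the course's wording differs in its details.
    prompt = (
        "You will be given a text passage and a context consisting of two other "
        "passages. On a scale from 0.0 to 1.0, rate how consistent the text is "
        "with the context, where 1.0 means fully consistent. "
        "Respond with only the number.\n\n"
        f"Text: {dataset['response'][index]}\n\n"
        f"Context:\n1. {dataset['response2'][index]}\n2. {dataset['response3'][index]}"
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The score (ideally just a number) is in the message content.
    return completion.choices[0].message.content
```

In practice you'd still want to parse and validate the returned string, since, as noted above, the model won't always follow the requested format.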
Coming back to that calibration question: if we ask for numbers between 0 and 1, it's really difficult to understand what a 0.5 or a 0.25 might mean, and those meanings might change with slight nuances in the response, or from prompt to prompt. One approach used in practice is to ask about specific sentences in the response: is this specific sentence, say the first sentence of our response, consistent with the whole second response? Another way you might change this prompt is, instead of asking for a number between 0 and 1, to try to calibrate by asking for categorical information, maybe high, medium, or low consistency.

Let's create a filter to look at self-similarity scores that are less than 0.8. We'll pass in as our variable the prompted self-similarity column, response.prompted_selfsimilarity. Okay, let's see what we get. We have prompts here such as this Discover credit card issue, where some of the responses give a format for the credit card and other responses give more details about the sorts of numbers you'll see. Actually, we see several examples like this where we're asking for some sample data, which makes sense, right? The sample data might differ from response to response. This last example is a good example of a hallucination. We asked to translate some code from Python to a made-up programming language, Parker. In one of the responses we get a refusal, "sorry, but I'm not able to provide that translation", but in the other responses we do get some code. Not surprisingly, that code looks very different across responses, because the language doesn't exist. You'll see the self-similarity score for these is 0.00, which seems fair.

Now we've explored all four metrics using different comparisons. Next we'll move on to Lesson 3, about data leakage and toxicity. See you there.