and responses that we'll use throughout the course. You'll detect issues such as data leakage, prompt injections, and hallucinations with techniques that we'll explore in greater detail in later lessons. Let's take a look! So, what are we doing here? We're going to develop metrics to help us look for problematic prompts or responses in our LLM application data. It's almost like you're a professional fly catcher trying to find the right net to catch different types of flies or bugs. Some of those metrics will be very simple, built from scratch, because even simple metrics are used in practice. But in later lessons we'll also recreate some of the state-of-the-art metrics that have been developed in the last year or so. We'll use all of them to find rows in our data that contain issues, and then evaluate them to make sure we've captured the phenomena we're looking for.
We'll start with some setup. We'll import a helper module that I've created to give us some visualization and data exploration tools, as well as evaluation of our metrics. Next, we'll import pandas. I've created a dataset with user prompts and LLM responses and labeled them as normal or as having issues such as refusals, jailbreaks, hallucinations, toxicity, or data leakage. You'll notice that chats is in the folder above the current one we're in. Let's look at a couple of rows from our dataset. You'll notice the dataset has a prompt and a response column. These are prompts and responses collected from an LLM, specifically OpenAI's GPT-3.5 Turbo. We'll use these for our evaluation. The dataset isn't representative of typical traffic; it deliberately contains a lot of the special cases we're looking for. Because we can't see the full prompts and responses for some of the text, we'll use a pandas setting to display the full column width. Now we can see the full text.
We'll use whylogs, an open-source data logging Python library made to capture machine learning data. Let's import it. In order to see the visualizations together, we'll call why.init. The parameter is just there so you don't have to enter a username and password. For metrics specific to text and LLMs, we've released the open-source LangKit package, which runs on top of whylogs. Both LangKit and whylogs use a schema object that defines which columns to summarize and which metrics to calculate. We'll call this one our LLM schema. The LLM metrics use a number of language models, so we'll see those download. So now, let's log our data using our LLM schema. We use the why.log command. The first thing we want to pass in is our data: chats, our pandas DataFrame. Next, we want to give it a name, so I'll call this "LLM chats dataset". Finally, we want to pass in that schema, schema equals schema. Once the data is logged, we get a nice link to click and see the visualization, and we can confirm that we have 68 rows of data.
This is the Insights and Profiles page, where we can see a number of metrics that are automatically collected with the LLM metrics setting. When we click on Show Insights, we can see some helpful tips to understand our data better. For example, we see that we have at least one negative-sentiment prompt, and that we have pattern matches in our dataset that often relate to data leakage, such as mailing addresses.
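Before moving on to hallucinations, here's a minimal sketch of the setup and logging steps above. The chats file path is an assumption on my part, and the course's visualization helpers aren't shown; whylogs and LangKit's llm_metrics are the real open-source packages described above, but verify the exact calls against your installed versions.

```python
import pandas as pd
import whylogs as why
from langkit import llm_metrics  # registers LLM-specific metrics for whylogs

# The chats dataset lives in the folder above this one; the exact filename
# is an assumption here.
chats = pd.read_csv("../chats.csv")

# Display the full prompt/response text instead of truncated columns.
pd.set_option("display.max_colwidth", None)

# Start a whylogs session. The course passes an argument here so you don't
# have to enter a username and password; the plain call also works.
why.init()

# Build the LLM schema: it tells whylogs which columns to summarize and which
# metrics to calculate, and downloads the language models those metrics need.
schema = llm_metrics.init()

# Log the data, give the profile a name, and pass in the schema.
results = why.log(chats, name="LLM chats dataset", schema=schema)
```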
So, at a high level, a hallucination is just an LLM response that is either inaccurate or irrelevant. You may be familiar with cases where an LLM's responses are factually inaccurate, but even if an answer is correct, it may still be irrelevant. For example, if you ask an LLM for a cookie recipe and it gives you a recipe for a birthday cake, even if that cake recipe is correct, it's still a hallucination. Irrelevant responses often happen when you're asking the LLM something it doesn't know the answer to. Kind of like when I was a kid taking exams in school: sometimes when I forgot to study, I'd write long responses to the question with whatever I remembered from studying, even if it wasn't directly answering the question. Another quality of hallucinations is that they often look realistic. If an LLM outputs a bunch of nonsensical text, it's pretty uncommon to call that a hallucination. A hallucination looks readable and coherent, like it could be a valid response to the prompt. Hallucinations are really interesting because they're hard to measure, and there are many different ways people have proposed to measure them. We're only going to look at two in this course.
Right now, let's look at prompt-response relevance. A common way for practitioners to measure relevance is by looking at how similar the response from an LLM is to the prompt it was given. We use the cosine similarity of the sentence embeddings to do that in LangKit. We'll import the input_output module from LangKit. We'll use one of the helper methods to help us visualize this. First, we'll pass in our dataset, chats, and then we'll pass in the name of the metric that we want to use. In LangKit, this is response.relevance_to_prompt. Okay, so now we can see the distribution of this new metric we've calculated. Low scores near zero are more likely to be hallucinations. For now, we'll use this helper function, but later we'll dive into some of the methods that are used inside of it. This next helper function shows us a few of the examples that were most likely to be hallucinations. In row 48, we see an interesting example: the words cow and moo are similar and would be near each other in semantic space, but the response, "Moo," is only one word long. Cases like this are really difficult to capture with a lot of similarity metrics.
And this metric isn't foolproof. Semantic similarity is related to, but not the same as, relevance. Say we have a prompt like, "What happened to the Roman Empire under Julius Caesar?" If the LLM responds with, "The empire was Roman, Julius Caesar was a person in the Roman Empire, and Caesar salads are delicious," the response may be semantically similar because it uses a lot of related words, but it isn't really answering the question directly. So, that might be considered an irrelevant response, even though it looks similar to the prompt. But then there's also the opposite, right? Sometimes you'll ask a question, and a correct, good answer doesn't necessarily use the same language. For example, suppose I ask what noise a cow makes and request a one-word response, like we see in our dataset. If the LLM responds with "Moo," that's a great, relevant answer, even though it's not semantically similar to the prompt text.
Prompt-response relevance isn't the only metric we can use for hallucinations. In fact, we'll look at some more advanced and more recent approaches, such as response self-similarity, as in SelfCheckGPT, where we ask an LLM for multiple responses to the same prompt and compare the similarity between those responses. If an LLM says something different every time it's asked the same question, it's more likely to be hallucinating. We'll explore these approaches further in the next lesson.
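Here's a minimal sketch of the prompt-response relevance step we just walked through. Importing LangKit's input_output module registers the metric; the extract helper and the exact column name reflect recent LangKit releases as far as I know, so treat them as assumptions to check against your installed version.

```python
from langkit import extract, input_output  # input_output registers response.relevance_to_prompt

# Compute the registered LangKit metrics for every row; extract returns the
# dataframe with the metric columns appended (here, the cosine similarity of
# sentence embeddings between each prompt and its response).
annotated = extract(chats)

# Rows with scores near zero are the most likely hallucination candidates.
candidates = annotated.sort_values("response.relevance_to_prompt").head(10)
print(candidates[["prompt", "response", "response.relevance_to_prompt"]])
```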
So next, we're going to look at data leakage and toxicity. A common approach for data leakage is still string pattern matching with regular expressions, and it works well even in advanced applications. Phone numbers, email addresses, and other personally identifiable information tend to have a lot of structure that lends itself well to regex. We're going to import the LangKit metrics for data leakage. Now, we'll use the same helper function to visualize the metric they create. We see email addresses, phone numbers, mailing addresses, and Social Security numbers in our dataset. We can do the same for the response column; in the responses, we also see credit card numbers.
So now, let's move on to a different metric: toxicity. Toxicity can include a number of different things. The primary thing we think of is explicitly toxic language, such as slurs related to race or gender, profanity, and malicious wording. We'll use the same helper functions to visualize the toxicity metric for prompts. We can see that prompt toxicity is really long-tailed: most scores are at zero, and only a few rows have higher values. We see a similar trend for response toxicity.
You'll sometimes see an LLM respond with, "Sorry, I can't answer that," or, "I can't help you with that request." This is a refusal, where the LLM detects that the prompt may be asking it to do something it's not programmed to do, so it provides a non-response. There's a cat-and-mouse game where a hacker may try to get around these refusals with clever prompting, tricking the LLM into giving information it would normally refuse to provide. This type of prompting attempt is called a jailbreak. A jailbreak is a particular type of prompt injection; prompt injection refers to any prompt that tries to get the LLM to do something its designers did not intend it to do. After we've imported our injections module, we'll use our helper functions to visualize the metric. It's worth noting that the injection metric name will be upgraded to prompt.injection in future versions of LangKit. If you look at the distribution of the jailbreak scores, you'll see lots of values near 1 and near 0. That's because the model is pretty confident about many examples. This particular dataset over-represents jailbreaks for learning purposes; they would normally be very rare in real-world datasets. Now, let's look at the examples that are most likely to be prompt injections. We can see very complicated prompts here, prompts with lots of redirections, such as "I am a programmer" and "please answer in certain ways."
Now, we'll evaluate our metrics for security and data quality. As we build metrics in this course, we'll want to check how well we're doing at detecting problematic examples. To do this, I made a dashboard using whylogs that we'll use to see how well we're doing. To use it, we just pass in the examples that we believe are problematic. You can see here that we're still failing all of our objectives except for one: for our final objective, we just need fewer than five total false positives, and because we haven't passed in any data yet, we definitely haven't reached five. Now, we can try a simple metric that looks for certain words in the response, such as "sorry." First, we'll filter our chats down to those containing the word "sorry" and take a look. Which of the issues that we'll cover in the course do these examples look like? Now, let's pass them into our evaluator. We see that we've passed one of our constraints that we hadn't before.
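To recap the three metric families from this section in code, here's a sketch. Importing each LangKit module registers its metrics; the module and column names below match current LangKit releases as far as I know, but verify them against your installed version (and remember that injection is expected to become prompt.injection).

```python
from langkit import extract, injections, regexes, toxicity

# regexes    -> prompt.has_patterns / response.has_patterns
#               (name of the matched pattern group: email address, phone number,
#                mailing address, SSN, credit card number, ...)
# toxicity   -> prompt.toxicity / response.toxicity (higher = more toxic)
# injections -> injection (jailbreak / prompt-injection score)
annotated = extract(chats)

# Toxicity is long-tailed: most rows score near zero.
print(annotated["prompt.toxicity"].describe())

# Most likely prompt injections first.
print(
    annotated.sort_values("injection", ascending=False)[["prompt", "injection"]].head()
)
```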
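And the simple "sorry" filter itself can be written with plain pandas. The evaluate_examples name below is a placeholder I'm using for whatever the course's evaluation-dashboard helper is actually called.

```python
import helpers  # the course's helper module; the name is assumed here

# Flag responses containing the word "sorry" (case-insensitive) as candidate
# refusals.
sorry_chats = chats[chats["response"].str.contains("sorry", case=False)]

# Hand the flagged rows to the evaluation dashboard to see which objectives pass.
helpers.evaluate_examples(sorry_chats)  # hypothetical helper name
```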
We found all of the easy refusal examples just by searching for the word "sorry," but we still have more difficult examples to find using more advanced methods. I encourage you to try new filters to see if you can find other problematic examples in the data. For instance, you can try filtering for examples with long prompts, perhaps more than 250 characters long, as sketched below. Looking at the filtered chats, you may get an idea of which sort of issue this brings up. The next lessons are all about discovering and creating new metrics to identify these issues and get all of those tests green.
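As a starting point for that exercise, the length filter suggested above might look like this, using the same hypothetical evaluation helper as before:

```python
import helpers  # assumed name for the course's helper module

# Flag examples whose prompt is longer than 250 characters, then check in the
# evaluator which kind of issue these rows correspond to.
long_prompt_chats = chats[chats["prompt"].str.len() > 250]

helpers.evaluate_examples(long_prompt_chats)  # hypothetical helper name
```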