In this lesson, you'll build an evaluation framework to systematically measure performance. All right, let's go.

So first, what exactly is the goal of evaluating the model? Well, the first goal is to understand whether your LLM is really improving or not, and to have a quantitative answer to that question, a yes or no you can rely on. The second goal is to find where it's hallucinating, where it's still incorrect, where you can still improve it. What is the next frontier of what it can improve on? So evaluation is both about asking "did what I tried work?" and about asking "where can I take the model as I improve it?"

A good evaluation is quantitative: it gives you measurable scores you can actually compare, model A versus model B, so you can say one is good, one is better, and one is best. It should also point out what you can improve next. A good evaluation is not so far out there in the ether that it can't tell you what to improve in this next iteration; it sets up that next frontier correctly. And the last piece is creating an evaluation that is scalable and automated. Something magical about this new era is that you can use LLMs to help you with this, so you can programmatically evaluate these models and scale out to larger and larger evaluation datasets, as well as to more and more models.

So what should your initial evaluation dataset look like? I see a lot of people embark on epic journeys to create evaluation datasets, but I always encourage people to start small. Start with 20 examples, 20 to 100, and keep it small. The whole goal is to move very quickly, iterate very quickly, and improve one thing so that you see improvement in the model. And as the model improves, your evaluation set will adjust, so committing to something too large is not actually practical. Quality matters more than quantity: high-quality evaluation is more important, because then you can rely on your evaluations not being noisy. Iterative expansion is also extremely important. And finally, focus on areas of improvement, such as those hallucinations: where does the model almost get it right, but not quite? Where is it failing in a way you want it to improve on? Again, this is setting North Star goals for the model's next iteration.

Here's an example of an evaluation dataset of ours for the NBA dataset. As you can see, there's quite a good breadth in the questions being asked: "Who is a point guard for the Golden State Warriors?" "What is the number of players on the Chicago Bulls who are 25 years old or younger?" One good way to think about this is to get enough breadth to cover what someone actually using your LLM would need, because this is teaching your LLM the entire breadth of what it might see.

So what are some practical tips? That's all fine and good, you've set up your evaluation dataset, but how do you actually set up that next frontier I keep mentioning? Well, one good tip is to create or find the examples that are the easiest ones for the LLM but that it still fails. It's barely not there, and you want it to get there, so that's a very reasonable next frontier for the LLM to conquer.
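Before the next tip, here's a rough sketch of what a small evaluation set like the NBA one could look like in code. The table and column names are illustrative assumptions, not the lab's exact schema.

```python
# A minimal sketch of a small text-to-SQL evaluation set: each entry pairs a
# natural-language question with a reference ("gold") SQL query. The table and
# column names are illustrative, not the exact schema used in the lab.
eval_set = [
    {
        "question": "Who is a point guard for the Golden State Warriors?",
        "reference_sql": (
            "SELECT NAME FROM nba_roster "
            "WHERE TEAM = 'Golden State Warriors' AND POS = 'PG';"
        ),
    },
    {
        "question": (
            "What is the number of players on the Chicago Bulls "
            "who are 25 years old or younger?"
        ),
        "reference_sql": (
            "SELECT COUNT(*) FROM nba_roster "
            "WHERE TEAM = 'Chicago Bulls' AND AGE <= 25;"
        ),
    },
]

# Keeping the set this small (20-100 entries) keeps every iteration cheap to
# re-run, and new failure cases can be appended as they are discovered.
```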
The next approach is what I like to call an adversarial playground. Hook your LLM up to a playground and try to break it. Ask it questions and get it to break in your setting, so you can understand where the boundaries are. Oftentimes this is how you might expect your users to use your model as well, or you can have someone else try it out. Either way, it's a good idea to put the model through the adversarial playground.

And finally, set a next accuracy target for your LLM. What do I mean by this? Well, one good rule of thumb is to think through how many nines you actually want from your LLM: how accurate do you really need it to be? Then start with a really down-scoped, very focused evaluation dataset and climb from whatever the base accuracy is, maybe 20%, up through 30, 40, 50, 60, until you reach that 90, 95, or 99% accuracy, whatever the target is. Get there first, so you believe it can happen on a small evaluation set. Then make your evaluation harder, so the accuracy drops back down to 20%, and work your way back up to 90, 95, 99, whatever your actual accuracy target is.

So how do you actually score the output? I mentioned LLMs: you can get an LLM to output a score for what was generated previously. This is a separate LLM call, of course, and what you provide in the scoring prompt is the question and the generated answer, whatever the input and output are. One way to get back a reliably numerical score is to use structured output. You can specify that the score should be an int, a float, or a list of ints that need to match, say, a product ID, so the model is scored in a way that is quantitative and guaranteed to be quantitative. You can also, of course, calculate traditional exact matches between the output and the reference output in your evaluation.

As an example, you can see here that the system prompt says: compare the following two dataframes. They are similar if they are almost identical, or if they convey the same information about the NBA roster dataset. Respond with valid JSON: an explanation string, and then a similar bool. Responding with that bool acts like a score of zero or one. Then you give it both the reference dataframe and the generated dataframe, and you can see the outputs down here, with Steph Curry and salary. Finally you ask, "Can you tell me if these dataframes are similar?" and it says yes, these are similar.

So that's an example with SQL. Let's go one level deeper into SQL so you understand that, for this particular use case, you can get pretty deep into your application; evals are pretty application-specific. Here we're scoring the similarity of the generated SQL against the reference SQL, which is exactly what you just saw. The generated SQL can be a semantic match without being an exact match. To bring that example back: the salary and name columns are inverted, so an exact match on these two dataframes would have said false, when in reality this is totally fine as output for the SQL; it's correct. So LLMs essentially let you do fuzzy matches.

Now, it's also pretty valuable to leverage deterministic methods for evaluation, meaning methods that don't rely on an LLM at all: for example, you can execute the generated SQL against a database.
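To see concretely why a strict exact match can be too harsh in that inverted-columns situation, here's a small pandas sketch with illustrative values.

```python
import pandas as pd

# Reference result and generated result convey the same information,
# but the columns come back in a different order (illustrative values).
reference_df = pd.DataFrame({"NAME": ["Stephen Curry"], "SALARY": ["$51,915,615"]})
generated_df = pd.DataFrame({"SALARY": ["$51,915,615"], "NAME": ["Stephen Curry"]})

# A strict exact match fails purely because of column order...
print(reference_df.equals(generated_df))                        # False

# ...but reordering the columns shows the content is identical, which is the
# kind of "semantic match" an LLM judge (or a more forgiving deterministic
# check) should accept.
print(reference_df.equals(generated_df[reference_df.columns]))  # True
```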
Once you execute the generated SQL against the database, you can compare exact matches on the results, and execution also tells you whether a query is even valid at all; if it's invalid, it's obviously incorrect. One bonus, with the memory tuning you'll explore soon, is that you can actually get away with exact-match evaluation criteria if you teach the model to output a certain format, or certain types of exact matches, to make evaluation easier for yourself. So note that as you explore evaluation and iterate on the model, you can update what your evaluation criteria are; these iteration cycles go very much hand in hand.

Speaking of iteration: as you improve the model, you'll continue to expand that evaluation dataset. It will have more breadth, covering more of the things you find in the adversarial playground, and it will cover harder examples. You'll also want to improve your scoring mechanisms. Start with something easy, but keep improving how you evaluate the model, whether that's deterministic checks like exact matching and validating the SQL, or LLMs with different prompts that improve the output score. Maybe a single bool isn't the way to go; maybe you want to evaluate the SQL query across many different dimensions, not just whether the output is similar. Maybe you care about the efficiency of the SQL query as well. Keep adding harder and harder examples as you iterate; setting that next frontier is incredibly important. And then be rigorous. This is more of a process tip: be rigorous about tracking which models produced which results, because you're going to create a lot of different iterations, sometimes very small ones, and naming these things can be tough. It's tough in computer science and software engineering in general, but here it's incredibly important to keep track.

Cool. So now we'll move on to the lab, where you'll get to explore this firsthand. First off, let's load the API key. Next, we're going to create an evaluation dataset; it's already created for you in this case, so let's cat that file. Great, here are some results from it: asking what the 99th percentile salary in the NBA is, what the median weight in the NBA is, and so on. Let's take one of these questions, evaluate it, and see what the model actually says. So that was the evaluation set, and those are the correct examples; now let's see what the model says. So, question equals this. But first, before we do anything more with the models, we want to import lamini, and I'm going to import a couple of other util functions that we went over before: the prompt helper for Llama 3 and the helper that gets the schema. Then we instantiate the model, and let's make the system prompt together. So that's the system prompt, and then we make the full prompt here. Okay, so that's the median weight of the NBA. Now let's get the model to output the query, again with structured output here. Let's run that. Great, okay, so we have this generated query from the model. It definitely looks a little off, but let's see. Let's run it through the database engine: import all this stuff, connect to the database, and run it here. Does it work? No. Exactly, so it's invalid SQL. It's not going to run, but it's what the model generated.
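Here's a rough sketch of that generate-and-execute step. The llm_generate placeholder stands in for the actual structured-output client call used in the lab, and the nba_roster.db filename, system prompt wording, and WT column are assumptions for illustration.

```python
import sqlite3
import pandas as pd

def llm_generate(prompt: str, output_type: dict) -> dict:
    # Placeholder for the structured-output model call used in the lab; swap in
    # your real client. Hard-coded here so the sketch runs standalone: the
    # returned query is a stand-in with a deliberate syntax error (missing ')'),
    # mimicking the kind of invalid SQL a base model can produce.
    return {"sqlite_query": (
        "SELECT WT FROM nba_roster ORDER BY WT LIMIT 1 "
        "OFFSET (SELECT COUNT(*)/2 FROM nba_roster"
    )}

question = "What is the median weight in the NBA?"
system = "You are an NBA analyst writing SQLite queries against the nba_roster table."
prompt = f"{system}\n\nQuestion: {question}\nWrite a SQLite query that answers it."

# Structured output guarantees the response is a single SQL string, not free text.
generated_sql = llm_generate(prompt, output_type={"sqlite_query": "str"})["sqlite_query"]

# Run the generated query directly against the database; invalid SQL raises here.
con = sqlite3.connect("nba_roster.db")
print(pd.read_sql_query(generated_sql, con))
```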
So instead, I probably want to do this: have a try/except statement and see what the error is. A syntax error, okay. So that's an example of the model generating a SQL query that doesn't match the evaluation, the same example that's in the evaluation set, and you can catch that with a lot of different methods.

Okay, so you might be wondering: it didn't run, but maybe we didn't do enough prompting. Maybe we didn't string together more calls on the model when we could have. Let's try agentic reflection, which has been very popular, and see how well that works. So maybe instead the prompt could be a reflection: it's the question, which we have above, plus the query it generated. Let me just check, yep, the generated query, specifically that SQLite query. Great. Then we can say "this query is invalid", and since we were able to check that the query is invalid, let's actually give it the error: catch it and just paste the error into the prompt. Let's double-check there's no stray double quote. Cool, so we have that, and close that parenthetical: it cannot answer the question, right? "A corrected SQLite query:". Okay, let's make a prompt with that reflection prompt and take a look at what the output looks like. Looks reasonable. Great, so it still goes through. Let's generate again and see what the model outputs for the reflection query. And again, it is incorrect; it did the same thing. We can run the try/except again if we want, and on that reflection query it failed again.

Okay, so let's take a look at what the right answer would be. The corrected SQL, I have it here, is this statement. Let's take a look; that's the correct one, and when we run it through the SQL engine it's able to output a dataframe. Great. So this is actually exciting: this is a very difficult query to get right, even though it sounds really simple, and that makes it a good example to put in your evaluation dataset.

So next, let's evaluate much more comprehensively across many more data points for text-to-SQL on base Llama 3, and you're going to do this for every iteration of the model; this is how we're going to scale it now. Before we get into the next block of code: these 20 or so queries for this NBA evaluation dataset, the initial evaluation set, took about 20 minutes to write and get right, and the jump in accuracy was from something closer to 20 or 30% up to 75%. So it really is worth investing that time and energy. Later, you'll see a more intense, roughly hour-long workflow of updating that evaluation set, which improved the model accuracy from 75% to 95%.

So let's take a look at how to evaluate much more comprehensively. I'm going to load a lot of arguments and imports first that support us in doing all of this. There are a lot of classes, as you'll note, that you can take a look at, and in these args, just note that it's using Llama 3 8B and loading the gold test set from before. So let's run that, load the dataset, read it in, and go through a data point or two.
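Before walking through that data point, here's roughly what the reflection step from a moment ago looks like in code, continuing the earlier sketch; the prompt wording is approximate, not the exact text from the notebook.

```python
# Reflection: catch the database error and feed it back to the model along with
# the original question and the failing query, asking for a corrected query.
error_message = None
try:
    pd.read_sql_query(generated_sql, con)
except Exception as e:
    error_message = str(e)

if error_message is not None:
    reflection_prompt = (
        f"Question: {question}\n"
        f"You generated this SQLite query: {generated_sql}\n"
        f"It is invalid and fails with this error: {error_message}\n"
        "It cannot answer the question. Write a corrected SQLite query."
    )
    reflection = llm_generate(reflection_prompt, output_type={"sqlite_query": "str"})
    reflection_sql = reflection["sqlite_query"]
    # In the lab, this reflection attempt still comes back incorrect; the same
    # try/except check can be reused on reflection_sql to confirm that.
```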
So, a data point. If you look through all the data, I think the one from before was the one on weight; let me just make sure of that. Cool, there's the weight one. Okay, so that's the data point from before, and we saw that it produced an invalid generated SQL statement. So I'm going to pick a different data point in this dataset and take a look at this one: "Can you tell me how many players are in the NBA?" And the correct answer is 600. So let's run through that again: here's the question, let's run it through the model and see what it outputs. Okay, it outputs a SQL query here, and when we run it, this one actually does execute, even though it returns the wrong number. Right, it's the wrong number here. So some of these queries are valid and some are not, and really you want to wrap this up into some kind of function so you can evaluate systematically across all these data points. So you could set query_succeeded to False and do a try/except, and in this case the query is in fact valid. Then you want to do the same with the reference query. Hopefully the reference queries you've already checked are all valid, but you can run the reference SQL through as well: the reference SQL here is something you index out of the data point, and you run it like that. Great, and the database confirms the 600 there.

So what you want to do here is collect the generated SQL statement, the reference SQL statement, and whether the generated one was valid or not, and then wrap that all up in a runnable class. I'm just going to paste it here. There's a lot going on here, okay, a lot going on, but the nugget, the important part, is what you just went through. You run both the generated and reference queries and assess whether they're valid: here is running the reference, here is running the generated and seeing if it failed that try/except statement, and here is just making the prompt. So, all things you know; it's just wrapped up in a class. Okay, so that's running the query and understanding whether the SQL statement is valid or not, which is one deterministic way of understanding how the model is doing.

Another thing you can do is check whether the reference dataframe and the generated dataframe are actually the same. So I'm going to do an exact match, lowercased, and no, they're not the same, obviously, because the numbers are different. Next, you can also add an LLM judge of similarity; maybe they were similar in some other way. Let's look at the system prompt we're creating here: you're comparing the two dataframes, they're similar if they're almost identical or if they convey the same information, and the model is asked to respond with valid JSON, an explanation string and a similar bool. Next you add the two dataframes to the user prompt, dataframe one and two, the generated one and the reference one, and then you ask, "Can you tell me if these are similar?" instead of that perfect string match from before. Let's make that prompt and generate from the model, and note that the output type is what we're hinting at here: we want a bool out for similar, and we want the explanation first. This is actually worth a small note: this is how you do chain of thought using this strict output type.
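Here's a sketch of what that judge call might look like, again with the llm_generate placeholder standing in for the real structured-output client; the prompt text paraphrases the one shown in the lesson, and generated_df and reference_df are the two query results being compared.

```python
# LLM judge of dataframe similarity. Note the field order in output_type:
# the explanation comes before the boolean, so the model reasons first and
# only then commits to a verdict -- chain of thought via structured output.
system = (
    "Compare the following two dataframes. They are similar if they are almost "
    "identical, or if they convey the same information about the nba_roster "
    "dataset. Respond with valid JSON: an explanation string and a similar bool."
)
user = (
    f"Dataframe 1 (generated):\n{generated_df}\n\n"
    f"Dataframe 2 (reference):\n{reference_df}\n\n"
    "Can you tell me if these dataframes are similar?"
)
judgement = llm_generate(
    f"{system}\n\n{user}",
    output_type={"explanation": "str", "similar": "bool"},
)
print(judgement["explanation"], judgement["similar"])
```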
So you're asking the model to give an explanation first, to think about it, and then to produce the similar bool. Let's run that and take a look at the similarity output. Great: it has an explanation, and it says the generated number is not the same as 600, so similar is false.

Okay, so how would you update how you check whether something is similar or a perfect match? You could write a function that looks like this: either the results are an exact match, or the similar bool is true. Obviously it's false in both cases here, but that's how you can do it. Now let's wrap it all in a class. Again, this will look a little intimidating at first, but you've already seen all the important pieces: you're just pre-processing and post-processing those outputs, and the real nugget is right here, making the prompt, putting it through the model, and then checking if it matches, whether that's an exact match or the similar bool coming back true.

Great. So now you have this score stage, and you had that query stage above. How do you put that all together into your evaluation pipeline? Let's take a look at the code for that. It really is just wrapping everything in a pipeline, and this is what that pipeline could look like: you instantiate the query stage and the score stage, which is why it's valuable to wrap them up in classes, and then when you go forward, you first run the query stage, checking whether the SQL is valid, and then you score it with the LLM. I'm going to add some wrapper functions here to run that evaluation pipeline and save the results with a nice progress bar. And let's also save those evaluation results; that's incredibly important. Let's take a look at some of what I'm saving here and some pro tips you could adopt. A timestamp is extremely helpful in AI in general as you run different iterations, so you can go back to where you ran something and point to the right version, experiment name, and so on. Here it's checking whether the result was correct, and saving those results directly; this is just saving dictionaries. It's worth taking a look at everything being saved here, from the total size of the eval dataset to the percent of outputs with valid SQL syntax and the percent with a correct SQL query, so you understand how the model is doing along both of those dimensions.

Okay, so next is the exciting part: you'll actually get to run it. Instantiate those args from before, really just getting the model set up, load your test set, and run the evaluation. Let's take a look. Here are the results as it goes through, and you can see there are definitely some invalid SQL queries coming out; I'll just paste what that error is. With 20 results across your entire eval dataset, the total size was 20, the percent of valid SQL syntax was 55%, so 55% of the generated SQL was valid, and only 30% were correct. So as you can see, some of the SQL queries were valid, not terribly many but not terribly few either, and even fewer, obviously, were correct among the valid SQL queries. You can then take a look at the data/results directory, where there's a saved folder with your experiment arguments and results. And so this is just the start, and an exciting start.
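Putting the two stages together, the whole evaluation loop boils down to something like this structural sketch. The helpers generate_sql, try_execute, execute, and is_similar are hypothetical stand-ins for the query and score stages built above, and data/results mirrors the results directory mentioned in the lab.

```python
import json
import time

def run_eval(eval_set, results_path="data/results"):
    """Rough sketch of the evaluation pipeline: for each gold example, generate
    SQL, check validity by executing it, score correctness (exact match or
    LLM-judged similarity), then save per-example results plus summary stats.
    Helper names are illustrative; the lab wraps these steps in classes."""
    results = []
    for item in eval_set:
        generated_sql = generate_sql(item["question"])               # query stage
        valid, generated_df = try_execute(generated_sql)             # deterministic validity check
        reference_df = execute(item["reference_sql"])
        correct = valid and is_similar(generated_df, reference_df)   # score stage
        results.append({
            "question": item["question"],
            "generated_sql": generated_sql,
            "valid_sql": valid,
            "correct": correct,
        })

    summary = {
        # Timestamped experiment name makes it easy to track which model
        # iteration produced which results.
        "experiment_name": f"eval-{time.strftime('%Y%m%d-%H%M%S')}",
        "total": len(results),
        "percent_valid_sql": 100 * sum(r["valid_sql"] for r in results) / len(results),
        "percent_correct": 100 * sum(r["correct"] for r in results) / len(results),
    }
    with open(f"{results_path}/{summary['experiment_name']}.json", "w") as f:
        json.dump({"summary": summary, "results": results}, f, indent=2)
    return summary
```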
You'll actually get to improve on this in the next lab. In the next lesson, you'll learn how to fine-tune, specifically memory tune, these models, and how to do so very efficiently. So, very exciting to get to improve your models next.