Fantastic. You've now seen all the steps required to pre-train our model. But once you've finished your training, how do you know how good your model is? In this final lesson, you'll explore some of the common evaluation strategies for LLMs, including some important benchmark tasks that are used to compare the performance of different models. Let's take a look.

As you know, you cannot improve performance if you cannot measure it. Evaluation is like a test. Sometimes taking a test is a pain, but it's necessary to know where you are doing well and where you need to improve. To evaluate your model, there are four common methods: tracking the loss, inspecting the model's outputs, comparing models, and using benchmark datasets. These are not one-time efforts, but a continual process.

The first method is observing the loss during training. If training is going well, the loss should decrease over time in a consistent way. If it's not, you might want to reconsider your training hyperparameters, especially the learning rate. You might also want to reconsider your model architecture. If you observe a flattening of the loss curve, also known as saturation, there could be a few problems. One is that your model may have reached its maximum learning capacity and cannot learn more from additional data. This can happen when training a small model. It could also indicate a problem with your training dataset, such as low-quality data examples. So definitely check your data if you are training a big model and see saturation.

When training an LLM, checking the outputs of the model yourself during training is really important. You can create checkpoints of the model periodically, say every 10,000 training steps, and make sure that the model output is what you expect. It should be getting better at generating text over time. Human evaluation is still really important for LLMs, so don't overlook this step.

Comparison with other models can also be very useful. When you have two results side by side, it is much easier to tell which one is better. There are some online tools available to compare models, including the Upstage Console and the LMSYS Chatbot Arena from Berkeley. You will see links to these in the notebook. This is one of the best ways to benchmark performance, but it does require human effort. One active area of research is using an LLM judge to decide which model's output is better, but that can be biased depending on how the judge model was trained.

The most common evaluation method is to use benchmark datasets. This is like having the LLM take a standardized exam. By testing all models on the same benchmarks, you get a fair comparison of the abilities and performance of different models. There are a few famous benchmark datasets like ARC, MMLU, HellaSwag, TruthfulQA, WinoGrande, and GSM8K. These measure the general abilities of LLMs, like reasoning, common sense, and mathematical skills. But depending on how you want to use the LLM, it may be more useful to measure conversational abilities. Benchmarks like MT-Bench, EQ-Bench, and IFEval have been developed for this purpose.

Ultimately, when evaluating the performance of an LLM, you have to make the decision that is best for your model and use case. Let's head to the last notebook, where you will learn how to download and run some of these benchmarks on your model.
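Before we open the notebook, here is a minimal sketch of the periodic output check described above, assuming your checkpoints are saved in Hugging Face `transformers` format. The checkpoint path, prompt, and generation settings are illustrative placeholders, not the course's actual code.

```python
# Spot-check a training checkpoint by hand: load it and sample a continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path to a checkpoint saved every 10,000 steps during training.
checkpoint_dir = "./output/checkpoint-10000"

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
model.eval()

# Any prompt that is representative of your use case will do.
prompt = "I am an engineer. I love"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Read the continuation yourself: it should look more fluent at later checkpoints.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Running a check like this at every saved checkpoint, with the same prompt each time, makes it easy to see whether the generations are actually improving over the course of training.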
In the notebook, we will use a popular open-source evaluation library called LM Evaluation Harness from the EleutherAI project. It is already installed in the learning environment, but if you need it in other environments, you can always run the install line of code shown there. The harness supports many, many tasks for evaluation, and among them, we will choose to evaluate TinySolar on the TruthfulQA MC2 task. The MC2 task is composed of multiple-choice questions developed by the University of Oxford and OpenAI, and is one of the evaluation tasks included in the Hugging Face Open LLM Leaderboard. The MC2 task works as follows: given a question and multiple true/false reference answers, the score is the normalized total probability assigned to the set of true answers.

We will run this evaluation on a CPU, as always, and run only five examples, so that we can finish our evaluation within ten minutes. Note that evaluation tasks take quite a long time, because the log-likelihood is calculated for every candidate answer. We'll speed up the video here for your convenience.

Now you're done with the evaluation. In this final table, you can see that the final score is around 0.4. Remember that our model is very, very small, so it is unfair to compare it with other models that have more than several billion parameters.

You can use the code here to run the whole evaluation suite. If you have trained your own model and want to be on the Hugging Face leaderboard, this custom function here will run all six tasks with the corresponding few-shot parameters (a hedged sketch of this kind of sweep appears at the end of this lesson). We're not going to run it here because it takes too much time for a short course. The few-shot parameter indicates how many examples are included in the prompt that the LLM responds to, so there will be 25 examples included for the ARC challenge. Note that these are fixed numbers for the Hugging Face leaderboard, so all models are tested in the same manner. I hope that if you train your own model, you'll evaluate it on these benchmarks so that the community can see how great your model is. Hope you have fun!

So in the notebook you saw a few ways to run benchmarks by hand. Luckily, there are tools available to make evaluation much easier. For example, Evalverse is an open-source project that we created at Upstage to support your LLM evaluations. We also use it internally at our company. Evalverse unifies various LLM evaluation frameworks using git submodules, and it is fairly easy to set up and run. You can run Evalverse on several models at once, and it will provide a comprehensive report of their performance with benchmark scores, rankings, and other criteria.

Lastly, you will probably fine-tune and align your model after finishing your pre-training so you can use it in real applications. Don't forget about assessing whether the model's behavior is in alignment with human values. Benchmarks for truthfulness, safety, fairness, and so on are also available and are still in active development. You can check out the link in the notebook for more information.
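To make the notebook workflow above a little more concrete, here is a hedged sketch of the TruthfulQA MC2 run and a leaderboard-style sweep using the harness's Python API. It assumes lm-eval 0.4 or later (`pip install lm-eval`); the model ID is a placeholder, and apart from ARC's 25-shot setting mentioned above, the few-shot values are commonly cited defaults rather than numbers taken from the course materials.

```python
# Sketch: run TruthfulQA MC2 on a small model with the LM Evaluation Harness,
# then a leaderboard-style sweep. Assumes `pip install lm-eval` (v0.4+).
import lm_eval

MODEL_ARGS = "pretrained=upstage/TinySolar-248m-4k"  # placeholder; use your own checkpoint

# Five examples on CPU, as in the notebook, so the run finishes quickly.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=MODEL_ARGS,
    tasks=["truthfulqa_mc2"],
    device="cpu",
    limit=5,
)
print(results["results"]["truthfulqa_mc2"])

# Commonly cited few-shot settings for the (v1) Open LLM Leaderboard tasks.
# Only ARC's 25 shots is stated in this lesson; treat the rest as assumptions.
leaderboard_tasks = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}

# The full sweep is far too slow for a CPU demo; shown here only for structure.
for task, n_shot in leaderboard_tasks.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=n_shot,
        device="cpu",
    )
    print(task, out["results"][task])
```

The sweep mirrors what the notebook's custom leaderboard function does in spirit: the same six benchmarks, each with its fixed few-shot count, so that every model is tested in the same manner.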