In the last lesson, we learned about using LLMs through an API, but sometimes you need to train a completely new model or fine-tune an existing one. That's what we'll cover in this lesson, with a focus on debugging and evaluation. Let's see it in action.

Training LLMs from scratch can take a long time and cost a lot, and evaluating them is also complex and resource-intensive, so it's important to keep a close eye on the training process and use checkpoints to recover from unexpected problems. The dashboard gives you valuable information: it shows training progress and metrics, and lets you retrieve model checkpoints if you need them. Fine-tuning methods let you refine an LLM more economically, even with limited computing power, but you still need to be careful during evaluation. Depending on what you want to achieve with the LLM, you might need to come up with specific evaluation strategies.

Let's look at the code together. Here we'll show you how to fine-tune a language model using Hugging Face. To do this efficiently on a CPU, we'll use a small language model called TinyStories, which has 33 million parameters. We'll fine-tune this lightweight model on a dataset of character backstories from the Dungeons and Dragons gaming world.

As usual, we start with imports and login, then set the model checkpoint to pull. We'll pull a dataset from the Hugging Face Hub. Looking at an example, we can see the dataset has two columns: a text column that asks the model to generate a backstory, and a target column that holds the character's backstory. We'll set up the dataset split so we have a validation set. Then, before training the model, we'll combine and prepare the instructions and stories, making sure they're tokenized and padded. We'll also create labels, which are identical to our inputs. Labels need to be shifted by one position, because the model should predict the next token in a sequence, and Hugging Face handles that shift for us.

Now let's try generating a sample to make sure everything is working. When we decode the output, we'll see the instruction, followed by the generated backstory. If it all looks good, we can proceed. Before we get into model training, let's check out an example from that dataset to see what it looks like. Here it takes the character name Mr. Gale and the character race Half-Orc, and gives us this output story: "Growing up as the only Half-Orc in a small rural town was rough." That's looking good, and it seems like a solid input for the model.

Now let's get into model training. We'll use the Transformers Trainer from Hugging Face and demonstrate its very simple integration with Weights and Biases. The model we create is for causal language modeling, which means an autoregressive language model architecture, similar to GPT; that just means predicting the next word in a sequence. We'll start a new Weights and Biases run, and this time the job type is going to be training. Next, we'll define some training arguments, like the number of training epochs, learning rate, and weight decay, and crucially, we'll set report_to="wandb". That means all of your results will stream to that same central dashboard, and it's all you need to do to start streaming metrics. Let's start training the model. The sketches below pull together, in code, the steps we just walked through.
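First, a minimal sketch of the data-preparation steps, assuming a TinyStories-33M checkpoint and a character-backstories dataset with text and target columns. The checkpoint and dataset names here are assumptions for illustration, so substitute the ones from your own notebook.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed names; swap in the checkpoint and dataset from your notebook.
model_checkpoint = "roneneldan/TinyStories-33M"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style models ship without a pad token

# Load the backstories dataset and carve out a validation split.
dataset = load_dataset("MohamedRashad/characters_backstories")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)

def tokenize_function(example):
    # Combine the instruction ("text") and the target backstory into one sequence.
    merged = example["text"] + " " + example["target"]
    batch = tokenizer(merged, max_length=128, padding="max_length", truncation=True)
    # Labels are a copy of the inputs; Hugging Face shifts them by one position
    # internally so the model learns to predict the next token.
    batch["labels"] = batch["input_ids"].copy()
    return batch

tokenized_dataset = dataset.map(tokenize_function, remove_columns=["text", "target"])
```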
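Continuing from that sketch, a quick smoke test that generates one sample before training; max_new_tokens and the other generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Generate from one instruction to confirm the pipeline works end to end.
prompt = dataset["train"][0]["text"]
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id
    )
# The decoded output shows the instruction followed by a generated backstory.
print(tokenizer.decode(output[0], skip_special_tokens=True))
```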
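And a sketch of the Trainer setup with the wandb integration. The project name and hyperparameter values are illustrative assumptions; report_to="wandb" is the one setting that turns on metric streaming.

```python
import wandb
from transformers import Trainer, TrainingArguments

# Start a new W&B run for this job; the project name is illustrative.
run = wandb.init(project="dnd-backstories", job_type="training")

training_args = TrainingArguments(
    output_dir="tinystories-backstories",
    num_train_epochs=1,      # a single epoch, for speed
    learning_rate=1e-5,      # illustrative hyperparameters
    weight_decay=0.01,
    logging_steps=10,
    report_to="wandb",       # stream every metric to the central dashboard
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)
trainer.train()
```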
Now, I don't want to wait for the training to finish to start seeing my results, so I'll scroll back up to the W&B run and click the link to see results live. Here, I can see the metrics streaming in over time, and training loss is the most interesting to me. This is the metric I'll watch over time. When debugging a model training run, it's useful to check that the loss keeps going down, so you want to see this curve go down and to the right. Some very large language models can take days or even weeks to train, so it's helpful to have a chart like this that you can watch remotely. This helps us make sure that the model continues to improve and that we're not wasting GPU resources.

Great, so training is finished. That was a very small model trained for only a single epoch for efficiency, so the results aren't going to be perfect. If you're interested, I'd encourage you to try to improve them, maybe by running for longer or tweaking the hyperparameters.

Now that training is complete, let's generate samples from the model. Back in the notebook, we define several prompts and use them to generate backstories for our characters. After this, we create a new table, and for each of the prompts we call model.generate. We can pass various parameters here, like top_p or temperature, to steer the model. We'll add our generations to the table, log it, and finish the run; a sketch of this flow appears at the end of the lesson.

Now let's look at the results in our dashboard. Here I'm looking at the results on the project page, and I can expand the table to see some samples. In the table, I can look at each prompt and the sample generated from it. This prompt is for a character named Frogger, and the generation says: "Frooborn is a small dragon who lives in the woods. His mother was a dragon, a small dragon, a small dragon, a small dragon, a large dragon, a small dragon." It seems like the model got a little stuck on the dragon idea. We have a Smarty character, and what happened to him? "He was born in the city of a wealthy merchant. He was a young boy, but he was not a good man." And finally, we have another example for Volcano, who's an android, and for this character the output is just "the tribe of the tribe of the tribe of the tribe." That doesn't seem very good as a backstory.

So you can see there are some issues with this small model, and that makes sense: we were optimizing for speed over performance. You can see here how important qualitative evaluation is when training generative AI models, because just looking at these outputs, you can tell whether or not the model is doing a good job. We encourage you to come up with metrics that are relevant for your specific use case, then implement them and log them along with the generations. For example, you could measure something like the number of unique words; in that last output, the model is really only using three words, "the tribe of," so that's probably not a very good output. Sketches of both ideas follow below. So the next time you train or fine-tune a model, we hope you can use these tools to get better results faster.
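For reference, here's a minimal sketch of the generation-and-logging flow, continuing from the training sketch above. The prompts, sampling parameters, and table layout are illustrative assumptions, not the notebook's exact values.

```python
import wandb

# Illustrative prompts matching the characters discussed above.
prompts = [
    "Generate a backstory for Frogger, a Dragonborn.",
    "Generate a backstory for Smarty, a Human.",
    "Generate a backstory for Volcano, an Android.",
]

table = wandb.Table(columns=["prompt", "generation"])
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.9,        # nucleus sampling: keep only the most likely tokens
        temperature=1.0,  # raise or lower to steer randomness
        pad_token_id=tokenizer.eos_token_id,
    )
    table.add_data(prompt, tokenizer.decode(output[0], skip_special_tokens=True))

# Log the table of generations and close out the run.
wandb.log({"generations": table})
wandb.finish()
```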
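And a sketch of the unique-words idea as a custom metric; the helper name unique_word_count is hypothetical. You could log its score as an extra column in the same W&B Table so it sits next to each generation.

```python
# A toy diversity metric: the number of distinct words in a generation.
def unique_word_count(text: str) -> int:
    return len(set(text.lower().split()))

# The degenerate output above uses just three words: "the", "tribe", "of".
print(unique_word_count("the tribe of the tribe of the tribe of the tribe"))  # 3
```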