In this lesson, you'll learn where fine-tuning really fits into the training process. It comes after a step called pre-training, which you'll go into in a little bit of detail, and then you'll get to learn about all the different tasks you can apply fine-tuning to. Alright, let's see where fine-tuning fits in.

First, let's take a look at pre-training. This is the first step, before fine-tuning even happens, and it starts with a model that's completely random. It has no knowledge about the world at all: all its weights, if you're familiar with weights, are completely random. It can't form English words; it doesn't have language skills yet. Its learning objective is next token prediction, or, in a simplified sense, just next word prediction. So the model sees the word "once," and we want it to now predict the word "upon." But at the start, the LLM just produces something like "sd!!!@", really far from the word "upon." That's where it begins.

What it's taking in and reading from is a giant corpus of data, often scraped from the entire web. We often call this data "unlabeled," because it isn't something we've structured by hand; we've just scraped it from the web. I will say it has gone through many, many cleaning processes, so there is still a lot of manual work in getting this dataset to be effective for model pre-training. This is often called self-supervised learning, because the model is essentially supervising itself with next token prediction: all it has to do is predict the next word, and there aren't really labels otherwise.

Now, after training, you see that the model is able to predict the word, or token, "upon." It has learned language, and it has learned a bunch of knowledge from the internet. It's fantastic that this process actually works, and it's amazing because all the model is doing is trying to predict the next token while reading the entire internet's worth of data.

Okay, maybe there's an asterisk on "the entire internet." The actual understanding of what data goes in is often not very public; people don't really know exactly what that dataset looks like for a lot of the closed-source models from large companies. But there's been an amazing open-source effort by EleutherAI to create a dataset called The Pile, which you'll get to explore in this lab. It's a set of 22 diverse datasets scraped from across the internet. In this figure you can see "Four score and seven years ago," which is Lincoln's Gettysburg Address; there's also a carrot cake recipe; there's medical text scraped from PubMed; and finally, there's code from GitHub. So it's a curated set of knowledge-rich datasets put together to infuse these models with knowledge.

Now, this pre-training step is pretty expensive and time-consuming. It's expensive precisely because it's so time-consuming to have the model go through all of this data and go from absolute randomness to understanding these texts: putting together a carrot cake recipe, while also writing code, while also knowing about medicine and the Gettysburg Address.
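To make that next-token prediction objective concrete, here's a minimal sketch of the loss computation, assuming PyTorch; the token ids, vocabulary size, and the random stand-in for a model are toy values for illustration, not the course's actual code:

```python
import torch
import torch.nn.functional as F

# Toy batch of token ids standing in for "Once upon a time ..."; both the
# ids and the vocabulary size are made up for this illustration
vocab_size = 512
tokens = torch.tensor([[101, 42, 7, 256, 9]])  # shape: (batch, seq_len)

# Next-token prediction: from the tokens up to position t, predict token t+1
inputs = tokens[:, :-1]
targets = tokens[:, 1:]

# Stand-in for an untrained LLM: random logits over the vocabulary, which is
# why a fresh model produces gibberish like "sd!!!@"
logits = torch.randn(inputs.shape[0], inputs.shape[1], vocab_size)

# Cross-entropy between the predicted next-token distributions and the actual
# next tokens; pre-training is driving this loss down over a huge corpus
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```

That single objective, repeated over an internet-scale corpus, is what turns random weights into a model that can complete "once" with "upon."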
Okay, so these pre-trained base models are great, and there are actually a lot of them that are open source out there. But they've been trained on these datasets from the web, and you might get behavior like the geography homework you see here on the left: "What's the capital of Kenya? What's the capital of France?", all in a line, without any answers. So when you then input "What's the capital of Mexico?", the LLM might just say, "What's the capital of Hungary?" As you can see, that's not really useful from the standpoint of a chatbot interface.

So how do you get it to that chatbot interface? Well, fine-tuning is one of the ways to get you there, and it should really be a tool in your toolbox. Pre-training is that first step that gets you the base model. When you then add more data in, though not actually as much data, you can use fine-tuning to get a fine-tuned model. And even a fine-tuned model can go through further fine-tuning steps afterwards, so fine-tuning really is a step that comes after. You can use the same type of data: you can scrape data from different sources and curate it together, which you'll take a look at in a little bit, so it can be this quote-unquote "unlabeled" data. But you can also curate data yourself to make it much more structured for the model to learn from.

One thing that's key, and that differentiates fine-tuning from pre-training, is that there's much less data needed. You're building off of a base model that has already learned so much knowledge and basic language skills, so you're really just taking it to the next level; you don't need as much data. This really is a tool in your toolbox.

If you're coming from other machine learning areas, where fine-tuning tends to mean discriminative tasks, maybe you're working with images and have been fine-tuning on ImageNet, you'll find that the definition of fine-tuning here is a little looser. It's not as well defined for generative tasks, because we are actually updating the weights of the entire model, not just part of it, which is often the case when fine-tuning those other types of models. We have the same training objective as pre-training here: next token prediction. All we're doing is changing up the data so that it's more structured, and the model can be more consistent in outputting and mimicking that structure. There are also more advanced ways to reduce how much of the model you update, and we'll discuss those a bit later.

So what exactly is fine-tuning doing for you? You're getting a sense of what it is now, but what are the different tasks you can actually do with it? One giant category I like to think about is behavior change: you're changing the behavior of the model. You're telling it, in effect, "we're in a chat setting right now; we're not looking at a survey." This results in the model being able to respond much more consistently. It means the model can focus better, which could be better for moderation, for example. And it's also generally just teasing out its capabilities: it becomes better at conversation, so it can now talk about a wide variety of things, versus before, when we would have had to do a lot of prompt engineering to tease that information out.
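To illustrate that "same objective, different data" idea, here's a tiny Python sketch; the example strings are invented and just echo the geography example above:

```python
# Pre-training data: raw, unlabeled text scraped from the web, which may be
# a run of questions with no answers at all
pretraining_example = "What's the capital of Kenya? What's the capital of France?"

# Fine-tuning data: curated into the structure we want the model to mimic,
# with a question followed consistently by its answer
finetuning_example = (
    "What's the capital of Mexico? "
    "The capital of Mexico is Mexico City."
)

# In both cases the model is trained with exactly the same objective,
# predicting each next token of the string; only the data's shape changed
for text in (pretraining_example, finetuning_example):
    print(text)
```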
Fine-tuning can also help the model gain new knowledge, and this might be around specific topics that aren't in the base pre-trained model. It might mean correcting old, incorrect information: maybe there's more recent information that you want the model to be infused with. And most commonly, you're doing both with these models: you're changing the behavior, and you want it to gain new knowledge.

Taking it a notch down, tasks for fine-tuning are really just text in, text out for LLMs, and I like to think about them in two categories. One is extraction: you put text in and you get less text out, so a lot of the work is in reading. This could be extracting keywords or topics; it might be routing, where based on the data coming in you route the chat to some API or elsewhere; and different agent capabilities fit here too. That's in contrast to expansion: you put text in and you get more text out, so I like to think of that as writing. That could be chatting, writing emails, or writing code. Understanding your task exactly, the difference between these two kinds of tasks, or which of multiple tasks you want to fine-tune on, is what I've found to be the clearest indicator of success. If you want to succeed at fine-tuning a model, get clearer on the task you want it to do. And clarity really means knowing what good output looks like, what bad output looks like, but also what better output looks like. When you know that something is doing better at writing code, or better at routing a task, that actually helps you fine-tune this model to do really well.

Alright, so if this is your first time fine-tuning, I recommend a few steps. First, identify a task by prompt-engineering a large LLM; that could be ChatGPT, for example. You're just playing with ChatGPT like you normally do, and you find some tasks it's doing okay at: not great, but not horrible either, so you know it's within the realm of possibility, but it's not the best, and you want it to do much better for your task. Pick that one task, and just pick one. Then get some inputs and outputs for that task: pairs where you put text in and get text out. One of the golden numbers I like to use is 1,000, because I've found that's a good starting point for the amount of data you need. Make sure these inputs and outputs are better than the okay results from that large LLM before; you can't necessarily just generate all of these outputs. Once you have those pairs of data, and you'll explore this whole pipeline in the lab too, you can fine-tune a small LLM on them, just to get a sense of the performance bump. That's what I recommend if it's your first time.

So now let's jump into the lab, where you get to explore the dataset that was used for pre-training versus for fine-tuning, so you understand exactly what these input-output pairs look like. Okay, so we're going to get started by importing a few different libraries, so we're just going to run that.
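Here's a sketch of what that first import cell might contain; the exact import list is an assumption based on the libraries used later in the lab:

```python
import itertools                    # to peek at a few examples from a streamed dataset
import json                         # for writing JSON lines files later

import pandas as pd                 # to read the JSONL fine-tuning data
from datasets import load_dataset   # HuggingFace datasets library
```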
The first library we're going to use is the datasets library from HuggingFace, which has this great function called load_dataset where you can just pull a dataset from their hub and run it. Here, I'm going to pull the pre-training dataset called The Pile that you just saw a bit more about, and I'm grabbing the split, train versus test. Very specifically, I'm passing streaming=True, because this dataset is massive and we can't download it without breaking this notebook, so I'm going to stream it in one piece at a time so that we can explore the different pieces of data in there. So just loading that up, and now I'm going to look at the first five examples, using itertools.

Great. Okay, so you can see that in the pre-training dataset there's a lot of data that looks scraped. This text says, "It is done and submitted. You can play Survival of the Tastiest on Android." So that's one example. Let's see if we can find another one here. Here is another: this is just XML code that was scraped. You'll also see article content, this topic about Amazon announcing a new service on AWS, and here's one about Grand Slam Fishing Charter, which is a family business. So this is just a hodgepodge of different datasets scraped from, essentially, the internet.

I want to contrast that with the fine-tuning dataset that you'll be using across the different labs. We're grabbing a company dataset of question-answer pairs, scraped from an FAQ and also put together from internal engineering documentation. It's called Lamini Docs; it's about the company Lamini. We're just going to read that JSON file and take a look at what's in there. Okay, so this is much more structured data, right? There are question-answer pairs here, and it's very specific to this company.

The simplest way to use this dataset is to concatenate these questions and answers together and serve that up to the model. So that's what I'm going to do here: I'm going to turn it into a dict, and then I'm going to see what concatenating one of these looks like. So that's just concatenating the question, and directly giving the answer after it, right here. Of course, you can prepare your data in any way possible. I just want to call out a few common ways of formatting and structuring your data: question-answer pairs, but also instruction-response pairs and input-output pairs, being very generic here. And, since we're concatenating anyway, it can also just be plain text like you saw above with The Pile.

Alright, so concatenating is very simple, but sometimes that's enough to see results and sometimes it isn't. You may still find that the model needs more structure to help with it, and this is very similar to prompt engineering, actually. Taking things a bit further, you can also process your data with an instruction-following prompt template, in this case a question-answering one. Here's a common template. Note the pound-pound-pound ("###") marker before the question: it can easily be used as structure to tell the model what to expect next; it expects a question after it sees that marker. It can also help you post-process the model's outputs even after it's been fine-tuned. So we have that there.
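Putting that narration together, the lab code likely looks roughly like this sketch; the dataset id "EleutherAI/pile", the filename lamini_docs.jsonl, and the exact template wording are assumptions based on the description above:

```python
import itertools

import pandas as pd
from datasets import load_dataset

# Stream The Pile from the HuggingFace Hub; streaming=True avoids having to
# download the full (very large) dataset into the notebook
pretrained_dataset = load_dataset("EleutherAI/pile", split="train", streaming=True)

# Look at the first five examples of scraped web text
for example in itertools.islice(pretrained_dataset, 5):
    print(example)

# The fine-tuning dataset: question-answer pairs about the company Lamini
instruction_dataset_df = pd.read_json("lamini_docs.jsonl", lines=True)
examples = instruction_dataset_df.to_dict()

# Simplest formatting: concatenate a question directly with its answer
text = examples["question"][0] + examples["answer"][0]
print(text)

# A common question-answering prompt template; the "###" markers give the
# model consistent structure and make outputs easier to post-process
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""
```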
So let's take a look at this prompt template in action and see how it differs from the concatenated question and answer. Here you can see the prompt template with the question and answer neatly placed. It often helps to keep the input and output separate, so I'm actually going to take the answer out here and keep them separated, because this makes it easy to use the dataset for evaluation and for when you split it into train and test sets. Now what I'm going to do is apply this template to the entire dataset: just running a for loop over it and hydrating the prompt, that is, adding each question and answer into the template with an f-string or .format in Python.

Alright, so let's look at the difference between the text-only version and the question-answer format. Cool. So here it's text-only, all concatenated, and here it's question-answer, much more structured. You can use either one, but of course I do recommend structuring it to help with evaluation.

That's basically it. The most common way of storing this data is in JSON lines files, .jsonl, where each line is a JSON object, and that's it, so we just write that to file there. You can also upload this dataset onto HuggingFace, shown here, because you'll get to use it later and pull it from the cloud like that.
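A sketch of those last steps, applying the template across the dataset, writing a .jsonl file, and optionally pushing to the Hub; the filenames and the repository name here are assumptions:

```python
import json

import pandas as pd

# Re-load the question-answer pairs (filename assumed, as above)
examples = pd.read_json("lamini_docs.jsonl", lines=True).to_dict()

# Keep input and output separate: the template carries the question only,
# and the answer stays in its own field for evaluation and train/test splits
prompt_template_q = """### Question:
{question}

### Answer:"""

finetuning_dataset = []
for i in range(len(examples["question"])):
    finetuning_dataset.append({
        "question": prompt_template_q.format(question=examples["question"][i]),
        "answer": examples["answer"][i],
    })

# JSON lines: one JSON object per line
with open("lamini_docs_processed.jsonl", "w") as f:
    for row in finetuning_dataset:
        f.write(json.dumps(row) + "\n")

# Optionally upload to the HuggingFace Hub so it can be pulled from the cloud
# later (requires authentication; the repo name below is made up)
# from datasets import Dataset
# Dataset.from_list(finetuning_dataset).push_to_hub("your-username/lamini_docs_processed")
```

Next, you'll dive into a specific variant of fine-tuning called instruction fine-tuning.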