Welcome to Pre-training LLMs, built in partnership with Upstage and taught by Upstage's CEO, Sung Kim, as well as Chief Scientific Officer, Lucy Park. Welcome, Sung and Lucy.

Thank you, Andrew. We are so excited to be here.

Pre-training a large language model is the process of taking a model, generally a transformer neural network, and training it on a large corpus of text using supervised learning, so that it learns to repeatedly predict the next token given an input prompt. This process is called pre-training because it is the first step of training an LLM, before any fine-tuning to have it follow instructions, or further alignment to human preferences, is carried out. The output of pre-training is known as a base model, and we'll cover both training from scratch, meaning from randomly initialized weights, as well as taking a model that's already been pre-trained and continuing the pre-training process on your own data.

Training a large model from scratch is computationally expensive, requiring multiple state-of-the-art GPUs and training runs that can last weeks or months. For this reason, most developers won't pre-train models from scratch; instead, they take an existing model and use either prompting or sometimes fine-tuning to adapt it to their own tasks. However, there are still some situations where pre-training a model may be required or preferred, and that's what Upstage has been doing for its customers.

That's right, Andrew. Our customers are pre-training models for various reasons. Some are building models for tasks in specific domains like legal, healthcare, and e-commerce. Others need models with stronger abilities in specific languages such as Thai and Japanese. Further, new training methods are making more efficient pre-training possible, like Depth Upscaling, which stacks copies of the layers of existing models to build larger models. For example, we trained our Solar model using Depth Upscaling. Because of these technology improvements, we are seeing more and more interest in pre-training.

Depth Upscaling creates a new, larger model by duplicating layers of a smaller pre-trained model. The new model is then further pre-trained, resulting in a better, larger model than the original. Our team at Upstage has empirically found that models created in this way can be pre-trained with up to 70% less compute than traditional pre-training, representing a large cost saving.
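To make the Depth Upscaling idea concrete, here is a minimal sketch of duplicating layers of a small Llama-style model to build a deeper one with the HuggingFace transformers library. The layer counts, the split points, and the use of a randomly initialized model as a stand-in for a pre-trained checkpoint are all illustrative assumptions, not Upstage's actual Solar recipe.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# A tiny 8-layer model standing in for a pre-trained checkpoint
# (in practice you would load real pre-trained weights instead).
small_cfg = LlamaConfig(hidden_size=256, intermediate_size=1024,
                        num_hidden_layers=8, num_attention_heads=8,
                        num_key_value_heads=8)
small = LlamaForCausalLM(small_cfg)

# A deeper 12-layer config with the same width.
big_cfg = LlamaConfig(hidden_size=256, intermediate_size=1024,
                      num_hidden_layers=12, num_attention_heads=8,
                      num_key_value_heads=8)
big = LlamaForCausalLM(big_cfg)

# Reuse the embeddings, final norm, and output head directly.
big.model.embed_tokens.load_state_dict(small.model.embed_tokens.state_dict())
big.model.norm.load_state_dict(small.model.norm.state_dict())
big.lm_head.load_state_dict(small.lm_head.state_dict())

# Fill the 12 layers with the small model's bottom 6 and top 6 layers,
# so the middle layers (2-5) of the original appear twice.
for i in range(12):
    src = small.model.layers[i] if i < 6 else small.model.layers[i - 4]
    big.model.layers[i].load_state_dict(src.state_dict())

# `big` is then further pre-trained on new data.
```

The main design choice is which layers to repeat; overlapping the bottom and top slices of the original model so that the middle layers appear twice, as sketched here, follows the pattern described in Upstage's Solar work.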
Whether pre-training is the right solution for your work depends on several factors: whether an existing model might already work for your task without pre-training, what data you have available, the compute resources you have access to, both for training and serving, and lastly, any privacy requirements you may have, which may also carry regulatory compliance obligations. So depending on the company or sector you work in, you might find yourself being asked to pre-train a model, or at least to consider doing so, at some point in your work.

In this course, you'll learn all of the necessary steps to pre-train a model from scratch, from gathering and preparing training data to configuring a model and training it. You'll start by looking at some use cases where pre-training a model is the best option to get good performance, and discuss the difference between pre-training and fine-tuning. Next, you'll walk through the data preparation steps that are required to pre-train a model. You'll explore how you can gather data from the internet or from existing repositories like HuggingFace, and then look at the steps to obtain high-quality training data, including deduplication, filtering on the length of text examples, and language cleaning.

After that, you'll explore some options for configuring your model's architecture. You'll see how you can modify Meta's Llama models to create larger or smaller models, and then look at a few options for initializing weights, either randomly or from other models. Lastly, you'll see how to train a model using the open-source HuggingFace library, and actually run a few steps of training to observe how the loss decreases as the training progresses. This course uses smaller models with just a few million parameters to keep things lightweight enough to run on a CPU, but you'll be able to use the code from the lessons to scale to both larger datasets and models, and also to train on GPUs.

Thanks, Lucy. This sounds like it will be very helpful for building intuition about when pre-training makes sense and what is required to carry it out.

Many people have worked to create this course. From Upstage, I'd like to thank Chanjun Park, Sanghun Kim, Jerry Kim, Stan Lee, and Yungi Kim, as well as their collaborator Ian Park. From DeepLearning.AI, Tommy Nelson and Geoff Ladwig also contributed to this course.

I'd like to reiterate that pre-training large models on large datasets is an expensive activity, with a minimum cost of maybe about $1,000 for the smallest models, up to tens of thousands or even hundreds of thousands of dollars for a model at the billion-parameter scale. So do be careful if you choose to try this out yourself. There are calculators, like one from HuggingFace that you'll see in the course, that can help you estimate the costs of your pre-training scenario before you get started. These can help you avoid unexpectedly large bills from your cloud provider.

But pre-training is a key part of the LLM stack, and whether you just want to build your intuition about LLMs, continue the pre-training of an existing model, or even try to pre-train something from scratch to compete on the LLM leaderboards, I hope you enjoy this course. So, let's go on to the next video and get started.
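As a companion to the data preparation steps described in this introduction, here is a rough sketch of deduplication, length filtering, and a crude language filter using the HuggingFace datasets library. The file name, thresholds, and character-class check are illustrative assumptions; the course lessons use their own values, and a real pipeline would use a proper language-ID model rather than a regex.

```python
import re
from datasets import load_dataset

# `raw_corpus.txt` is a hypothetical file of raw training text.
ds = load_dataset("text", data_files={"train": "raw_corpus.txt"})["train"]

# Length filtering: drop very short examples that carry little signal.
ds = ds.filter(lambda ex: len(ex["text"]) >= 100)

# Exact deduplication: keep only the first occurrence of each
# normalized text (run single-process so the set is shared).
seen = set()
def is_new(ex):
    key = hash(ex["text"].strip().lower())
    if key in seen:
        return False
    seen.add(key)
    return True
ds = ds.filter(is_new)

# Crude "language cleaning": drop examples that are mostly
# non-alphabetic, a stand-in for a real language-ID filter.
def mostly_text(ex):
    letters = len(re.findall(r"[A-Za-z]", ex["text"]))
    return letters / max(len(ex["text"]), 1) > 0.5
ds = ds.filter(mostly_text)

print(ds)
```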
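And here is a minimal sketch of the other half of the workflow: configuring a small Llama-style model from scratch and running a few steps of training with the HuggingFace Trainer, small enough for a CPU. The tokenizer choice, model sizes, file name, and hyperparameters are all illustrative assumptions rather than the course's exact settings.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          LlamaConfig, LlamaForCausalLM, Trainer,
                          TrainingArguments)

# Any tokenizer works for a toy run; gpt2's is small and easy to download.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token

# A small Llama-style model, randomly initialized.
config = LlamaConfig(vocab_size=tokenizer.vocab_size, hidden_size=128,
                     intermediate_size=512, num_hidden_layers=4,
                     num_attention_heads=4, num_key_value_heads=4,
                     max_position_embeddings=512)
model = LlamaForCausalLM(config)

# `clean_corpus.txt` is a hypothetical file of cleaned training text.
ds = load_dataset("text", data_files={"train": "clean_corpus.txt"})["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

# mlm=False gives causal (next-token prediction) labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(output_dir="tiny-pretrain", max_steps=20,
                         per_device_train_batch_size=4, logging_steps=5,
                         report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=ds,
                  data_collator=collator)
trainer.train()  # watch the logged loss fall over the first steps
```

Scaling this up is mostly a matter of swapping in a larger config, more data, and GPU-backed training arguments.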