In this lesson, you will implement a technique called linear quantization. This is the most popular quantization scheme, and it is used in most state-of-the-art quantization methods. Then, you will apply linear quantization to real models using Quanto, a Python quantization toolkit from Hugging Face. Let's get started.

Quantization is the process of mapping a large set of values to a smaller set of values. There are many quantization techniques; in this course, you will focus on linear quantization. Let's look at an example of how to perform 8-bit quantization on a simple tensor, going from float32 values to int8 values. This will give you an intuition for how linear quantization works.

Let's take a look at this matrix of random numbers. The values are in float32. How do you convert the float32 weights to int8 weights without losing too much information? Well, let's try this. You can map the most positive number in this matrix, which is 728.6 in this case, to the maximum value that int8 can store, which is 127. Similarly, you can map the most negative number in this matrix, negative 184 in this case, to the minimum value that int8 can store, which is negative 128. You can then map the rest of the values following a linear mapping. You will see more of that math in an optional section at the end of the lesson, but for now, just assume it is a little bit of multiplication and addition. That's it: you managed to quantize the tensor. Next, you can delete the original tensor to free up space. You end up with the quantized tensor plus the two parameters, s and z, that you used to perform the linear mapping: "s" stands for scale and "z" for zero point. It looks like you saved a lot of space.

But one question remains: how do you go the other way, from the quantized tensor back to the original tensor in FP32? You can't recover exactly the original tensor, but you can perform the de-quantization following the same linear relationship that you used to quantize it. Again, you can see the details at the end of this lesson, but it's going to be some math. De-quantizing the minimum and maximum will give you these values, and if you apply the same linear mapping to the other numbers, you can de-quantize the whole tensor. As you can see, quantization results in a loss of information. Let's compare the original tensor and the de-quantized tensor. The de-quantized tensor is pretty accurate: the quantization errors are not zero, but they are not too bad either. Even though linear quantization looks very simple, it is used in many state-of-the-art quantization methods.
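If you want to see this mapping in code, here is a minimal sketch in PyTorch. The helper names (get_q_params, linear_quantize, linear_dequantize) are made up for illustration, and only the two extreme values, 728.6 and negative 184, come from the example above; the other entries of the tensor are arbitrary.

```python
# A minimal sketch of the linear mapping described above, using PyTorch.
# Only 728.6 and -184.0 come from the lesson's example; the rest is made up.
import torch

def get_q_params(r: torch.Tensor, dtype=torch.int8):
    # Map [r_min, r_max] onto the integer range of the target dtype.
    q_min, q_max = torch.iinfo(dtype).min, torch.iinfo(dtype).max  # -128, 127 for int8
    r_min, r_max = r.min().item(), r.max().item()
    s = (r_max - r_min) / (q_max - q_min)   # scale
    z = int(round(q_min - r_min / s))       # zero point
    return s, z

def linear_quantize(r: torch.Tensor, s: float, z: int, dtype=torch.int8):
    q = torch.round(r / s + z)
    return q.clamp(torch.iinfo(dtype).min, torch.iinfo(dtype).max).to(dtype)

def linear_dequantize(q: torch.Tensor, s: float, z: int):
    return s * (q.float() - z)

r = torch.tensor([[728.6, -184.0, 0.0], [42.5, -13.8, 310.2]])  # float32 tensor
s, z = get_q_params(r)
q = linear_quantize(r, s, z)        # int8 tensor
r_hat = linear_dequantize(q, s, z)  # back to float32, with some error
print(q)
print((r - r_hat).abs().max())      # quantization error: small, but not zero
```

With these values, 728.6 should map to 127 and negative 184 to negative 128, and each de-quantized entry should differ from the original by at most about half a scale step.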
Now, you will use Quanto, a Python quantization toolkit from Hugging Face, to quantize any PyTorch model using linear quantization. In this classroom, the libraries have already been installed for you. If you are running this on your own machine, you can install the Transformers library by running pip install transformers. Similarly, for the Quanto library, you need to type pip install quanto, and you also need to install PyTorch by typing pip install torch. Since the libraries have already been installed in this classroom, we don't need to run this cell, so I will just comment these commands out.

Now, let's load the model using a specific class from the Transformers library: the AutoModelForCausalLM class. We define the name of the model that we're going to load. EleutherAI is a non-profit research lab focused on interpretability, alignment, and ethics in artificial intelligence. Then, we use the from_pretrained() method to load the model. The first argument is the checkpoint. Optionally, you can also set the low_cpu_mem_usage argument to True so that the model loads more efficiently. The model we just loaded is the Pythia model from EleutherAI.

After loading the model, you will load the tokenizer as well. To do that, we import the AutoTokenizer class from the Transformers library and use its from_pretrained() method, passing the model name. The tokenizer transforms text into a list of tokens that the model is able to understand.

Now, let's check that the model is able to generate text. To do that, we will use the generate method. First, we define the text; we will choose something very simple, such as "Hello, my name is". Then, we pass this text to the tokenizer and set return_tensors to "pt", which stands for PyTorch, so that we get PyTorch tensors back. Finally, to get the outputs, we call the generate method. We pass the inputs into generate with a double star in front, since inputs is a dictionary of arguments. We also set the max_new_tokens argument to 10; this controls the number of new tokens the model generates, so with this setting the model can generate at most 10 new tokens. Right now, as you can see, the output is just a list of token IDs. To decode this list of integers, we use the decode method of the tokenizer. Optionally, we can also set skip_special_tokens to True so that no special tokens appear in the output. As you can see, the model generated the following text: ", and I am a newbie to this site".

Now, let's check the size of the model. Pythia here has roughly 400 million parameters, and since you loaded the model in float32, each parameter takes 32 bits, which is 4 bytes. So the model should take around 400 million times 4 bytes, which is about 1.6 gigabytes. Let's check that using the compute_module_sizes function that we already coded in the helper.py file. To import it, we just do from helper import compute_module_sizes, then call it on our model and look at the result. As you can see, it reports that the model size is around 1.6 gigabytes, just as we expected. Let's also have a look at the weights of one of the linear layers. As you can see, the weights are in FP32.
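Here is a sketch of those loading, generation, and size-checking steps. The exact checkpoint id is an assumption (EleutherAI/pythia-410m); the lesson only says it is a Pythia model from EleutherAI with roughly 400 million parameters. Also, instead of the course's compute_module_sizes helper from helper.py, the sketch estimates the size by summing the parameter sizes directly.

```python
# A sketch of the loading-and-generation steps above. The checkpoint id is an
# assumption; the lesson only identifies the model as an EleutherAI Pythia model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m"  # assumed checkpoint id
model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "Hello, my name is"
inputs = tokenizer(text, return_tensors="pt")           # PyTorch tensors
outputs = model.generate(**inputs, max_new_tokens=10)   # at most 10 new tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Rough model size: number of parameters times bytes per parameter (4 for float32).
size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Model size: {size_bytes / 1e9:.2f} GB")  # roughly 1.6 GB in float32

# The weights are still stored in float32 at this point.
print(next(model.parameters()).dtype)
```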
Now, let's quantize the model. To do that, you need to import two functions from the Quanto library, quantize and freeze, and we also need to import torch. Then, let's have a look at the architecture of the model. As you can see, the model has many layers, but the ones we are going to focus on are the linear layers: these are the layers we are going to quantize. To quantize the model, you just call quantize and pass the model. You also need to specify the weights; we want them quantized to the dtype torch.int8. And, if you remember, you can quantize the weights but also the activations. In this lesson, we will only quantize the weights, so we set the activations argument to None.

Let's check what happened to the model. As you can see, the linear layers were replaced by QLinear, quantized linear, layers. If we look at the weights of one of these linear layers, we see that they are still in FP32: the model is not fully quantized yet. For what you will do in this course, you don't need this intermediate state, but for more advanced use it is quite useful. If you are curious about when the intermediate state is used, please stick around for the optional section at the end of this lesson. Next, to get the quantized model, we just need to call freeze. Now, if you look at these weights, you can see that they are quantized in torch.int8, and we also have the linear quantization parameter, the scale, right here. In this case, you don't see the zero point because the zero point is set to zero.

Now, let's check the size of the model. As you can see, it's now only about a quarter of its original size. It's good that we managed to decrease the size of the model, but let's also have a look at its performance, to see whether there is any degradation. Let's do the same thing as earlier: call model.generate, then print the decoded output. As you can see, we get the same output. This is not an extensive test for performance degradation, but it's still good that we get the same result.
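Here is a sketch of that quantize-and-freeze flow, put together with the loading step so it runs on its own. The call shown follows the lesson's description, with weights set to torch.int8 and activations set to None; note that newer releases of the library ship as optimum.quanto and expect its own qint8 dtype instead, so the exact import path and dtype argument may differ depending on your installed version. The checkpoint id is the same assumption as before.

```python
# A sketch of weight-only quantization with Quanto, as described above.
# Import path and dtype follow the lesson; newer versions ship as optimum.quanto
# and use their own qint8 dtype instead of torch.int8.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quanto import quantize, freeze

model_name = "EleutherAI/pythia-410m"  # assumed checkpoint id, as before
model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Replace the nn.Linear layers with QLinear layers. This is the intermediate
# state: quantization parameters are attached, but the weights are still float32.
quantize(model, weights=torch.int8, activations=None)

# Freeze to actually store the weights as int8 together with their scales.
freeze(model)

# Quick sanity check that generation still works after quantization.
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```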
That's it for the required part of this lesson. All that's left is an optional discussion of the math behind the linear mapping and an explanation of the intermediate state in the Quanto library. If you're ready to move on to the next lesson, Younes will give you an overview of how quantization methods are applied to large language models.

The theory of linear quantization is very simple: it is based on a linear mapping. But first, let's have a look at the figure. On top is the number line of the original tensor, which can be in float32, going from r_min to r_max; below it is the number line of the quantized tensor, going from q_min to q_max. The formula for linear quantization is r = s * (q - z), where r is the original value, q is the quantized value, s is the scale, and z is the zero point. You can use this formula to de-quantize a quantized value, or invert it to quantize an original value. But one question remains: how do you get the scale and the zero point? To get these parameters, you need to look at the extreme values: r_min should map to q_min and r_max to q_max, which gives r_min = s * (q_min - z) and r_max = s * (q_max - z). After solving these two equations, you should get that the scale is s = (r_max - r_min) / (q_max - q_min) and the zero point is z = q_min - r_min / s, rounded to the nearest integer. Feel free to pause the video and take out a pencil and paper to derive the scale and the zero point yourself. But don't worry: remember, this is optional and not required to successfully complete the course.

Also, recall that the Quanto library creates an intermediate state after you call quantize, and you then call freeze to get the quantized weights. This intermediate state is useful for two things. First, if you decide to quantize the model's activations: when you run inference on a model by passing an input such as an image or a text, the activations of the model vary depending on the input. To get good linear parameters for the linear mapping of the activations, it really helps to know the minimum and maximum range of those activations. To do that, you can take some sample data that is similar to the data you expect to see and run inference on the model. This process is called calibration. Calibration is optional, but if you do it, you will get better quantized activations; a small, library-agnostic sketch of this idea appears at the end of this lesson. Second, the intermediate state is also useful when performing quantization-aware training. Quantization-aware training means that you keep the model in its intermediate state while you train it: the forward pass uses the quantized version of the weights, but the model still updates its original, unquantized weights during backpropagation. The goal of quantization-aware training is to better control how the model will perform once you quantize it by calling freeze.

Thanks for staying for the optional section. Next, Younes will give you an overview of how quantization methods are applied to large language models. Let's go on to the next video.
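As promised, here is a minimal, library-agnostic sketch of the calibration idea. It records the minimum and maximum of a layer's activations over a few sample batches with a forward hook, then turns that range into a scale and zero point using the formulas from this section. The RangeObserver class, the toy layer, and the random sample data are all made up for illustration; this is only the concept, not how Quanto implements calibration internally.

```python
# A library-agnostic illustration of calibration: observe the min/max of a
# layer's activations on sample data, then derive scale and zero point from
# that range. This is a conceptual sketch, not Quanto's implementation.
import torch
import torch.nn as nn

class RangeObserver:
    def __init__(self):
        self.r_min, self.r_max = float("inf"), float("-inf")

    def __call__(self, module, inputs, output):
        # Forward hook: update the running min/max of this module's output.
        self.r_min = min(self.r_min, output.min().item())
        self.r_max = max(self.r_max, output.max().item())

    def q_params(self, dtype=torch.int8):
        q_min, q_max = torch.iinfo(dtype).min, torch.iinfo(dtype).max
        s = (self.r_max - self.r_min) / (q_max - q_min)
        z = int(round(q_min - self.r_min / s))
        return s, z

# Toy layer and made-up calibration data, just for illustration.
layer = nn.Linear(16, 16)
observer = RangeObserver()
handle = layer.register_forward_hook(observer)

for _ in range(8):                    # "sample data similar to what you expect"
    layer(torch.randn(4, 16))
handle.remove()

scale, zero_point = observer.q_params()
print(scale, zero_point)              # linear parameters for the activations
```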