In this lesson, you will leverage the tools that you have just built in order to create your own quantizer that can quantize any model in eight-bit precision. This quantizer is modality agnostic, meaning you can apply it to vision, audio, text, and even multimodal models. Let's get started and quantize some models.

In this lesson, we will learn how to make our own quantizer to quantize any model in eight-bit precision using the per-channel linear quantization scheme that you saw in the previous lesson. We'll break the project down into multiple sub-steps. First, we'll deep dive into creating a W8A16 linear layer class, where "W8" stands for eight-bit weights and "A16" stands for 16-bit activations. We'll use this class to store eight-bit weights and scales, as you saw in the previous lesson. Then we will see how we can replace all instances of torch.nn.Linear layers with that new class, build a quantizer to quantize our model end to end, test our quantizer in many scenarios, and study the impact of eight-bit quantization on different models.

So let's start with the first subtask that we defined: building the W8A16 linear layer class. We'll break this task down into multiple subtasks as well. First, you will build a forward method, w8_a16_forward, that takes as input eight-bit weights, 16-bit inputs, scales, and an optional bias. Once you have built this method, the idea is to call it inside the linear layer's forward pass, passing the eight-bit weights of the linear layer, the input, the scales stored inside the layer, and the optional bias as well.

What the w8_a16_forward method will do under the hood is first cast the eight-bit weights into the same data type as the input. For example, if the input is in float16 or bfloat16, we cast the weights into that precision while keeping their values in the same range as before, between -128 and 127; we only change the data type of those weights to half precision so that it matches the data type of the input. Then we perform the linear operation, a classic matrix multiplication between the input and the casted weights, multiply the result by the scales, and optionally add the bias.

So let's get started. Let's first import the modules that we will use for implementing our method. I'm also defining some random inputs here: a random int8 matrix, random hidden states, random scales, and a random bias. The workflow is as follows: we first cast the weights into the same data type as the hidden states, then perform the matrix multiplication by calling F.linear from PyTorch, then multiply that with the scales and optionally add a bias term at the end of the operation.

Notice also that the weight matrix has the shape (output dimension, input dimension). When you perform the matrix multiplication between the weight matrix and the input hidden states, you get a tensor of shape (batch size, output dimension). So it's important that the scales have the same size as the output dimension of your weight matrix, and the same goes for the bias, so that the multiplication with the scales and the addition of the bias both broadcast correctly over the output.
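A minimal sketch of such a helper could look like the following; the function name, argument names, and the dummy shapes are my own choices for illustration, not necessarily the exact ones used in the course notebook.

```python
import torch
import torch.nn.functional as F

def w8_a16_forward(weight, input, scales, bias=None):
    # Cast the int8 weights to the input's dtype (e.g. float16/bfloat16);
    # the values stay in the [-128, 127] range, only the dtype changes.
    casted_weights = weight.to(input.dtype)
    # Classic linear op (input @ weight.T), then rescale per output channel.
    output = F.linear(input, casted_weights) * scales
    if bias is not None:
        output = output + bias
    return output

# Illustrative dummy tensors: int8 weights of shape (out=32, in=16),
# bf16 hidden states, scales and bias broadcastable over the output dim.
random_int8 = torch.randint(-128, 127, (32, 16), dtype=torch.int8)
random_hs = torch.randn((1, 16), dtype=torch.bfloat16)
scales = torch.randn((1, 32), dtype=torch.bfloat16)
bias = torch.randn((1, 32), dtype=torch.bfloat16)

print(w8_a16_forward(random_int8, random_hs, scales, bias).shape)  # torch.Size([1, 32])
```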
So let's wrap everything in a single method, and let's also quickly try it out, with and without bias. Great, it seems to work fine, so we can move forward with the next building block, which will leverage the method we have just created.

To continue building our linear layer class, we'll start implementing the init method of that class. Recall that for this linear layer we need to store the int8 weights, the scales, and the bias. Let's first implement the skeleton of the init method. It has to roughly match the signature of the init method of a torch.nn.Linear layer, so it takes the input features and output features in order to correctly initialize the weight matrix, a bias flag indicating whether the linear layer has a bias term or not, and a data type, which corresponds to the data type of the bias, because our weight matrix will have torch.int8 as its data type.

Here we're going to define our int8 weights, together with the scales that are stored in the linear layer. If you have done any PyTorch before this lab, you might directly try something like assigning a new attribute, int8_weights, as an nn.Parameter. The issue with this approach is that when you create an nn.Parameter, PyTorch expects to be able to compute gradients on it, and PyTorch can't compute gradients on int8 tensors yet, so you get an error if you try to initialize a dummy layer this way: "only Tensors of floating point and complex dtype can require gradients."

The right approach to store int8 weights, instead of saving the attribute as an nn.Parameter, is to call the method register_buffer. That way, instead of storing a parameter, we just store a buffer, meaning we don't need to compute gradients on the tensor, and we can initialize it with whatever dtype we want. If you try that out, it just works.

So let's continue designing our linear layer. We have our int8 weights, and we'll do the same thing for the scales, initializing them with the correct shape. We're also going to call register_buffer on the scales because, again, we're only doing simple inference here; we're not interested in training, so register_buffer is sufficient. Then we store an optional bias: if bias is set to True, we register a new buffer called bias; otherwise we set it to None.

Let's quickly try that out, create a dummy instance of the linear layer, and see if our attributes have been correctly saved. Perfect, all the expected attributes have the expected shapes: (output features, input features) for the int8 weights and (output features,) for the scales.

We can now move forward with the next task, which is building the forward pass of that class. We're going to reuse what we did before and call the method that we defined in the first subtask, simply passing self.int8_weights, self.scales, and self.bias. This method will do everything under the hood for us: it will take care of casting the weights into the correct dtype, multiplying everything with the scales, and optionally adding the bias to the result.
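Here is a sketch of how the whole class could look, assuming the w8_a16_forward helper sketched above; the class name and the shapes used for the random buffers are my own choices.

```python
import torch
import torch.nn as nn

class W8A16LinearLayer(nn.Module):
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        # register_buffer instead of nn.Parameter: int8 tensors cannot
        # require gradients, and we only need these tensors for inference.
        self.register_buffer(
            "int8_weights",
            torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8),
        )
        # One scale per output channel (per row of the weight matrix).
        self.register_buffer("scales", torch.randn(out_features, dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.randn((1, out_features), dtype=dtype))
        else:
            self.bias = None

    def forward(self, input):
        # Delegate to the helper: cast weights, matmul, rescale, add bias.
        return w8_a16_forward(self.int8_weights, input, self.scales, self.bias)
```

For example, calling `W8A16LinearLayer(16, 32)(torch.randn(1, 6, 16))` should return a tensor of shape (1, 6, 32).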
All right, so let's create a new module and some dummy hidden states with the shape (batch size, sequence length, hidden size), which should match the input shape expected by the layer. Perfect: in the output we still have batch size and sequence length, and instead of the input dimension we have the output dimension. Let's also check the data type. The dtype is float32, which is correct, because we initialized a random tensor and by default PyTorch initializes everything in torch.float32.

Great. Now that we have a working forward pass and a linear layer class with all the needed attributes, we need to build a quantize method that performs the linear quantization algorithm you saw in the previous lesson, so that the weights get correctly quantized. Right now everything is random, so if you just replaced all the layers with this linear layer, you would most likely get gibberish output.

To give you a better idea, once we have defined that quantize method, the workflow will be the following: you have your base model, let's say in half precision, so either FP16 or BF16; we loop over all the linear layers, replace them with our new linear class, and then call quantize, passing the old weights, in order to quantize them into int8.

So let's redefine our class and start thinking about the quantize method. As I said, the quantize method takes the original weights as input, quantizes them to int8 precision, computes the scales of the quantization, and then manually assigns the int8 weights and scales attributes to the computed values. Let's do that step by step. First of all, I recommend upcasting the weights to FP32 for stability, so we get the weights in FP32 first. Then we use the simple formula you saw in the previous lesson: take the absolute values of the weights, get the maximum along the last dimension, and divide by 127 to get the scales. We assign that to a scales variable and make sure the scales have the same data type as the input weights by calling .to() with the weights' dtype. To get the int8 weights, we just apply the formula from the previous lesson on linear quantization. This is per-channel linear quantization, since we take the maximum over the last dimension for each row. That's basically how you get the int8 weights. Then we simply assign self.int8_weights and self.scales to these tensors, and the forward pass stays the same.

Okay, so let's try that out. Let's first initialize a dummy module and print the int8 weights before quantizing, so we can compare the results afterwards. Let's also create some dummy random original weights; this random matrix will act as the original weights that we use to quantize our module. Now let's call module.quantize. Perfect. As you can see, the weight tensors are completely different, because we quantized the module with the actual weights, and the int8 weights are now between -128 and 127. Those values did not exist before, because the module had been initialized randomly, and since we're performing absmax quantization here, we always end up with these extreme values in the quantized int8 weights. We can also inspect the scales of our quantized module, which look like this.
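A sketch of that quantize method, attached to the class from the earlier snippet for convenience, together with the dummy test just described, might look like this; the 4-by-8 shape is only an example.

```python
import torch

def quantize(self, weights):
    # Upcast to fp32 for a numerically stable scale computation.
    w_fp32 = weights.clone().to(torch.float32)
    # Per-channel scales: max |w| along the last dimension, divided by 127.
    scales = w_fp32.abs().max(dim=-1).values / 127
    scales = scales.to(weights.dtype)
    # Linear (absmax) quantization: divide each row by its scale and round.
    self.int8_weights = torch.round(weights / scales.unsqueeze(1)).to(torch.int8)
    self.scales = scales

W8A16LinearLayer.quantize = quantize  # add the method to the class sketched earlier

# Quick check with dummy original weights of shape (4, 8).
module = W8A16LinearLayer(in_features=8, out_features=4)
print(module.int8_weights)  # random values from __init__

original_weights = torch.randn((4, 8), dtype=torch.bfloat16)
module.quantize(original_weights)
print(module.int8_weights)  # now the absmax-quantized version of original_weights
print(module.scales)        # one scale per output row
```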
Let's also quickly inspect the shape of the scales. You get a tensor of size four, which is the expected output dimension. We'll do the same for the int8 weights: (4, 8). So if we directly multiply the two tensors it won't work; we have to add a new dimension to the scales. All right, now let's compare that against our original weights. As you can see, at a quick glance the dequantized weights look pretty close to the originals. We can get a better idea by computing the quantization error: we subtract the original weights from the dequantized weights, take the absolute value, and average. This gives the average quantization error per element between the original weights and the dequantized weights. Perfect.
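Continuing from the snippet above, a sketch of the dequantization and the error computation could be:

```python
# Dequantize: multiply each int8 row by its scale; unsqueeze so the shapes
# broadcast as (4, 8) * (4, 1).
dequantized_weights = module.int8_weights * module.scales.unsqueeze(1)

# Average absolute quantization error per element.
print((dequantized_weights - original_weights).abs().mean())
```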