Now let's move on to how to perform inference with linear quantization. As we saw in the previous lesson, in a neural network you can quantize the weights, but you can also quantize the activations, and depending on what you quantize, the storage and the computation are not the same. If you only quantize the weights, the computation still uses floating point arithmetic, so float32, float16, or bfloat16, and note that you need to dequantize the weights to perform that floating point computation. If you also quantize the activations, you use integer-based arithmetic, but this is not supported by all hardware.

Now let's have a look at what happens when you quantize the weights in eight bits while the activations remain in 32 bits, also called W8A32, and let's see how to code a linear layer when the weights are quantized this way. For simplicity, the linear layer will be without bias, so the function name is quantized_linear_without_bias. This function takes several arguments: the input to the linear layer, the quantized weights, the scale, and the zero point. As we said, the activations are in float32, so the input dtype should be torch.float32, and the quantized weights should be torch.int8. Then all we need to do is first dequantize the weights: we cast the quantized weights q_w to torch.float32, multiply them by the scale, and add the zero point. Finally, we call the linear layer with the input and the dequantized weights, and we return the output.

Let's try this function on a simple example; you can follow along with the code sketch at the end of this lesson. The input will be the following, and we will have this as the weight of the linear layer. Let's quantize these weights using symmetric linear quantization: we call linear_q_symmetric on the weights, which gives us the quantized weights as well as the scale. Then all we need to do is to call quantized_linear_without_bias with the input, the quantized weights, the scale, and the zero point, which is equal to zero for symmetric quantization. Let's check the outputs, and let's also check what the float32 computation gives us by calling the linear layer with the input and the original weights. As you can see, the two outputs are quite similar.

That's it, thank you very much. In the next lesson, Younes will show you how you can build your own eight-bit quantizer with everything we learned here, and apply it to real models.
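To make the walkthrough concrete, here is a minimal, self-contained sketch in PyTorch of the W8A32 forward pass described above. The quantized_linear_without_bias function follows the steps from the lesson (cast to float32, multiply by the scale, add the zero point, then call the linear layer). The linear_q_symmetric helper is my re-sketch of a per-tensor symmetric quantizer like the one from the previous lesson, and the input and weight values are illustrative placeholders, not the exact ones shown on screen.

import torch
import torch.nn.functional as F


def linear_q_symmetric(tensor, dtype=torch.int8):
    # Sketch of a per-tensor symmetric quantizer: the scale maps the largest
    # absolute value onto the int8 range, and the zero point is implicitly 0.
    q_max = torch.iinfo(dtype).max
    q_min = torch.iinfo(dtype).min
    scale = tensor.abs().max().item() / q_max
    quantized = torch.clamp(torch.round(tensor / scale), q_min, q_max).to(dtype)
    return quantized, scale


def quantized_linear_without_bias(input, q_w, s_w, z_w):
    # W8A32: activations stay in float32, weights are stored in int8.
    assert input.dtype == torch.float32
    assert q_w.dtype == torch.int8
    # Dequantize the weights back to float32, then do a regular fp32 matmul.
    dequantized_weight = q_w.to(torch.float32) * s_w + z_w
    output = F.linear(input, dequantized_weight)
    return output


# Illustrative example values (placeholders for the ones shown in the video).
input = torch.tensor([1.0, 2.0, 3.0])
weight = torch.tensor([[-2.00, -1.13, 0.42],
                       [-1.51,  0.25, 1.62],
                       [ 0.23,  1.35, 2.15]])

# Quantize the weights symmetrically, then run the W8A32 linear layer.
q_w, s_w = linear_q_symmetric(weight)
quantized_output = quantized_linear_without_bias(input, q_w, s_w, 0)

# Reference float32 computation with the original weights.
fp32_output = F.linear(input, weight)

print(quantized_output)
print(fp32_output)  # the two outputs should be close to each other

Running this prints two vectors that differ only by a small quantization error, which is the comparison made at the end of the lesson.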