Now, let's go even smaller and do per-group quantization. In per-group quantization we perform quantization on groups of n elements. Common values for n are 32, 64, or 128. Per-group quantization requires more memory than the coarser schemes, because we need to store one set of quantization parameters for every group. Let's say we want to quantize a tensor in 4-bit and we choose a group size equal to 32, using symmetric mode, so the zero point is equal to zero, and we store the scales in 16-bit floating point. That means we are actually quantizing the tensor in 4.5 bits: each element is stored in 4 bits, and for every 32 elements we also store one scale in 16 bits, which adds 16/32 = 0.5 bits per element, for a total of 4 + 0.5 = 4.5 bits per element.

Now let's jump to the code. For simplicity, we will restrict ourselves to the case where the tensor is of dimension two, and we will be using the symmetric mode. You don't need to pay too much attention to this code, since we will be coding it together in the notebook.

Now let's code it. We define the following function, linear_q_symmetric_per_group. It takes as arguments the tensor, the group size, and the dtype, for which we set the default value torch.int8. First, we need to get the shape of the tensor. Another restriction for this function is that we will be performing quantization along the rows. This is why we also need to make sure that the number of elements in each row is divisible by the group size. To check that, we use an assertion: the shape of the tensor along the rows must be divisible by the group size. Then, as I said, we restrict ourselves to tensors of dimension two, which we also check with an assertion.

Now all we need to do is to reshape the tensor so that we end up with rows of group_size elements each. To do that, we use the view method that we learned about. As you can see, each row now contains group_size elements, and we put -1 as the first argument so that PyTorch automatically infers the right size for the first dimension. And now, if you look at the tensor, we have the setup for performing per-channel quantization: we resized the tensor so that every row is one group, which means we can reuse the function we coded previously, linear_q_symmetric_per_channel. So we get the quantized tensor and the scale from linear_q_symmetric_per_channel, passing the tensor, the right dimension (dim=0, along the rows), and the dtype. After quantizing the tensor, we still need to reshape it back to its original shape, using the shape we stored at the beginning. To reshape the tensor, we call view and pass that shape. Then we can return the quantized tensor and the scale.

Now that we have coded per-group quantization, let's code the corresponding linear dequantization in order to verify our results. This function takes the quantized tensor and the scale, but it also needs the group size. First, we get the shape of the quantized tensor; that will be useful later. Then we reshape the quantized tensor so that each row contains only group_size elements. To do that, we pass -1 as the first value and group_size as the second one to the view method. Then we can reuse the linear_dequantization method we coded before, passing the quantized tensor, the scale, and the zero point; since we are doing symmetric quantization, the zero point is equal to zero. Then all we need to do is to reshape the dequantized tensor to the shape of the original tensor, which is stored in q_shape, and return the dequantized tensor.
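For reference, here is a minimal sketch of what these two functions might look like. It assumes the helpers from the previous lessons, linear_q_symmetric_per_channel and linear_dequantization; simplified versions of them are included so the snippet runs on its own, and the name linear_dequantization_per_group for the wrapper is my own choice for clarity. The exact names and signatures in the course notebook may differ slightly.

import torch

# --- helpers assumed from the previous lessons (simplified versions) ---

def get_q_scale_symmetric(tensor, dtype=torch.int8):
    # symmetric scale: map max(|tensor|) onto the largest representable value
    r_max = tensor.abs().max().item()
    q_max = torch.iinfo(dtype).max
    return r_max / q_max

def linear_q_with_scale_and_zero_point(tensor, scale, zero_point, dtype=torch.int8):
    rounded_tensor = torch.round(tensor / scale + zero_point)
    q_min, q_max = torch.iinfo(dtype).min, torch.iinfo(dtype).max
    return rounded_tensor.clamp(q_min, q_max).to(dtype)

def linear_dequantization(quantized_tensor, scale, zero_point):
    # r = s * (q - z), computed in floating point
    return scale * (quantized_tensor.float() - zero_point)

def linear_q_symmetric_per_channel(tensor, dim, dtype=torch.int8):
    # one scale per slice along `dim`; the zero point is 0 in symmetric mode
    output_dim = tensor.shape[dim]
    scale = torch.zeros(output_dim)
    for index in range(output_dim):
        sub_tensor = tensor.select(dim, index)
        scale[index] = get_q_scale_symmetric(sub_tensor, dtype=dtype)
    scale_shape = [1] * tensor.dim()
    scale_shape[dim] = -1
    scale = scale.view(scale_shape)  # reshape so it broadcasts along `dim`
    quantized_tensor = linear_q_with_scale_and_zero_point(tensor, scale, 0, dtype=dtype)
    return quantized_tensor, scale

# --- per-group quantization, as described above ---

def linear_q_symmetric_per_group(tensor, group_size, dtype=torch.int8):
    t_shape = tensor.shape
    assert t_shape[1] % group_size == 0  # rows must split evenly into groups
    assert tensor.dim() == 2             # we restrict ourselves to 2D tensors

    # reshape so that each row holds exactly group_size elements;
    # -1 lets PyTorch infer the first dimension
    tensor = tensor.view(-1, group_size)

    # every row is now one group, so per-channel quantization
    # along the rows (dim=0) does the job
    quantized_tensor, scale = linear_q_symmetric_per_channel(tensor, dim=0, dtype=dtype)

    # restore the original shape
    quantized_tensor = quantized_tensor.view(t_shape)

    return quantized_tensor, scale

def linear_dequantization_per_group(quantized_tensor, scale, group_size):
    q_shape = quantized_tensor.shape
    quantized_tensor = quantized_tensor.view(-1, group_size)

    # symmetric mode: the zero point is 0
    dequantized_tensor = linear_dequantization(quantized_tensor, scale, 0)

    # reshape back to the original shape
    return dequantized_tensor.view(q_shape)

Note that the scale comes back with shape (number_of_groups, 1), one scale per group, which is why it broadcasts cleanly against the reshaped tensor in both the quantization and dequantization steps.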
Now let's test our implementation. We will use a random test tensor of size six by six, and let's set the group size equal to three. We get the quantized tensor and the scale using the linear_q_symmetric_per_group function, passing the test tensor as well as the group size. Then, to verify our results, we dequantize the tensor using the linear dequantization function, passing the quantized tensor, the scale, and the group size. Finally, to get a summary of the quantization process, we pass the following arguments to the plot quantization error helper: the test tensor, the quantized tensor, and the dequantized tensor.

As you can see, if you look at the quantized tensor, every group of three elements along a row contains the maximum value, 127, because in symmetric mode the largest element of each group is mapped to the edge of the quantized range. It shows that we indeed managed to quantize this matrix three elements at a time along the rows: three elements here, three here, and so on. And you have the quantized tensor, as you can see on the right. You can also see that the quantization error tensor is very, very low and that the dequantized tensor is practically the same as the original tensor.

Let's also print the overall quantization error using the quantization error function, passing the test tensor and the dequantized tensor. And indeed, we get a very, very low quantization error.

Now is a good time to pause the video and try a couple of things. You can change the test tensor, or you can change the group size, and see what effect the group size has on the quantization error.
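If you want to reproduce this check outside the notebook, here is a sketch of the verification step. The plot quantization error and quantization error helpers are provided by the course notebook and are not shown here, so this sketch computes a mean squared error directly instead.

# hypothetical test mirroring the steps above
test_tensor = torch.rand((6, 6))
group_size = 3

quantized_tensor, scale = linear_q_symmetric_per_group(
    test_tensor, group_size=group_size)

dequantized_tensor = linear_dequantization_per_group(
    quantized_tensor, scale, group_size=group_size)

print(quantized_tensor)  # each group of 3 along a row contains a 127

# mean squared quantization error, standing in for the course's
# quantization error helper — expect a very small number
print((dequantized_tensor - test_tensor).square().mean().item())

Rerunning this with a larger group size (say, 6, the full row) should give a slightly higher error, since more elements then share a single scale.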