In this lesson, you will dive deep into the theory of linear quantization. You will implement from scratch the asymmetric variant of linear quantization, and you will also learn about the scale and the zero point. Let's get started.

Quantization refers to the process of mapping a large set of values to a smaller set of values. There are many quantization techniques, but in this course we will focus only on linear quantization. Let's have a look at an example. On your left, you can see the original tensor in torch.float32, and on the right we have the quantized tensor, stored in torch.int8 and obtained with linear quantization. We will see in this lesson how we get this quantized tensor, but also how we get back to the original tensor.

Let's have a quick recap on what we can quantize in a neural network. In a neural network, you can quantize the weights, that is to say, the neural network parameters. But you can also quantize the activations, the values that propagate through the layers of the neural network. And if you quantize a neural network after it has been trained, you are doing something called post-training quantization.

There are multiple advantages to quantization. Of course you get a smaller model, but you can also get speed gains from the reduced memory bandwidth and from faster operations, such as matrix-to-matrix and matrix-to-vector multiplication. We will see why this is the case in the next lesson, when we talk about how to perform inference with a quantized model. There are also many challenges to quantization. We will dive deep into these challenges in the last lesson of this short course, but for now I'm going to give you a quick preview of them.

Now, let's jump into the theory of linear quantization. Linear quantization uses a linear mapping to map a higher precision range, for example float32, to a lower precision range, for example int8. There are two parameters in linear quantization: the scale s and the zero point z. The scale is stored in the same data type as the original tensor, and z is stored in the same data type as the quantized tensor. We will see why in the next few slides.

Now let's check a quick example. Let's say the scale is equal to 2 and the zero point is equal to 0. If we have a quantized value q of 10, the dequantized value r = s(q - z) would be equal to 2 * (10 - 0), which is 20. If we look at the example we presented in the first few slides, we would have something like this: here we have the original tensor, and here we have the quantized tensor, where the zero point is equal to -77 and the scale is equal to 3.58. We will see how we get the zero point and the scale in the next few slides.

But first, we have the original tensor and we need to quantize it. So how do we get q? If you remember, the relationship is r = s(q - z), so to get the quantized tensor we just need to isolate q. First, we have r = s(q - z). We move s to the left side by dividing both sides by s, which gives r/s = q - z. Then we move the zero point to the other side by adding z to both sides, and we get q = r/s + z. As you know, the quantized tensor is in a specific dtype, which can be eight-bit integer, so we need to round that number: q = round(r/s + z). And the last step is to cast this value to the correct dtype, such as int8.
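To make the formulas concrete, here is a minimal sketch in plain Python of that round trip on the worked example above (the numbers s = 2, z = 0, q = 10 come from the slide; everything else is just illustration):

```python
s, z = 2.0, 0   # scale and zero point from the worked example
q = 10          # a quantized value

# Dequantize: r = s * (q - z) = 2 * (10 - 0) = 20.0
r = s * (q - z)

# Quantize: q = round(r / s + z) = round(20 / 2 + 0) = 10
q_back = round(r / s + z)

print(r, q_back)  # 20.0 10
```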
Let's code that. In this classroom, the libraries have already been installed for you, but if you are running this on your own machine, all you need to do is type the following command to install torch: pip install torch. Since in this classroom the libraries have already been installed, I won't be running this command, so I will just comment it out. Now, all we need to do is import torch.

Now, let's code the function that will give us the quantized tensor, knowing the scale and the zero point. We define a function called linear_q_with_scale_and_zero_point, where q stands for quantization. This function takes multiple arguments: the tensor, the scale, the zero point, and the dtype, which is equal by default to torch.int8. The first step is to get the scaled and shifted tensor, as you can see in the formula right here: r/s + z. So this tensor will be equal to the tensor divided by the scale, plus the zero point. Then we need to round the tensor, as you can see in the formula, so we create a variable, rounded_tensor, which will be equal to torch.round applied to the tensor we just computed. And the last step is to make sure that our rounded tensor is between the minimum quantized value and the maximum quantized value, and then we can finally cast it to the specified dtype.

Let's do that. First, we need to get the minimum quantized value and the maximum quantized value. To get the minimum quantized value, we use the torch.iinfo method: we pass the dtype that we defined in the arguments of the function, and to get the minimum we just access min. We do the same thing for the maximum value. Now we can define the quantized tensor, which will be equal to rounded_tensor.clamp(q_min, q_max), and we can cast this tensor to the quantized dtype we want, such as int8. And the last step is to return the quantized tensor.

Now that we have coded our function, let's test our implementation. We define a test tensor, the same tensor that you saw in the example on the slides, and we assign random values to the scale and the zero point, since we don't know how to compute them yet. So I'll just set the scale to 3.5 and the zero point to -70. Then let's get our quantized tensor by calling the linear_q_with_scale_and_zero_point function that we just coded, passing the test tensor and the scale and zero point we defined earlier. And now let's check the quantized tensor. As you can see, we managed to quantize the tensor, and the dtype of the tensor is indeed torch.int8.

So now that we have our quantized tensor, let's dequantize it to see how precise the quantization is. The dequantization formula is the one we saw in the slides, r = s(q - z), and we will use just that. To get the dequantized tensor, we will do scale * (quantized_tensor.float() - zero_point), because we need to cast the quantized tensor to a float. Otherwise, we will get weird behavior with underflows and overflows, since we would be doing a subtraction between two int8 integers. Let's check the results: we get the following values. But let's check what happens if we don't cast the quantized tensor to float. We get a different result: as you can see, where we had 686 before, we now have -210. Now, let's put this into a function called linear_dequantization.
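Putting those steps together, here is a minimal sketch of the quantization function described above. The function and variable names follow the transcript; the test tensor values are placeholders standing in for the tensor shown on the slides:

```python
import torch

def linear_q_with_scale_and_zero_point(tensor, scale, zero_point,
                                       dtype=torch.int8):
    # Scale and shift: r / s + z
    scaled_and_shifted_tensor = tensor / scale + zero_point
    # Round to the nearest integer
    rounded_tensor = torch.round(scaled_and_shifted_tensor)
    # Clamp to the representable range of the target dtype, then cast
    q_min = torch.iinfo(dtype).min
    q_max = torch.iinfo(dtype).max
    quantized_tensor = rounded_tensor.clamp(q_min, q_max).to(dtype)
    return quantized_tensor

# Illustrative stand-in for the tensor from the slides
test_tensor = torch.tensor([[191.6, -13.5, 728.6],
                            [92.14, 295.5, -184.0],
                            [0.0, 684.6, 245.5]])

# Hand-picked (non-optimal) scale and zero point, as in the lesson
scale, zero_point = 3.5, -70

quantized_tensor = linear_q_with_scale_and_zero_point(test_tensor, scale,
                                                      zero_point)
print(quantized_tensor)
print(quantized_tensor.dtype)  # torch.int8
```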
So, for the linear_dequantization function, we need to put as arguments the quantized tensor, the scale, and the zero point, and then we just return what we wrote above. As you can see on the right, you have the quantization error tensor. For some entries we have pretty small values, which shows that the quantization worked pretty well, but as you can see, we also have some pretty big values. To get the quantization error tensor, we just subtract the dequantized tensor from the original tensor and take the absolute value of the entire matrix. And at the end, as you can see, we end up with a quantization error of around 170. The error is quite high because in this example we assigned random values to the scale and the zero point. Let's cover in the next section how to find the optimal values.
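Here is a minimal sketch of the dequantization function and the error check, continuing with the names from the sketch above (test_tensor, quantized_tensor, scale, and zero_point):

```python
def linear_dequantization(quantized_tensor, scale, zero_point):
    # r = s * (q - z); cast q to float first so the subtraction
    # does not wrap around in int8
    return scale * (quantized_tensor.float() - zero_point)

dequantized_tensor = linear_dequantization(quantized_tensor, scale, zero_point)

# Without the float cast, the int8 subtraction overflows and the
# reconstructed values are wrong
wrong_dequantized = scale * (quantized_tensor - zero_point)

# Quantization error tensor: |original - dequantized|
quantization_error = (test_tensor - dequantized_tensor).abs()
print(quantization_error)
```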