This lesson is all about quantizing models for additional performance improvements. You will quantize a model, improving performance by nearly 4x while also reducing model size by 4x. Let's get started.

First, I'd like to explain what quantization is and why you should consider quantizing your model. Quantization is the process of reducing the precision of your model to speed up computation and reduce model size. It can make your models four times smaller and up to four times faster. There are three main benefits of quantization. The first is reduced model size, which enables storage on devices that have limited capacity. The second is faster processing: because integer computations are cheaper, your models run faster. And finally, you get lower power consumption, which is crucial for battery-operated devices.

Let's now go one level deeper into what quantization is. Say you have a floating-point tensor that uses 32 bits per value. In this case you have eight values in total, each occupying 32 bits of space. Quantization can convert that into an integer representation that uses eight bits per value. Each value in the floating-point tensor can be represented as an integer and translated back and forth using just a scale and a zero point.

Let's learn what the scale and zero point are with a simple example. In this figure, the floating-point range is shown as the number line in red, and the integer range as the number line in blue. The floating-point range is typically much larger than the integer range, because it carries 32 bits of information while the integer range carries only eight. The scale and the zero point are the two quantities that let you translate a number from the floating-point range to the integer range and back. Consider the number 12.5, with the scale set to 2.35 and the zero point to 0.80. The quantize operation, which converts a number from the floating-point range to the integer range, divides the number by the scale and subtracts the zero point, so 12.5 divided by 2.35 gets stored as a 5. To convert the integer back into its floating-point representation, you perform the dequantize operation described on the right: it takes the 5, applies the zero point, and multiplies by the scale. Notice that there is an error here: the number 12.5, when converted to an integer, gave 5, which when converted back to floating point gave 11.75. This is called quantization error. The goal of the quantization process is to keep this error minimal when converting between the floating-point range and the integer range, and this can be done for a wide variety of applications.

Let's now learn how you can quantize, and what the ways are to get minimal quantization error. There are two main types of quantization. One is weight quantization, where you reduce the precision of only the model weights, so you optimize storage alone. The second is activation quantization, where you also apply lower precision to the activation values so that the entire inference can be accelerated using lower-precision arithmetic. Here's an example of what weight quantization does: we took a float32 tensor with eight values and converted it into an int8 tensor, which reduced the model size by 4x.
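To make the scale and zero-point arithmetic concrete, here is a minimal sketch in PyTorch of per-tensor quantize and dequantize helpers. It follows the convention described above (divide by the scale, apply the zero point, round to an 8-bit integer); the tensor values beyond 12.5, and the choice of scale and zero point, are purely illustrative and not taken from the notebook.

```python
import torch

def quantize(x: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    # Map float values onto the int8 grid: divide by the scale,
    # apply the zero point, round, and clamp to the int8 range.
    q = torch.round(x / scale - zero_point)
    return q.clamp(-128, 127).to(torch.int8)

def dequantize(q: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    # Map int8 values back to floats; the result is only an
    # approximation of the original tensor (quantization error).
    return (q.to(torch.float32) + zero_point) * scale

# Illustrative values in the spirit of the example above.
weights = torch.tensor([12.5, -3.1, 7.9, 0.4, -11.2, 5.6, 2.2, -0.7])
scale, zero_point = 2.35, 0.80

q = quantize(weights, scale, zero_point)       # stored as int8 (1 byte per value)
recovered = dequantize(q, scale, zero_point)   # float32 approximation
error = (weights - recovered).abs().max()      # worst-case quantization error
```

The int8 tensor occupies a quarter of the memory of the float32 original, which is exactly the 4x size reduction mentioned above; the `error` value is the quantization error the calibration process tries to minimize.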
In activation quantization, we take a function computed in floating point, like the sigmoid shown in red, and convert it to a function that can be computed with integer arithmetic, shown in blue. As you can see, there is an error in approximating the red function with the blue one, and the goal is to minimize this error. There are different flavors of weight and activation quantization, the most aggressive being W8A8, where you convert the weights to eight bits and the activations to eight bits as well. With W8A16 quantization, you convert the weights to eight bits but keep the activations in a 16-bit range; this gives you more room to play with when approximating the activations, while keeping the weights at eight bits. You can go even more aggressive, all the way down to W4A16, with weights in four bits and activations in 16 bits, which is extremely popular for large language models and generative AI applications.

There are two common ways to do quantization. One is post-training quantization and the other is quantization-aware training. Post-training quantization applies the quantization process after the model has been trained, using calibration, a process in which you learn the quantization parameters and minimize the quantization error with some sample data. Typically, you need a few hundred samples to calibrate your model well for post-training quantization. Quantization-aware training incorporates quantization into the training process itself, so that the model fundamentally learns values that work well in integer representation. This can provide more accurate models in scenarios where post-training quantization doesn't work.

The chart on the left describes the post-training quantization workflow. You start with a trained model in float32 precision and produce a quantized, integer representation of the model. To reduce the quantization error, you go through a calibration process with calibration data, typically a few hundred samples; this calibration minimizes the accuracy loss and produces the post-training quantized model. The quantization-aware training workflow is slightly different: you provide training data, and as part of the training process the model learns its weights directly in the integer representation, so quantization is incorporated into training itself.

Now let's see this in action in a notebook. In this notebook you will learn how to prepare the calibration dataset and how to prepare the model for quantization. You will perform post-training quantization, and you will validate the model for off-target accuracy as well as on-device performance. First, you will set up the calibration data. You will use the datasets package with the urban scenes data. The calibration data here is about 100 RGB images of urban scenes, and the input resolution of the network is 3x1024x2048. Once the dataset is downloaded, you will hold out a small portion for testing purposes. Let's see what one of the images in the calibration dataset looks like: this is an urban scene with some roads, some cars, a train, some buildings, and some people. Next, let's set up the calibration and inference pipeline. You will use the torchvision transforms package.
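A minimal sketch of such a preprocessing helper is shown below. It assumes a PIL image already at the network's 1024x2048 resolution; the notebook's actual function may differ in its details, and the file path in the usage line is hypothetical.

```python
import torch
from PIL import Image
from torchvision import transforms

def preprocess(image: Image.Image) -> torch.Tensor:
    # Convert the PIL image to a float tensor in [0, 1] with shape (3, 1024, 2048),
    # then add a batch dimension so the network sees (1, 3, 1024, 2048).
    to_tensor = transforms.ToTensor()
    tensor = to_tensor(image)
    return tensor.unsqueeze(0)

# Example usage (hypothetical file name):
# sample_input = preprocess(Image.open("urban_scene.png"))
```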
This particular function takes the image that you saw above and transforms it into a torch tensor using the ToTensor transform. Once it's converted to a torch tensor, it is expanded into a four-dimensional tensor of size 1x3x1024x2048. You also need to post-process the output of the model; that's done with the post-process function here, which takes the output tensor from the model, upsamples it to the original output size, and then overlays the image on top of the predictions so that you can view the colors appropriately.

Now that the pre-processing and post-processing are set up, let's load the model in floating point. We load it in PyTorch using the from_pretrained API. We then take the test sample, pass it through the model to get the output in float32, and apply the post-processing we just defined, so you can view the results inline in the notebook. As you can see, the output is the predictions overlaid on top of the original image: the light pink is the road, the darker pink the pavements, the red the pedestrians, the yellow the lights, and the blue the vehicles.

Now, let's prepare the model for quantization. To do so, we will use the Python package called AIMET. The quantization preparation process requires you to call AIMET's prepare_model function, which annotates all the floating-point arrays in your model, including the weights as well as the activations, and sets up integer versions of that same graph, so that everything is ready for the calibration process that comes right after.

Now let's perform post-training quantization on this model. We've written a simple function that passes the calibration data through the model and learns all the calibration parameters required for post-training quantization. This will take a few minutes, and compute_encodings is the main part of the computation: the compute_encodings function learns the correct zero points and scales for all of the parameters in the graph, so that the quantization error is minimal. Once the calibration process is complete, you can pass the same test sample through the fully calibrated and quantized model in your PyTorch environment and get outputs that are also fully quantized. The resulting output goes through the same post-processing function we wrote for the float32 version, and you can see the results inline. As you can see, the output predictions from the integer model match the float model quite well: the road is depicted here in light pink, dark pink for the pavements, red for the pedestrians, yellow for the lights, and blue for the other cars.
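For reference, here is a minimal sketch of what this prepare-and-calibrate flow can look like with AIMET. It is an illustration, not the notebook's exact code: the names `model`, `calibration_samples`, `preprocess`, and `test_image` are assumed to come from the earlier steps, and the 8-bit weight and activation settings shown here are one possible configuration.

```python
import torch
from aimet_torch.model_preparer import prepare_model
from aimet_torch.quantsim import QuantizationSimModel

# Assumes `model` is the float32 segmentation network loaded above, and
# `calibration_samples` is an iterable of preprocessed (1, 3, 1024, 2048) tensors.
model = model.eval()
prepared = prepare_model(model)  # annotate the graph for quantization

dummy_input = torch.rand(1, 3, 1024, 2048)
sim = QuantizationSimModel(
    prepared,
    dummy_input=dummy_input,
    default_param_bw=8,    # 8-bit weights
    default_output_bw=8,   # 8-bit activations
)

def pass_calibration_data(sim_model, samples):
    # Run calibration samples through the quantization-simulation model so that
    # compute_encodings can observe activation ranges.
    sim_model.eval()
    with torch.no_grad():
        for x in samples:
            sim_model(x)

# Learn the scales and zero points for all weights and activations.
sim.compute_encodings(pass_calibration_data,
                      forward_pass_callback_args=calibration_samples)

# Quantized-simulation inference, reusing the same pre-processing as the float model.
with torch.no_grad():
    quantized_output = sim.model(preprocess(test_image))
```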
As the final step, you will export the quantized model for the device. You will use the same quantize-and-export function that you used in the second lesson, with the Samsung Galaxy S23 device. The quantized model is submitted to the server and optimized for the device, a device is provisioned to measure performance, and a summary of that model's performance is returned to you. This function takes about a couple of minutes to run.

As you can see, the performance results show that this model ran on the device in about 6.4 milliseconds, roughly 4x faster than the floating-point model, and it ran entirely on the neural processing unit. Memory consumption was between 1 and 10 MB. The peak signal-to-noise ratio was a little lower than the floating-point model's, because the computation is done in integers; the model came in at about 33 dB PSNR, and as a reminder, anything above 30 dB is typically considered good.

The impact of quantization on both model size and performance is significant. The model size dropped from 55 MB to 13 MB, and the latency improved from 16.9 milliseconds to 4.6 milliseconds, giving you about a 3.7x speedup. These measurements were performed on the Samsung Galaxy S24.

This concludes the lesson on quantization. Here you learned how to take a model that was trained in the cloud in floating-point precision and quantize it into integer precision, producing a model that is up to four times smaller and up to four times faster. In the next lesson, you will learn how to take this model and build an end-to-end mobile application that you can deploy to do real-time segmentation. See you there.