Now, we tried everything on a dummy model, and we also want to test it on real use cases, meaning useful models. So let's test our implementation and our quantizer right away on models that you can find on Hugging Face Transformers. Let's get started. For this demo we're going to use a model called Salesforce CodeGen 350M mono. This is a language model that has been fine-tuned on code, and it has only 350 million parameters. Let's use transformers to load the model together with the tokenizer and get some generations. We'll use the text-generation pipeline: we load pipeline("text-generation") and pass the model and the tokenizer. Then, since it's a model that has been fine-tuned on code, let's battle-test it with a code completion task and see if we can generate some consistent text to print "Hello World" in Python. So the hello_world method prints "Hello World", which seems to be correct. The model has also suggested calling the method right after, and then it ended up with some comments, perhaps in Korean, but that's fine. Overall the model seems good. Also bear in mind that this is a small model, only 350 million parameters; you might get more impressive results with larger models. Out of curiosity, let's print the model before quantization. Then we're going to shrink the model, quantize it in eight-bit, and try to get some generations with the quantized model. So this is what the model looks like: you have a lot of linear layers because it's a transformer-based architecture, so most of the weights come from the linear layers of the model. Now let's call our quantization API, replace_linear_with_target_and_quantize, passing the model and the target class. As I said, we're not going to quantize the language model head, because the model is autoregressive: it uses the output from the previous iteration to get the output of the next iteration. If you quantize the language model head, a lot of errors might accumulate over the generation steps, and you will most likely end up with gibberish after some tokens. Also, this method modifies the model in place, so we're just going to inspect pipe.model and check that the model has been correctly quantized. We can confirm the model has been quantized by checking that the linear layers have been replaced with our W8A16 linear layers. We can also confirm that the language model head is still a torch.nn.Linear, which is expected and intended. Let's do some generation and see the results. Perfect. It's able to generate a correct method for printing "Hello World" in Python. It also tried to call hello_world in the main Python file, but somehow the model commented out the call to the method; I guess that's fine. For generative models, since the model generates output from past inputs (it's an autoregressive model), all the rounding errors can add up once you start generating a lot of tokens, until these errors get so large that they affect the model's performance. This particularly affects larger models, say those with more than 6 billion parameters. Quantizing LLMs without performance degradation is a whole exciting topic.
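To make the workflow above concrete, here is a minimal sketch. It assumes the W8A16LinearLayer class and the replace_linear_with_target_and_quantize(module, target_class, module_names_to_exclude) helper we built in the previous lessons are already defined in the notebook; the prompt and generation settings are just illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Assumed to be defined earlier in the notebook (from the previous lessons):
# - W8A16LinearLayer
# - replace_linear_with_target_and_quantize(module, target_class, module_names_to_exclude)

model_id = "Salesforce/codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Baseline generation before quantization
print(pipe("def hello_world():", max_new_tokens=20, do_sample=False)[0]["generated_text"])
print(pipe.model)  # many nn.Linear layers inside the transformer blocks

# Quantize in place, skipping the language-model head so rounding errors
# do not accumulate across autoregressive generation steps.
replace_linear_with_target_and_quantize(model, W8A16LinearLayer, ["lm_head"])

print(pipe.model)  # linear layers replaced by W8A16LinearLayer; lm_head stays nn.Linear

# Generation with the quantized model
print(pipe("def hello_world():", max_new_tokens=20, do_sample=False)[0]["generated_text"])
```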
It has also been addressed by many recent papers, such as LLM.int8(), SmoothQuant, GPTQ, QLoRA, AWQ, and so on, and probably many more that I forgot to mention. We're also going to briefly explain the insights behind these papers in the next lesson. All right. We can also try to quantize models from other modalities. I wanted to show you how to call the quantizer on an object detection model. The workflow is going to be essentially the same: we're going to call the API that we have designed on a model that we load from transformers. We will use the DetrForObjectDetection class. DETR is an architecture designed by Facebook AI that is used for object detection, and you can inspect the model card on the Hub to get code snippets on how to run the model. So this is the way to load the processor and the model from the Hugging Face Hub, and we're going to call our quantizer on the model itself. Before quantizing the model, we can get its memory footprint: we just have to call model.get_memory_footprint(), and the size of the model should be something around 170MB. Perfect. I want to quickly test the model before quantizing it and visualize some results. We recently went to dinner all together and took a nice picture, so we're going to use this picture and try to detect as many objects as possible with the model. If you check out the model card, you can get an idea of which code snippet to use in order to run the model, so we're just going to use that and call plot_results on the results. Very cool: it was able to detect all the people here with the correct class, person. It was also able to detect the table here on the left, the phone, the cups, the knife, and also the car in the background, and even this laptop, which is a bit far away in the image. That's really cool. Now let's try to quantize the model and visualize the results of the quantized model. Before doing that, let's quickly inspect the model. The impact of quantization is going to be a bit smaller than for a language model because, as you can see here, there are many convolution-based layers, and those layers are not going to be quantized. But if you go down, the layers that are going to be quantized are the linear layers here in the encoder and decoder. We're still going to keep the practice of not quantizing the last layers, so the bounding box predictor we're going to keep in its original precision. We're going to call replace_linear_with_target_and_quantize with the model, the target class, and the module names "0", "1", "2", so that we don't quantize these layers, and class_labels_classifier as well; everything is specified here. Perfect. Let's inspect the model again. As expected, the convolution layers are still the same, and the encoder and decoder layers seem to be correctly quantized in int8. Perfect. And of course, we kept the last modules in their original precision. Let's visualize the results with the quantized model. Perfect. I think we got pretty much the same results; we were able to detect the same instances, even the computer that you can see in the background, the chairs, and the car as well. Let's also try to get a good idea of how much memory we managed to save. This is the new memory footprint: we were able to save around 50MB.
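Here is a similarly hedged sketch of the object-detection example. It assumes the standard facebook/detr-resnet-50 checkpoint referenced on the model card and the same quantization helpers as above; the excluded module names correspond to the bounding-box predictor layers and the class_labels_classifier mentioned in the walkthrough.

```python
from transformers import DetrImageProcessor, DetrForObjectDetection

# Assumed from earlier: W8A16LinearLayer, replace_linear_with_target_and_quantize

checkpoint = "facebook/detr-resnet-50"
processor = DetrImageProcessor.from_pretrained(checkpoint)
model = DetrForObjectDetection.from_pretrained(checkpoint)

print(model.get_memory_footprint())  # roughly 170 MB before quantization

# Quantize the linear layers in place, keeping the prediction heads
# ("0", "1", "2" are the linear layers of the bounding-box predictor MLP,
# plus the class_labels_classifier) in their original precision.
replace_linear_with_target_and_quantize(
    model, W8A16LinearLayer, ["0", "1", "2", "class_labels_classifier"]
)

print(model.get_memory_footprint())  # roughly 50 MB smaller after quantization
```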
Before, we were at around 160 to 170MB, so that's a reduction of maybe around 25 to 30%. That's not bad, and we managed to keep most of the capabilities of the model. Now, I invite you to pause the video and try out this approach on other models and other modalities. You can also try to break the quantizer and see what goes wrong; maybe, I don't know, try to quantize the last module as well and see how it affects the model's performance. You can try this on as many modalities as you want: a vision model, an audio model, a multimodal model. So feel free to pause the video and try out the API we designed together on other models. Also bear in mind that the API modifies the model in place, so once you have loaded the model and called the quantizer on it, you need to reload the model if you want to compare it with its original version.
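Since the quantizer works in place, one way you might set up a side-by-side comparison is to load two fresh copies of the model; this is just a sketch using the same assumed checkpoint and helper names as above.

```python
from transformers import DetrForObjectDetection

# Assumed from earlier: W8A16LinearLayer, replace_linear_with_target_and_quantize

checkpoint = "facebook/detr-resnet-50"

# Load two fresh copies: one stays in its original precision,
# the other is quantized in place for comparison.
original_model = DetrForObjectDetection.from_pretrained(checkpoint)
quantized_model = DetrForObjectDetection.from_pretrained(checkpoint)
replace_linear_with_target_and_quantize(
    quantized_model, W8A16LinearLayer, ["0", "1", "2", "class_labels_classifier"]
)

saved_bytes = original_model.get_memory_footprint() - quantized_model.get_memory_footprint()
print(f"Memory saved: {saved_bytes / 1e6:.1f} MB")
```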