Before wrapping up the lesson, I wanted to quickly show you a follow-up approach to the quantization API. Note that in the current design, we need to first load the model in its original precision in order to quantize it. That's not really optimal, because you need to allocate enough RAM to load the model in its default dtype before you can quantize it. In practice, you can quantize the model on a large instance instead: if you have a bigger machine, you can quantize the model there, push the quantized weights somewhere on the cloud, for example the Hugging Face Hub, and then directly load the model in 8-bit precision, or even lower precision, on your own machine.

So let's try to build something that covers this approach. Let's imagine we are now on the big instance, where we have enough RAM to quantize the model. We load a language model, for example OPT-125M; we're just loading a small model for demonstration purposes. Then we quantize the model by calling our quantizer. Let's make sure the model is quantized, and that's the case. You can retrieve the quantized state dict by calling model.state_dict() to get the model's weights, and then save it somewhere; let's say we save it locally here.

Then you can use some utility methods from the huggingface_hub library to push these quantized weights to the Hub. I'm just going to show you what that API looks like. You just have to import HfApi and create_repo from huggingface_hub. You put your Hugging Face username here, and then you define the ID of the repo where you want to push the quantized weights; it's going to be your username followed by the name of the repo. Here I decided to call it opt-125m-quantized-dlai (for DeepLearning.AI). We call HfApi() to initialize the Hugging Face Hub API. Optionally, you can create the repo from code; you can also do it manually through the Hugging Face website. Then you just have to call api.upload_file, giving it the local path of the file you want to push, the target path inside the repo, and the name of the repo, and that's all you need to successfully push your quantized weights to the Hugging Face Hub. A code sketch of this whole big-instance side follows below.

So you can imagine that on your large instance, where you have enough GPU or CPU RAM, you quantize your model like this and push it to the Hugging Face Hub. Then, on my local machine, all I have to do is load this smaller state dict, because it contains the quantized weights. We're going to leverage a feature from PyTorch called the meta device. The idea is to first load the skeleton of the model, in order to get the exact architecture, the correct modules, and so on. Once we have loaded that skeleton, we just need to replace all instances of linear layers with our quantized layers, without quantizing the model, because we don't have access to the weights: all the weights are on the meta device, meaning they are not actually initialized. Once you have replaced all the linear layers, you just have to call model.load_state_dict, passing the quantized state dict, and it will automatically assign the correct weights to each module. That way you save CPU RAM, because you never have to load the original model.
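Before we move to the local-machine side, here is a minimal sketch of what the big-instance part (load, quantize, save, push) could look like. It assumes the W8A16LinearLayer class and the replace_linear_with_target_and_quantize helper built earlier in this lesson, imported here from a helper module whose name may differ in your notebook; "your-username" and the repo name are placeholders you would replace with your own.

```python
import torch
from transformers import AutoModelForCausalLM
from huggingface_hub import HfApi, create_repo

# Quantized linear layer and the quantizing replacement helper from earlier in
# the lesson (module/function names assumed; adjust to match your own notebook).
from helper import W8A16LinearLayer, replace_linear_with_target_and_quantize

model_id = "facebook/opt-125m"

# 1. On the big instance, load the model in its original (half) precision ...
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 2. ... quantize it in place, keeping e.g. the lm_head in float16 ...
replace_linear_with_target_and_quantize(model, W8A16LinearLayer, ["lm_head"])

# 3. ... and save the quantized state dict locally.
torch.save(model.state_dict(), "quantized_state_dict.pth")

# 4. Push the quantized weights to the Hugging Face Hub.
username = "your-username"                      # your Hugging Face username
repo_id = f"{username}/opt-125m-quantized-dlai"

api = HfApi()
create_repo(repo_id)                            # optional: can also be done on the website
api.upload_file(
    path_or_fileobj="quantized_state_dict.pth",  # local path of the file to push
    path_in_repo="quantized_state_dict.pth",     # target path inside the repo
    repo_id=repo_id,
)
```

Note that create_repo only needs to run once (it raises an error if the repo already exists, unless you pass exist_ok=True); after that, upload_file pushes the single state-dict file, which is the only thing the local machine will need to download later.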
First, you directly load the quantized version of the model by loading the state dict, and you leverage the meta device from PyTorch, so you only have to load the skeleton of the model instead of the whole model itself. In terms of code, it would look something like this: you first load the config of the model to get the details about its architecture, and then you initialize the model under the context manager torch.device("meta") so that it gets created on the meta device.

Let's try that out to see what happens exactly. If you print the parameters of the model, you get something like this: the parameters are all listed, but the tensors are not initialized at all. You just have a bunch of meta tensors that don't take up any RAM, so you can have as many meta tensors as you want, while still having all the information about the model architecture. The model skeleton is already there: for each linear layer, you have everything except the actual weights. So instead of calling replace_linear_with_target_and_quantize, we just replace the linear layers with the quantized layers, without quantizing. Perfect. And if we print the model: perfect.

The next step of the workflow is to load the quantized state dict. For that, we're also going to leverage a method from the huggingface_hub library called hf_hub_download. With that method you can download any file from the Hub, and hf_hub_download returns you the path, inside your cache, to the downloaded object. Here I've already pushed the quantized weights to the Hub for you, and I'm specifying the path on the Hub of the file that I want to get. As you can see, the file is only 166 MB. The model has 125 million parameters, so if we stored the raw state dict of the model in half precision, we would need 125 million times two bytes, because in half precision each parameter takes two bytes; that would end up being about 250 MB. Here the state dict is only 166 MB, because most of the weights are in 8-bit precision, so one byte per parameter, which is roughly 125 MB. The remaining 40 MB or so comes from the parts kept in float16, probably the language model head, plus the scales, which are also stored in float16.

Then we load the state dict: since hf_hub_download returns the path to the cached file, we load the state dict from that path. And since the model sits on the meta device, we need to call load_state_dict with strict=True and assign=True. Perfect, all keys matched successfully, so our model has been loaded and it's ready to be used.

Let's generate: "Hello. Today I'm a student of the University of the Philippines. I'm a student. Philippines", and so on. Bear in mind that this is a small model, so it's more or less expected to get some repetition if you don't provide a lot of context. Let me try maybe another prompt: "Hello today ... give you a course, some of the history of the world and the history of the ...". So the model is still repeating itself a bit, but with some sampling methods, and by trying out a larger model, you might be able to get better results. The sketch below puts all of these local-machine steps together.
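To make this concrete, here is a minimal sketch of the local-machine side, under the same assumptions as before: the W8A16LinearLayer class and the replace_linear_with_target helper come from earlier in this lesson (imported from a helper module whose name may differ in your notebook), and the repo ID and filename are placeholders matching whatever you pushed in the previous step.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download

# Quantized linear layer and the (non-quantizing) replacement helper from earlier
# in the lesson (module/function names assumed; adjust to match your own notebook).
from helper import W8A16LinearLayer, replace_linear_with_target

model_id = "facebook/opt-125m"

# 1. Build only the skeleton of the model on the meta device: correct architecture
#    and modules, but meta tensors that take no RAM.
config = AutoConfig.from_pretrained(model_id)
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

# 2. Swap the nn.Linear modules for the quantized layer class WITHOUT quantizing,
#    since the meta tensors carry no real weight values yet.
replace_linear_with_target(model, W8A16LinearLayer, ["lm_head"])

# 3. Download the quantized state dict from the Hub; hf_hub_download returns the
#    local path of the cached file.
state_dict_path = hf_hub_download(
    repo_id="your-username/opt-125m-quantized-dlai",  # placeholder repo from the previous step
    filename="quantized_state_dict.pth",              # placeholder filename
)
state_dict = torch.load(state_dict_path, map_location="cpu")

# 4. Materialize the weights: assign=True (available in recent PyTorch) replaces
#    the meta tensors with the tensors coming from the state dict.
model.load_state_dict(state_dict, strict=True, assign=True)

# 5. Use the model as usual.
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello today I am", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The point of this workflow is that the full-precision model is never materialized on the local machine: the only tensors that ever get allocated are the quantized ones coming from the downloaded state dict.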
That's it for today's lesson. In the next lesson, we're going to look at some challenges. In this lesson I mentioned the outlier features that show up in large language models when you quantize them; there are also other challenges, such as storing low-bit weights, for example 2-bit or 4-bit weights. You're going to try your hand at packing the weights in order to store 2-bit or 4-bit weights, and we're also going to cover the most recent state-of-the-art approaches for addressing the outlier-feature challenge when it comes to quantizing large language models. So, yeah, see you in the next lesson.