In this lesson, you will load machine-learning models in different data types such as float16 or bfloat16 and study the impact on the models' performance. Let's get started.

Welcome to this new lab session. In this lab, you will put into practice what you learned in the first lab session. Specifically, you will see how to load ML models in different data types such as float32, float16, or bfloat16. You will also learn how to load popular generative AI models in different precisions and study the impact of loading them in half precision on their performance. Finally, you will learn how to load any model in your workflow directly in your desired half-precision data type, out of the box.

At the beginning of the lesson, we will inspect the data type of a model. So what exactly do we mean by the data type of a model? Recall that each layer of an ML model contains weights that are used for inference, that is, when you get the model's predictions. Each weight is usually stored as a matrix of learnable parameters, which can be represented in different precisions. For example, in this dummy model we have, say, 12 layers, and each layer's weight matrix has n parameters, each stored in 32-bit precision. Inspecting the model's data type is therefore equivalent to inspecting the data type of the model's weights.

So let's get started. We'll first inspect the data type of a dummy model. For that, we've prepared a dummy architecture, which we'll import from the helper methods, then load and print. As you can see, this is a small model with a token embedding layer, a linear layer, a layer norm, another linear layer, a last layer norm, and a language model head. It's essentially a very small, dummy language model.

We said we wanted to check the data type of the model. For that, we'll use a PyTorch utility called named_parameters() that you can call on a module to loop over each of the module's parameters together with its name. So we'll loop over the named parameters and simply print each parameter's name together with its dtype; to access the data type of a parameter, you just call param.dtype. Let's test the method out. Perfect. As you can see, we're able to print each weight's name together with its dtype, and all the model's weights are loaded in float32, which is the default for PyTorch.

Now let's see how to cast the model into different precisions, such as float16 or bfloat16. The API for casting any PyTorch module is pretty straightforward. Say your target dtype is float16: you simply call model.to() with your dtype, or model.half() for float16, or model.bfloat16() for bfloat16. Let's try that right away and also print the dtype of each model parameter using the method we defined before. We call print_param_dtype() on model_fp16, and as you can see, all the model's weights have been converted to float16.

All right, great! Now let's say I want to use the model and perform a simple inference. Recall that I'm using a CPU instance here, so we're going to see whether this works out of the box.
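Before that, here is a minimal sketch of what we have done so far. The DummyModel class below is only a stand-in for the helper model used in the lab (the real lab imports its own version from a helper module, and the exact layer sizes here are illustrative); print_param_dtype is the small utility we just described.

```python
import torch
import torch.nn as nn

# Stand-in for the dummy transformer-like model used in the lab
# (the real model is imported from a helper module; sizes are illustrative).
class DummyModel(nn.Module):
    def __init__(self, vocab_size=2, hidden_size=2):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.linear_1 = nn.Linear(hidden_size, hidden_size)
        self.layernorm_1 = nn.LayerNorm(hidden_size)
        self.linear_2 = nn.Linear(hidden_size, hidden_size)
        self.layernorm_2 = nn.LayerNorm(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        x = self.token_embedding(input_ids)
        x = self.layernorm_1(self.linear_1(x))
        x = self.layernorm_2(self.linear_2(x))
        return self.lm_head(x)

def print_param_dtype(model):
    # named_parameters() yields (name, parameter) pairs for every weight
    for name, param in model.named_parameters():
        print(f"{name} is loaded in {param.dtype}")

model = DummyModel()
print_param_dtype(model)  # every parameter is torch.float32 by default

# Casting a whole module to half precision
model_fp16 = DummyModel().half()              # equivalent to .to(torch.float16)
model_bf16 = DummyModel().to(torch.bfloat16)  # equivalent to .bfloat16()
print_param_dtype(model_fp16)
```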
So let's define a dummy input. Here we're using a long tensor, which corresponds to the IDs of the tokens you pass to a transformers-like model: recall that the first layer of our architecture is an embedding layer. The embedding layer expects a long tensor as input, and it outputs hidden states in floating-point precision, so FP32 if your model is in float32, or FP16 if your model is loaded in float16.

All right, we have our input. Let's first run an inference with the float32 model and check that it works. Perfect, and we can also print the final logits. Then we'll try to run an inference with the FP16 model and see if it fails. It fails with an error along the lines of "addmm not implemented for Half", which means that some CPU kernels are not implemented for FP16. This is one of the disadvantages of PyTorch with FP16 on CPU: for most transformer-based models, you cannot use float16 out of the box.

One way to overcome this issue is to load the model in bfloat16 instead of float16 and perform inference that way. Let's create a new dummy model, cast it to bfloat16, and get its logits. We're going to create a copy of the model using deepcopy, so that the BF16 and FP32 models share the same weights, and then cast the copy to bfloat16. Let's print the parameters as a sanity check. Perfect, the model is now in BF16, so let's get its logits using the same input.

Then we can compute the mean difference between the two sets of logits to check whether there is any significant gap between them. Let's run this cell to get an idea. Perfect. We can see only very small differences between the full-precision model and the BF16 model. In practice, switching from FP32 to BF16 doesn't lead to a noticeable performance degradation, even for large models. Casting FP32 models to BF16 is most of the time, if not all the time, performance cost-free.
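Here is a minimal sketch of that comparison, continuing from the dummy-model sketch above (the names model, model_fp16, and print_param_dtype come from that sketch and are assumptions, not the lab's exact code).

```python
from copy import deepcopy
import torch

# Token IDs must be a LongTensor because the first layer is an embedding layer
dummy_input = torch.LongTensor([[1, 0], [0, 1]])

# FP32 inference works fine on CPU
logits_fp32 = model(dummy_input)
print(logits_fp32)

# On the lab's CPU setup, FP16 inference raised an error like:
#   RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
# logits_fp16 = model_fp16(dummy_input)

# BF16 is the CPU-friendly alternative: copy the FP32 weights, then cast
model_bf16 = deepcopy(model).to(torch.bfloat16)
print_param_dtype(model_bf16)  # sanity check: everything is torch.bfloat16
logits_bf16 = model_bf16(dummy_input)

# Mean and max absolute differences between the two sets of logits
diff = torch.abs(logits_bf16 - logits_fp32)
print(f"Mean diff: {diff.mean().item()} | Max diff: {diff.max().item()}")
```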
In the second section of this lesson, let's see how to load some popular generative models in different data types and study the impact on their performance. We're going to load a multimodal model, meaning a model that can take several modalities as input. Specifically, we'll load a BLIP model, which takes text and an image and predicts some text. We'll use a model called BLIP image captioning, which performs image captioning: you pass an image, optionally with some text, and the model tries to describe what's in the image given the context you passed. If you're interested in learning more about this model and other models in the Hugging Face ecosystem, have a look at our short course, Open Source Models with Hugging Face, where we show you how to load these models and build fun and cool demos around them. So let's get started. We just have to import the BlipForConditionalGeneration class from Transformers, and, as I said, we're going to use the blip-image-captioning-base model.

To load the model, nothing is simpler than calling from_pretrained() with the model name. Transformers loads the model in full precision by default, so float32, which is the default for PyTorch. We can confirm that using the method we designed earlier: if you print the dtype of each model parameter (there are a lot of parameters now, because this model is much larger than our dummy model), all of them should be in float32, which is the value we expect. Perfect.

We can also look at the so-called memory footprint of the model, meaning how much memory, in megabytes or gigabytes, the model takes. For that, we can call model.get_memory_footprint(), and we can print the value in bytes as well as in megabytes. The float32 model takes approximately 990 MB.

Now let's see how to load the model in a different precision, such as float16 or bfloat16, as we've seen before. The canonical way in Transformers to load models in a different precision is to pass the argument torch_dtype set to your target dtype directly to from_pretrained(). So we're going to do that and load our bfloat16 model, since float16 doesn't work on CPU for us. Once the model is loaded, we can directly check its memory footprint, and we can also print the relative difference between the two footprints to see how much memory we gained. As you can see, the bfloat16 model is half the size of the FP32 model; we halved the size of the model simply by passing torch_dtype=torch.bfloat16.

You may be wondering how this affects the model's predictions or generations: is this reduction free? We're going to see that now with a qualitative comparison between the two models. According to the model card, this is the way to load the model, load the model's processor and an image, and get a generation, so we're going to do that with both models. We'll load the processor first; we can of course use the same processor for both models, since there is no difference there. Then we'll load an image from the internet and display it for you: it's a simple image of a woman and a dog on the beach. We wrapped the whole generation pipeline in a small get_generation() helper method for you, so that it's easier to get the model's generations. We just call that method, passing the processor, the image, and the dtype of the model. We can get the result of the full-precision model as follows and print its prediction. Perfect: "a woman sitting on the beach with her dog". Nice. Now let's try the BF16 model and qualitatively compare the two results.

All right, we got pretty similar results between the two models. The only difference is that the FP32 model predicted "with her dog", whereas the BF16 model predicted "with a dog". But in both cases the results are consistent with the image and pretty accurate.
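For reference, here is a minimal sketch of this BLIP workflow. The model ID is the real Hugging Face Hub checkpoint, but the get_generation helper below is only my guess at what the lab's helper does (cast the processed inputs to the model's dtype, generate, then decode), and the image URL is a placeholder you would replace with your own.

```python
import torch
import requests
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

model_name = "Salesforce/blip-image-captioning-base"

# Default load: full precision (float32)
model_fp32 = BlipForConditionalGeneration.from_pretrained(model_name)
fp32_mem = model_fp32.get_memory_footprint()
print(f"FP32 footprint: {fp32_mem / 1e6:.1f} MB")

# Load directly in bfloat16 by passing torch_dtype to from_pretrained
model_bf16 = BlipForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
)
bf16_mem = model_bf16.get_memory_footprint()
print(f"BF16 footprint: {bf16_mem / 1e6:.1f} MB "
      f"({bf16_mem / fp32_mem:.2f}x the FP32 size)")

# The same processor works for both models
processor = BlipProcessor.from_pretrained(model_name)

def get_generation(model, processor, image, dtype):
    # Cast the floating-point inputs (pixel values) to the model's dtype
    inputs = processor(image, return_tensors="pt").to(dtype)
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

# Placeholder URL: use any image you like
img_url = "https://example.com/beach.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

print(get_generation(model_fp32, processor, image, torch.float32))
print(get_generation(model_bf16, processor, image, torch.bfloat16))
```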
The reason the generated token was affected here is that the small errors between the FP32 logits and the BF16 logits accumulate layer after layer. And since the model is autoregressive, meaning it uses the result of the previous iteration to produce the result of the next iteration, these errors keep accumulating until, at some point, they impact the model's prediction. But overall, this doesn't really affect the model's performance, and you can expect to use bfloat16 out of the box if you are on CPU, or float16 if you are using a GPU.

All right, before wrapping up the lesson, I wanted to give a quick heads-up on how the torch_dtype argument works under the hood in Transformers, and how you can adopt the same mechanism in your own workflow. What I mean is that the current workflow has a small issue: we first load the model in float32 and then cast it to, say, float16 or bfloat16. That can be a problem in practice, for example in production, because you have to load the bigger model first and only then cast it to FP16 or BF16. You might want to load the model directly in your desired precision out of the box, without first loading it in full precision, in order to save memory.

Under the hood, Transformers calls a PyTorch utility called set_default_dtype(), to which you pass the desired dtype. That way, when you initialize your model, it gets initialized directly in your target dtype. Let's see how to do that. Say I want to initialize my model directly in BF16: I just call torch.set_default_dtype(torch.bfloat16), and then I can initialize my model and it is automatically created in bfloat16. Perfect. Once you have done that, don't forget to reset the default dtype to float32 so that you avoid unexpected behavior: if you later want to create other tensors or inputs in float32, you should revert to the default dtype. This doesn't affect the dtype of the model you have already loaded. (A short sketch of this pattern appears at the end of the lesson.)

So that's it for this lesson. I invite you to try out these approaches on other models. You can also try loading different models from the Hugging Face Hub in different precisions. You can try different modalities as well: audio models, vision models, and so on; load them in different precisions, study the impact a bit, and play with them. And if you find this last trick useful, don't hesitate to try it out in your own workflow.

In this lesson, you learned how to load models in half precision, either FP16 or BF16. In the next lesson, you will learn how to use Hugging Face's Quanto library to load your models in int8 precision by quantizing them. So let's move on to the next lesson.
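As promised, here is a minimal sketch of the set_default_dtype() pattern, reusing the hypothetical DummyModel and print_param_dtype from the earlier sketches.

```python
import torch

# Temporarily make bfloat16 the default floating-point dtype, so any
# model created afterwards is initialized directly in bfloat16
torch.set_default_dtype(torch.bfloat16)

dummy_model_bf16 = DummyModel()      # weights are created in torch.bfloat16
print_param_dtype(dummy_model_bf16)

# Reset the default so unrelated tensors go back to float32; this does not
# change the dtype of the model created above
torch.set_default_dtype(torch.float32)
print_param_dtype(dummy_model_bf16)  # still torch.bfloat16
```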