We're going to wrap up the course with these explanations. In the previous lab, I briefly mentioned the notion of emergent features in large language models, and this is one of the biggest challenges when it comes to quantizing them. Once the open-source community had access to more and more large language models, such as OPT, the open pre-trained transformers from Facebook released in 2022, researchers started to dive into the capabilities of these models and discovered some so-called emergent features at scale.

What do we mean exactly by emergent features? Simply, characteristics or features that appear at scale, meaning when the model is large. It turns out that for some models at scale, the features produced by the model, meaning the magnitudes of the hidden states, start to get large. This makes classic quantization schemes quite obsolete, and classic linear quantization algorithms simply fail on those models. Since these large language models were open sourced, many papers have decided to tackle this specific challenge: how to deal with outlier features in large language models. Again, outlier features simply means hidden states with a large magnitude. There are some interesting papers, such as LLM.int8, SmoothQuant, and AWQ, and I want to give a brief explanation of each one to give you some insight into the potential solutions to this specific issue.

LLM.int8 proposes to decompose the underlying matrix multiplication of the linear layers into two parts. If you consider the input hidden states, which you can see in the big matrix here, it is possible to split the matmul into an outlier part, made of all the hidden-state columns whose values are greater than a certain threshold, and a non-outlier part. The idea is very simple. For the non-outlier part, you quantize the inputs, perform the matrix multiplication in eight-bit, and then dequantize using the scales so that you get the result back in the input data type. The outlier part you compute classically, in the original dtype of the hidden states, usually half precision. Then you combine both results. This way, it has been shown that you can retain the full performance of the model without any degradation.
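To make this decomposition a bit more concrete, here is a minimal sketch of the idea in PyTorch. This is not the actual LLM.int8 / bitsandbytes implementation: the function names are mine, I use simple symmetric absmax quantization, and the 6.0 threshold is just a commonly cited default. It only shows the two-branch structure: an integer matmul for the regular columns and a half/full-precision matmul for the outlier columns.

```python
import torch

def absmax_quantize(t, dim):
    # Symmetric 8-bit absmax quantization along `dim`: scale = max|t| / 127, q = round(t / scale).
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(t / scale).to(torch.int8)
    return q, scale

def int8_decomposed_matmul(x, w, threshold=6.0):
    """LLM.int8-style mixed-precision matmul sketch.

    x: [tokens, hidden] activations, w: [hidden, out] weights,
    threshold: magnitude above which a hidden dimension is treated as an outlier.
    """
    # 1. Find the hidden dimensions that contain at least one outlier value.
    outlier_cols = (x.abs() > threshold).any(dim=0)

    # 2. Outlier columns are multiplied in the original precision.
    y_outlier = x[:, outlier_cols] @ w[outlier_cols, :]

    # 3. The remaining columns are quantized to int8, multiplied, then dequantized.
    x_q, x_scale = absmax_quantize(x[:, ~outlier_cols], dim=1)   # per row (per token)
    w_q, w_scale = absmax_quantize(w[~outlier_cols, :], dim=0)   # per column (per output feature)
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32)              # naive integer matmul on CPU
    y_regular = acc.float() * x_scale.float() * w_scale.float()

    # 4. Combine both parts and return the result in the input dtype.
    return (y_regular + y_outlier.float()).to(x.dtype)

# Tiny usage example with an artificial outlier channel.
x = torch.randn(4, 16)
x[:, 3] = 20.0
w = torch.randn(16, 8)
print((int8_decomposed_matmul(x, w) - x @ w).abs().max())  # small quantization error
```

In a real implementation the int8 branch runs on dedicated integer kernels; here the integer matmul is done naively just to show the data flow.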
Another very interesting approach is called SmoothQuant. SmoothQuant specifically applies to W8A8 schemes, meaning we also want to quantize the activations, so both the activations and the weights are in eight-bit precision. The paper also tackles the issue of outlier features in large language models, and it proposes to mitigate it by smoothing both the activations and the weights. Given a per-channel factor determined from the input activations, you migrate part of the quantization difficulty from the activations onto the weights, so that the difficulty is shared between the two. That way you can also retain the full capabilities of the model.

A more recent paper, called AWQ, also treats outlier features in a special way. The paper, which comes from the same lab as SmoothQuant, proposes to first iterate over a small dataset, called a calibration dataset, to get a detailed idea of which input channels of the weights are responsible for generating outlier features; these are called salient weights. The idea is then to use that information to scale the model weights before quantization, and to use the same scales during inference to rescale the input as well. You'll find a small sketch of this scaling idea at the end of this lesson.

These are just a few of them. There are numerous other papers that specifically address this issue to make large language model quantization effective and efficient. Here is a non-exhaustive list of those quantization techniques, and you can probably find many more by the time you watch this. If you are curious, I invite you to read these papers in detail, dive into them, and try to understand them.

Outlier features are one of the challenges when it comes to quantizing large language models: because the models are quite large, you can get some surprising behavior. There are also other challenges. The quantization-aware training field still seems a little underexplored today, so training models directly in low bit-widths could be an interesting topic to dive into. There is also the challenge of limited hardware support. In this course we only focused on the W8A16 scheme, meaning the weights are in eight bits while the activations stay in sixteen bits, but for a more efficient quantization scheme you may also be interested in other schemes such as W8A8. Not all hardware supports eight-bit operations, though. There is also the challenge around calibration datasets: for some quantization methods, you need a calibration dataset to perform some sort of model pre-processing that makes the quantized model better. And there are also challenges in terms of weight packing and unpacking.

If you are really interested in this topic, I invite you to do some further reading, for example the state-of-the-art quantization papers. There is also a lab called MIT Han Lab, which is behind several of these state-of-the-art quantization papers, and they have good resources from which you can learn more about this topic. You can also check out the Hugging Face Transformers quantization documentation and blog posts, and have a look at the llama.cpp repository discussions, where you can find some really insightful experiments and conversations. You can also check out Reddit: there is a subreddit called r/LocalLLaMA where people share a lot of cool insights about quantization, and where you can learn about new methods as they come out. I'm probably missing many more resources, but these are the ones that I know.

So that's it for this lesson. I hope you learned a lot through this course, that you can use the things we've shown you in your work or your projects, and that all of this gives you some ideas for cool things you can build. We're going to move on to the next video, where we'll say thank you for going through this course and suggest potential next steps. See you there.
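As a companion to the transcript, here is the rough sketch of the per-channel scaling idea behind SmoothQuant and AWQ mentioned above. This is not the authors' implementation: the helper names are mine, the closed-form scale formula is the SmoothQuant-style one, and AWQ instead searches for the best scaling per layer on the calibration set rather than fixing an exponent.

```python
import torch

@torch.no_grad()
def compute_channel_scales(calib_activations, weight, alpha=0.5):
    """Per-input-channel scales that shift quantization difficulty from activations to weights.

    calib_activations: list of [tokens, in_features] tensors gathered by running
                       a small calibration dataset through the model.
    weight:            [out_features, in_features] weight of the linear layer.
    alpha:             balance factor (SmoothQuant-style; AWQ grid-searches this instead).
    """
    act_absmax = torch.stack([a.abs().amax(dim=0) for a in calib_activations]).amax(dim=0)
    w_absmax = weight.abs().amax(dim=0)
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

@torch.no_grad()
def apply_scales(weight, scales):
    # Fold the scales into the weight before quantization: W' = W * s (per input channel).
    # At inference the input is rescaled the other way, X' = X / s, so that X' @ W'.T == X @ W.T.
    return weight * scales
```

On top of this, the scaled weights would then be quantized with whatever scheme you prefer (for example simple per-channel eight-bit absmax quantization), and the division of the activations by the scales can often be fused into the preceding operation so it comes essentially for free at inference time.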