In this lesson, you'll get a sense of some common challenges that come up when applying low-bit quantization, such as 2 or 4 bits, by diving into weight packing. In addition, you'll wrap up the course with some insights into state-of-the-art quantization methods. Let's pack some weights.

In this lesson, we are going to discuss the common challenges you can face when you want to try out low-bit quantization, such as 2 or 4 bits, and we're going to implement weight packing from scratch. Specifically, you will learn why weight packing is important for storing quantized weights. We'll also store and load 2- and 4-bit weights in a packed unsigned int8 tensor. We will also look together at other challenges with quantizing generative models such as LLMs, and quickly review some state-of-the-art LLM quantization methods. So let's get started.

Before starting the lab, I wanted to give some small context on why packing is important and why we need it when storing quantized weights. Assume you want to quantize your model to 4-bit precision and store the weights in a torch tensor. Ideally, you would create a tensor with some values and pass dtype=torch.int4, or cast the tensor to int4 afterwards. The problem is that, at the time we speak, there is no native support for 4-bit weights in PyTorch, so we need to find a way to store those 4-bit weights in an efficient manner.

Right now, the only possible solution is, instead of saving the tensor in 4-bit, to save it in 8-bit, since that is currently the data type with the smallest precision available in PyTorch. But this is not really ideal, because the tensor will occupy 8 bits per data point even though, in practice, it only needs 4 bits, since you have encoded your parameters in 4-bit precision. That definitely adds considerable overhead for large models. Therefore, if we go for the naive approach, meaning we store the 4-bit weights in an 8-bit tensor, there is no point quantizing the model to 4-bit, because all the parameters end up stored in 8-bit precision. That is why we need to pack the 4-bit weights into an 8-bit tensor.

So how does this packing work in detail? Consider a tensor that stores four values which can be represented in 2-bit precision. Recall that in 2-bit precision you can encode at most 2^2 = 4 values: 0, 1, 2, and 3. Imagine we have a parameter of a model that we have encoded in 2-bit precision, and its values are 1, 0, 3, 2. Right now in PyTorch we cannot store the model weights in 2 bits, so we have to store them in 8-bit precision, and we end up with a tensor that takes four times 8 bits in terms of memory footprint: 1 in 8 bits, 0 in 8 bits, 3 in 8 bits, and 2 in 8 bits. As I said, this is not optimal, because you need to allocate four times 8 bits of memory to store weights that each fit in only 2 bits. So what can we do to ignore these bits that we don't need?
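To make the overhead of the naive approach concrete, here is a minimal sketch in PyTorch. The variable names are just illustrative, and the commented-out line shows that a 4-bit dtype is not something you can simply create a tensor with:

```python
import torch

# The 2-bit values from the example: 1, 0, 3, 2. Each fits in 2 bits,
# but the smallest integer dtype we can store them in is 8 bits.
weights_2bit = [1, 0, 3, 2]

# torch.tensor(weights_2bit, dtype=torch.int4)  # no generally usable 4-bit dtype in PyTorch

# Naive approach: one uint8 slot per 2-bit value.
naive = torch.tensor(weights_2bit, dtype=torch.uint8)
print(naive)                                 # tensor([1, 0, 3, 2], dtype=torch.uint8)
print(naive.element_size() * naive.numel())  # 4 bytes, even though 1 byte would be enough
```

Four bytes for four 2-bit values means three quarters of the allocated bits carry no information.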
That's exactly the challenge packing addresses: it packs only the relevant bits together into a single 8-bit value. Let's pack these four weights into a single 8-bit parameter. We start from the last value, 2, which is 10 in binary, and place those two bits at the top of our new 8-bit parameter; then come 3 (11), 0 (00), and 1 (01), giving the byte 10110001. In other words, the first weight ends up in the least significant bits. If we store that in 8 bits, we end up with a new tensor holding a single value instead of four, but this single value encodes all the parameters that were stored in 2 bits. In uint8, this value is 177.

The advantage of packing is that it reflects the true memory footprint of the quantized weights. Again, with the naive approach we need to allocate four times 8 bits, whereas in the packed case a single 8-bit parameter stores all four 2-bit parameters.

Of course, this comes at a price. Whenever we want to perform inference, we need to unpack the weights back to their original state, because most operations are not natively supported in 2-bit or 4-bit in PyTorch. Also, the number of values in the tensor we pack needs to be a multiple of 8 divided by the number of bits: a multiple of 4 in the 2-bit case, and a multiple of 2 in the 4-bit case. So if we had five 2-bit parameters, we would need to allocate an extra 8-bit value that encodes only a single 2-bit parameter. So let's see how this looks in terms of implementation, and move on to the lab.
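As a preview of what the lab implements, here is a minimal sketch of 2-bit packing and unpacking under the convention described above. The helper names pack_2bit and unpack_2bit are illustrative, not the lab's actual function names:

```python
import torch

def pack_2bit(values: torch.Tensor) -> torch.Tensor:
    """Pack a tensor of 2-bit values (length a multiple of 4) into uint8 bytes.

    The first value of each group of four lands in the least significant bits.
    """
    assert values.numel() % 4 == 0, "need a multiple of 8 // nbits = 4 values"
    values = values.to(torch.uint8).reshape(-1, 4)
    packed = torch.zeros(values.shape[0], dtype=torch.uint8)
    for i in range(4):
        packed |= values[:, i] << (2 * i)   # shift each value into its 2-bit slot
    return packed

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the 2-bit values from packed uint8 bytes."""
    groups = [(packed >> (2 * i)) & 0b11 for i in range(4)]
    return torch.stack(groups, dim=1).reshape(-1)

weights = torch.tensor([1, 0, 3, 2], dtype=torch.uint8)
packed = pack_2bit(weights)
print(packed)               # tensor([177], dtype=torch.uint8)
print(unpack_2bit(packed))  # tensor([1, 0, 3, 2], dtype=torch.uint8)
```

Running this on the example weights reproduces the packed byte 177 and recovers 1, 0, 3, 2 on unpacking; generalizing to 4-bit only changes the shift width and the group size.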