In this lesson, you'll get a sense of some common challenges that come up when applying low-bit quantization, such as 2 or 4 bits, by diving into weight packing. In addition, you'll wrap up the course with some insights into state-of-the-art quantization methods. Let's pack some weights.

In this lesson, we are going to discuss the common challenges you can face when you want to try out low-bit quantization, such as 2 or 4 bits, and we're going to implement weight packing from scratch. Specifically, you will learn why weight packing is important for storing quantized weights. We'll also store and load 2- and 4-bit weights in a packed unsigned int8 tensor. We will also look together at other challenges with quantizing generative models such as LLMs, and quickly review some state-of-the-art LLM quantization methods. So let's get started.

Before starting the lab, I wanted to give some small context on why packing is important and why we need it when storing quantized weights. Assume you want to quantize your model to 4-bit precision and store the weights in a torch tensor. Ideally, you would create a tensor with some values and pass dtype=torch.int4, or cast the tensor to int4 afterwards. The problem is that, at the time we speak, there is no native support for 4-bit weights in PyTorch, so we need to find a way to store those 4-bit weights in an efficient manner.

Right now, the only possible solution is, instead of saving the tensor in 4-bit, to save it in 8-bit, since that is currently the data type with the smallest precision available in PyTorch. But this is not really ideal, because the tensor will occupy 8 bits per data point even though, in practice, it only needs 4 bits, since you have encoded your parameters in 4-bit precision. That definitely adds considerable overhead for large models. Therefore, if we go for the naive approach, meaning we store the 4-bit weights in an 8-bit tensor, there is no point quantizing the model to 4-bit, because all the parameters end up stored in 8-bit precision. That is why we need to pack the 4-bit weights into an 8-bit tensor.

So how does this packing work in detail? Consider a tensor that stores four values which can be represented in 2-bit precision. Recall that in 2-bit precision you can encode at most 2^2 = 4 values: 0, 1, 2, and 3. Imagine we have a parameter of a model that we have encoded in 2-bit precision, and its values are 1, 0, 3, 2. Right now in PyTorch we cannot store the model weights in 2 bits, so we have to store them in 8-bit precision, and we end up with a tensor that takes four times 8 bits in terms of memory footprint: 1 in 8 bits, 0 in 8 bits, 3 in 8 bits, and 2 in 8 bits. As I said, this is not optimal, because you need to allocate four times 8 bits of memory to store weights that each fit in only 2 bits. So what can we do to ignore these bits that we don't need?
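To make the overhead of the naive approach concrete, here is a minimal sketch in PyTorch. The variable names are just illustrative, and the commented-out line shows that a 4-bit dtype is not something you can simply create a tensor with:

```python
import torch

# The 2-bit values from the example: 1, 0, 3, 2. Each fits in 2 bits,
# but the smallest integer dtype we can store them in is 8 bits.
weights_2bit = [1, 0, 3, 2]

# torch.tensor(weights_2bit, dtype=torch.int4)  # no generally usable 4-bit dtype in PyTorch

# Naive approach: one uint8 slot per 2-bit value.
naive = torch.tensor(weights_2bit, dtype=torch.uint8)
print(naive)                                 # tensor([1, 0, 3, 2], dtype=torch.uint8)
print(naive.element_size() * naive.numel())  # 4 bytes, even though 1 byte would be enough
```

Four bytes for four 2-bit values means three quarters of the allocated bits carry no information.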
That's exactly the challenge packing addresses: it packs only the relevant bits together into a single 8-bit value. Let's pack these four weights into a single 8-bit parameter. We start from the last value, 2, which is 10 in binary, and place those two bits at the top of our new 8-bit parameter; then come 3 (11), 0 (00), and 1 (01), giving the byte 10110001. In other words, the first weight ends up in the least significant bits. If we store that in 8 bits, we end up with a new tensor holding a single value instead of four, but this single value encodes all the parameters that were stored in 2 bits. In uint8, this value is 177.

The advantage of packing is that it reflects the true memory footprint of the quantized weights. Again, with the naive approach we need to allocate four times 8 bits, whereas in the packed case a single 8-bit parameter stores all four 2-bit parameters.

Of course, this comes at a price. Whenever we want to perform inference, we need to unpack the weights back to their original state, because most operations are not natively supported in 2-bit or 4-bit in PyTorch. Also, the number of values in the tensor we pack needs to be a multiple of 8 divided by the number of bits: a multiple of 4 in the 2-bit case, and a multiple of 2 in the 4-bit case. So if we had five 2-bit parameters, we would need to allocate an extra 8-bit value that encodes only a single 2-bit parameter. So let's see how this looks in terms of implementation, and move on to the lab.
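As a preview of what the lab implements, here is a minimal sketch of 2-bit packing and unpacking under the convention described above. The helper names pack_2bit and unpack_2bit are illustrative, not the lab's actual function names:

```python
import torch

def pack_2bit(values: torch.Tensor) -> torch.Tensor:
    """Pack a tensor of 2-bit values (length a multiple of 4) into uint8 bytes.

    The first value of each group of four lands in the least significant bits.
    """
    assert values.numel() % 4 == 0, "need a multiple of 8 // nbits = 4 values"
    values = values.to(torch.uint8).reshape(-1, 4)
    packed = torch.zeros(values.shape[0], dtype=torch.uint8)
    for i in range(4):
        packed |= values[:, i] << (2 * i)   # shift each value into its 2-bit slot
    return packed

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the 2-bit values from packed uint8 bytes."""
    groups = [(packed >> (2 * i)) & 0b11 for i in range(4)]
    return torch.stack(groups, dim=1).reshape(-1)

weights = torch.tensor([1, 0, 3, 2], dtype=torch.uint8)
packed = pack_2bit(weights)
print(packed)               # tensor([177], dtype=torch.uint8)
print(unpack_2bit(packed))  # tensor([1, 0, 3, 2], dtype=torch.uint8)
```

Running this on the example weights reproduces the packed byte 177 and recovers 1, 0, 3, 2 on unpacking; generalizing to 4-bit only changes the shift width and the group size.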