Quantization methods are used to make models smaller, which makes them more accessible to the AI community. In this lesson, you'll get an overview of what quantization is and how it works. Let's get started.

We have seen previously that quantization is an exciting topic, as it enables us to shrink models for better accessibility to the community. In this lesson, we will learn how to implement some quantization primitives from scratch, build our own model quantizer, and cover some challenges that anyone can face with lower-bit quantization, such as weight packing. Let's get started.

First, let's have a quick glance at what we learned in the first course. In its introduction, we listed the main techniques one could use to compress a model. Quantization aims at representing the parameters of the model in a lower precision. With knowledge distillation, you train a smaller student model using the outputs of a bigger teacher model. And with pruning, you simply remove some connections inside the model, meaning removing weights, to make the model more sparse. We also covered common data types in machine learning, such as INT8 and the float formats. We performed linear quantization with Hugging Face's Quanto library in a few lines of code. And finally, we wrapped up the course with an overview of how quantization can be leveraged in different use cases, such as fine-tuning large language models.

So let's see what we are going to cover in this course. First, we will dive deep into the internals of linear quantization and implement some of its variants from scratch, such as per-tensor, per-channel, and per-group quantization. We will study the advantages and drawbacks of each of these methods and see their impact on some random tensors.

Next, we will build our own quantizer to quantize any model in 8-bit precision using one of the quantization schemes presented before. Note that this quantization scheme is agnostic to the modality, meaning you can apply it to any model as long as the model contains linear layers. Technically, you will be able to use your quantizer to quantize a vision, text, audio, or even a multimodal model.

Finally, we will wrap up the course by learning about some challenges you can face with extreme quantization, such as weight packing, which is a common challenge these days. At the time of recording, PyTorch does not have native support for 2-bit or 4-bit weights. One way to address this is to pack these low-precision weights into a higher-precision tensor, for example INT8. We will dive into that and implement the packing and unpacking algorithms together. We will end the course by covering other common challenges that arise when quantizing large models such as LLMs, and review some state-of-the-art quantization methods.

So let's get started and shrink some models.
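To give a flavor of the linear quantization primitives mentioned above, here is a minimal sketch of per-tensor asymmetric quantization with a scale and zero-point, written in PyTorch. The function names are illustrative and not taken from the course notebooks.

```python
import torch

def linear_quantize_per_tensor(x: torch.Tensor, dtype=torch.int8):
    # Illustrative per-tensor asymmetric linear quantization:
    # q = clamp(round(x / scale) + zero_point, qmin, qmax)
    qmin, qmax = torch.iinfo(dtype).min, torch.iinfo(dtype).max
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min().item() / scale.item()))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(dtype)
    return q, scale, zero_point

def linear_dequantize(q: torch.Tensor, scale, zero_point):
    # Recover an approximation of the original tensor.
    return scale * (q.float() - zero_point)

x = torch.randn(4, 8)
q, s, z = linear_quantize_per_tensor(x)
x_hat = linear_dequantize(q, s, z)
print((x - x_hat).abs().max())  # quantization error is small but nonzero
```

Per-channel and per-group quantization follow the same idea, but compute one scale (and zero-point) per output channel or per group of weights instead of one for the whole tensor.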
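Similarly, a rough sketch of what the 8-bit model quantizer could look like: it recursively replaces every nn.Linear with a module that stores int8 weights and a per-output-channel scale. The class name W8A16Linear and the helper function are hypothetical names used only for illustration, not the course's implementation.

```python
import torch
import torch.nn as nn

class W8A16Linear(nn.Module):
    """Illustrative linear layer with int8 weights and fp activations."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        # Symmetric per-output-channel scale: map max |w| of each row to 127.
        scale = w.abs().max(dim=-1, keepdim=True).values / 127.0
        self.register_buffer("int8_weight", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize on the fly and run a regular float matmul.
        w = self.int8_weight.float() * self.scale
        return nn.functional.linear(x, w.to(x.dtype), self.bias)

def replace_linear_with_quantized(module: nn.Module):
    # Recursively swap every nn.Linear for the 8-bit version.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, W8A16Linear(child))
        else:
            replace_linear_with_quantized(child)
```

Because the replacement only depends on the model containing nn.Linear layers, the same routine works for text, vision, audio, or multimodal models.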
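Finally, for weight packing, a small sketch of how four 2-bit values can be packed into a single 8-bit integer and unpacked again. It assumes the weights are already quantized to the range 0-3 and that their count is a multiple of 4; the function names are made up for this example.

```python
import torch

def pack_2bit(weights: torch.Tensor) -> torch.Tensor:
    # Pack four 2-bit values (0..3) into each uint8 element.
    assert weights.numel() % 4 == 0
    w = weights.to(torch.uint8).reshape(-1, 4)
    return w[:, 0] | (w[:, 1] << 2) | (w[:, 2] << 4) | (w[:, 3] << 6)

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    # Extract the four 2-bit values from each packed byte, in order.
    return torch.stack([(packed >> (2 * i)) & 0b11 for i in range(4)], dim=-1).reshape(-1)

w = torch.tensor([1, 0, 3, 2, 2, 3, 0, 1], dtype=torch.uint8)
packed = pack_2bit(w)                      # 2 bytes instead of 8
assert torch.equal(unpack_2bit(packed), w)
```

This is only a storage trick: before a matmul, the weights still need to be unpacked and dequantized back to a compute-friendly precision.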