In this lesson, you will learn about the common data types used to store the parameters of machine-learning models. This lesson is essential to better understand quantization, since quantization is achieved by converting numerical values to a different data type. Let's get started.

Let's start with the integer data type. An unsigned integer data type is used to represent a positive integer. The range of an n-bit unsigned integer is 0 to 2 to the power of n, minus 1. For example, the minimum value of an 8-bit unsigned integer is 0, and the maximum value is 255. The computer allocates a sequence of 8 bits to store the 8-bit integer. For an unsigned integer, the decoding process is as follows. If a bit is equal to 0, its value is 0. If a bit is equal to 1, the decoded value is a power of 2: for the first bit, it is 2 to the power of 0; for the second one, it is 2 to the power of 1; and so on. In this example, the bit in the first position of the sequence is equal to 1, so the decoded value is 2 to the power of 0. The second and third bits are both equal to 0, so they contribute nothing. The bit in the fourth position is a 1, so the decoded value is 2 to the power of 3, and so on. At the end, we add those values, so we have 128 plus 8 plus 1, which equals 137.

The signed integer data type is used to represent a negative or positive integer. There are multiple representations, but the one we will look into is two's complement, since it is the most common one. The range goes from minus 2 to the power of (n minus 1) up to 2 to the power of (n minus 1), minus 1. So for an 8-bit signed integer, the minimum value is minus 128 and the maximum value is 127. The difference with the unsigned integer is that the bit in the last position has a negative value. So if we decode the same sequence as earlier, we need to put a minus in front of the 128, and the result is minus 128 plus 8 plus 1, which equals minus 119.

This way of processing the sequence raises a question: does addition between two signed integers still work? Let's have a look at a quick example with 4 bits to convince ourselves. The first sequence represents 2, and the second represents minus 2, so adding them should give 0. The first bits are 0 and 0, which gives 0. Next we have 1 and 1, which gives 0 with a 1 carried to the left. Then we have 1, 0, and the carried 1, which again gives 0 with a 1 carried to the left. The same happens in the last position: 1, 0, and the carried 1 give 0, with a final 1 carried out. However, since we only store 4 bits in total, we don't keep that last carried bit, and we end up with 0, as expected.

Creating data with integer data types is very easy in PyTorch. You just need to set the correct torch.dtype. As you can see in this table, to create an 8-bit signed integer you just need to pass torch.int8. For an 8-bit unsigned integer, you just need to pass torch.uint8; the u stands for unsigned. In PyTorch, you can also create 16-bit, 32-bit, and 64-bit signed integers. For example, let's check the information about the 8-bit unsigned integer. To do that, we will use torch.iinfo, and we need to pass the torch.dtype we want to check. For this classroom, the libraries have already been installed for you. If you are running this on your own machine, you can install the torch library by running the following: pip install torch.
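To make the decoding concrete, here is a minimal Python sketch (not part of the original notebook; the bit list and variable names are just for illustration) that reproduces the arithmetic from the 8-bit example above:

```python
# The 8-bit sequence from the example, written least-significant bit first:
# bit 0 = 1, bit 3 = 1, bit 7 = 1, everything else 0.
bits = [1, 0, 0, 1, 0, 0, 0, 1]

# Unsigned decoding: each bit i contributes bit * 2**i.
unsigned_value = sum(bit * 2**i for i, bit in enumerate(bits))

# Two's complement decoding: the last (most significant) bit has a negative weight.
signed_value = (sum(bit * 2**i for i, bit in enumerate(bits[:-1]))
                - bits[-1] * 2**(len(bits) - 1))

print(unsigned_value)  # 137  =  128 + 8 + 1
print(signed_value)    # -119 = -128 + 8 + 1
```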
Since the libraries have already been installed in this classroom, we won't be running the pip install command; I'll just comment it out, and we can type the command we talked about earlier. So, torch.iinfo, and we will pass, for example, torch.uint8 to check the 8-bit unsigned torch.dtype. As you can see, we get that the minimum value is 0 and the maximum value is 255, which matches what we saw earlier. Let's do the same for the 8-bit signed integer. The command is the same, but this time we pass torch.int8, and we get that the minimum is minus 128 and the maximum is 127, just as expected. Great! Let's move on.

Now is a good time to pause the video and try other data types on your own. For example, you can try torch.int64, torch.int32, and torch.int16, and check whether the results you get match what we saw in the theory part.

Now let's move on to floating point representation. There are three components in a floating point representation. We have the sign: only one bit is needed, since a number can be either positive or negative. We have the exponent, which determines the range of the number, that is, how big in magnitude it can be in both the positive and negative directions. Lastly, we have the fraction, which determines the precision of the number. By precision, I mean whether you can represent a number as 0.4999 or only as 0.5. Floating point 32, bfloat16, floating point 16, and floating point 8 are all floating point data types with a specific number of bits for the exponent and the fraction.

Let's have a look at floating point 32. Floating point 32, or FP32 for short, is composed of one bit for the sign, 8 bits for the exponent, and 23 bits for the fraction. If you add them up, you end up with 32 bits. Here is the range of floating point 32 for positive values: we can represent a number as small as about 10 to the power of minus 45, and as big as about 3.4 times 10 to the power of 38. For negative values, it is the same range but with a minus in front of each value, so the minimum value is minus 3.4 times 10 to the power of 38. For floating points, we have two formulas to decode the sequence: one for very small values, which are called subnormal values, and one for very big values, called normal values. Don't worry too much about these formulas; the point here is to see how big and how small a number you can store using floating point 32, and how this differs for other data types. This data type is very important in machine learning, since most models store their weights in floating point 32.

For floating point 16, we only have 5 bits for the exponent and 10 bits for the fraction. So the smallest positive value you can represent is around 10 to the power of minus 8, and the biggest is around 6.5 times 10 to the power of 4. Compared to floating point 16, bfloat16 allocates 8 bits for the exponent and 7 bits for the fraction. As you can see in the range, you can represent very small values and very big values; however, the downside is the precision, which is worse than that of floating point 16.

To sum up, FP32 has the best precision, and its range is also very big. FP16 has better precision than bfloat16, but its range is smaller. Lastly, the bfloat16 range is close to the range of floating point 32, but it has worse precision. The nice thing about floating point 16 and bfloat16 is that they take up half of the space of floating point 32. Let's see how you can use them in PyTorch.
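Before we move on to the floating point types in PyTorch, here is a small sketch that writes out the torch.iinfo checks from the integer part as code, including the dtypes suggested in the exercise (the ranges in the comments follow directly from the formulas we saw):

```python
import torch

# torch.iinfo reports the range of an integer torch.dtype
print(torch.iinfo(torch.uint8))   # min = 0, max = 255
print(torch.iinfo(torch.int8))    # min = -128, max = 127

# the dtypes from the exercise: range is -2**(n-1) to 2**(n-1) - 1
print(torch.iinfo(torch.int16))   # min = -32768, max = 32767
print(torch.iinfo(torch.int32))   # min = -2147483648, max = 2147483647
print(torch.iinfo(torch.int64))   # min = -9223372036854775808, max = 9223372036854775807
```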
Here's the table with the floating point data types in PyTorch. For example, as you can see, to create a 16-bit floating point, you need to set the torch.dtype to torch.float16. To create a 16-bit brain floating point, you just need to set the torch.dtype to torch.bfloat16. In PyTorch, you can also create a 32-bit floating point and a 64-bit floating point.

Now, let's see what happens when you convert a Python value to a PyTorch tensor with a specific data type. First, we create the value 1/3. In Python, this value is stored in floating point 64. So, if we create a tensor with the torch.dtype equal to torch.float64, we shouldn't see a difference. Let's do that. We first create the value 1/3, then let's check it. As you can see, the value is not stored exactly as we wrote it: it is converted to floating point 64, and what is stored inside the computer is an approximation. Now, let's create a tensor with a dtype equal to floating point 64. To do that, we call torch.tensor, we pass the value, and we specify the dtype argument to be torch.float64. Let's have a look at the value. As you can see, we get the same result.

Now, let's do the same for other data types: floating point 32, floating point 16, and bfloat16, and let's check the results. To create the floating point 32 tensor, we just need to change the dtype to torch.float32, and we do the same for floating point 16 and bfloat16. Then, let's print all the results together so that we have a good comparison. From these results, we can make the following observation: the fewer bits we have, the less precise the approximation will be. And for bfloat16, as we said before, the precision is worse than floating point 16; this is why, as you can see here, its approximation is worse than the floating point 16 one. But bfloat16 has a bigger range than floating point 16.

You can check this information directly from PyTorch using the function torch.finfo. Let's do that for bfloat16, for example. Let's compare this information with floating point 32; we just need to change the torch.dtype. As you can see from these results, the minimum value and the maximum value are quite close, but the resolution of floating point 32 is much smaller than that of bfloat16. Now is a good time to pause the video and try it yourself by changing the torch.dtype to torch.float16 or torch.float64.

Now that we know how integers and floating points work, we can have a look at downcasting. Downcasting happens when we convert a higher data type to a lower data type. The value will be converted to the nearest value in the lower data type. A floating point 32 value such as 0.1, downcast to an 8-bit integer, will be converted to 0, so we have a loss of data. Let's check the impact on matrix multiplication. First, let's create a random tensor with torch.dtype float32 of size 1000. To create the random tensor, we will use the "rand" function from torch: torch.rand. The first argument is the size of the tensor, so we will put 1000, and we need to specify the dtype of that tensor, which will be torch.float32. And that's it. Let's have a look at the tensor we just created. Since the tensor is very big, we will just look at the first five elements, and you can see that we indeed have random values in torch.float32. Now, let's downcast this tensor to bfloat16.
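Here is a sketch that gathers the comparison we just walked through into one code cell (the exact digits you see depend on PyTorch's print settings, and the finfo numbers in the comments are approximate):

```python
import torch

# the Python value 1/3 is a 64-bit float under the hood
value = 1/3

tensor_fp64 = torch.tensor(value, dtype=torch.float64)
tensor_fp32 = torch.tensor(value, dtype=torch.float32)
tensor_fp16 = torch.tensor(value, dtype=torch.float16)
tensor_bf16 = torch.tensor(value, dtype=torch.bfloat16)

# print many decimal places so the approximation error is visible;
# the fewer fraction bits, the coarser the approximation of 1/3
for name, t in [("fp64", tensor_fp64), ("fp32", tensor_fp32),
                ("fp16", tensor_fp16), ("bf16", tensor_bf16)]:
    print(f"{name}: {t.item():.20f}")

# numerical limits of two dtypes: similar max, very different resolution
print(torch.finfo(torch.bfloat16))  # max ~3.39e38, resolution ~1e-2
print(torch.finfo(torch.float32))   # max ~3.40e38, resolution ~1e-6
```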
To downcast the tensor, we will use the ".to" method, and we just need to specify the dtype to be torch.bfloat16. Now, let's have a look at the first five elements. As you can see, we managed to downcast our tensor: the dtype is now torch.bfloat16, and since we downcasted the tensor, we do not have exactly the same values as the original one, but they are very close.

Now, let's check the impact of downcasting on multiplication. First, let's do the multiplication with the original tensor in floating point 32. To do that, we will use the dot function from torch: the float32 result equals torch.dot of the two tensors in floating point 32. Let's check the result. This is the value we get, but the result you get will be different, since we initialized random tensors. Now, let's do the multiplication on the bfloat16 tensors. As you can see, the result is quite close to the original one, but we still have a loss of precision.

The advantages of downcasting are a reduced memory footprint: we make more efficient use of GPU memory, which enables the training of larger models and the use of larger batch sizes. We also get increased compute speed: computation in a lower precision, for example floating point 16 or bfloat16, can be faster than in floating point 32 since it requires less memory, although this also depends on the hardware, for example whether you are using a Google TPU or an NVIDIA A100. The disadvantage is that it is less precise: we are using fewer bits, hence the computation is less precise.

One of the use cases of downcasting is mixed precision training. We do the computation in a smaller precision, for example floating point 16 or bfloat16, but we store and update the weights in a higher precision, usually floating point 32. Now, let's move on to the next lesson, where Younes will show you how to load models with different data types.
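For reference, here is a minimal sketch that puts together the downcasting experiment from this lesson (the tensor values and both dot products will differ on every run, since the tensors are random):

```python
import torch

# random float32 tensor of size 1000, as created above
tensor_fp32 = torch.rand(1000, dtype=torch.float32)

# downcast to bfloat16 with the .to method
tensor_bf16 = tensor_fp32.to(torch.bfloat16)
print(tensor_fp32[:5])
print(tensor_bf16[:5])

# dot product in both precisions: the bfloat16 result is close to the
# float32 one, but with some loss of precision
m_fp32 = torch.dot(tensor_fp32, tensor_fp32)
m_bf16 = torch.dot(tensor_bf16, tensor_bf16)
print(m_fp32)
print(m_bf16)
```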