In this lesson, you'll learn how large language models work and how they understand text. Then, you will learn how to combine LLMs and multimodal models into vision language models using a process called visual instruction tuning. And finally, you'll use all these models in practice. All right, let's go.

Current LLMs are all generative pre-trained transformers, for example Llama 2, ChatGPT, or Mistral. This class of models is autoregressive because they generate one token, or one word piece, at a time. Each new token depends only on previously provided or generated tokens. These models have been trained in an unsupervised manner by predicting the next word over trillions of tokens. In this training process, they output a probability over all possible next tokens, and we train them to get this right.

Let's take a look at an example: "Jack and Jill went up the ___". The model outputs a score for every token. More probable tokens like "mountain" and "hill" will have higher scores, while less probable tokens like "apple" and "llama" will have lower scores. You can think of these scores as unnormalized probabilities. These scores are then normalized into percentages, like this.

Given the prompt "the rock", we want to see how the model completes it. We use one-hot vectors to represent each word and then look up the embedding for each token. Once we get the embedding for a token, the transformer model tries to generate the next word by paying attention to the first word. It always starts off with the beginning-of-sentence token, and because we know the first two words are "the rock", we can force the model to output them; in fact, this is actually how the model is trained. Once the model outputs "rock", we feed this word's representation back in as a one-hot vector. Now the model looks up the embedding for "rock", pays attention to the previous words in the same sentence, and outputs a probability over the next possible tokens that we can sample from. Here we can see that we sampled the word "rolls". We pass this forward and generate the next word, which is "along". This can keep going until we hit the token limit or the end-of-sentence token, and this completes the generation. Because the output is probabilistic, each time we generate we could get different results. So let's look at another possible completion. Let's say we get the word "skips", we pass this word forward, and the next word that we get is "fast". If we keep sampling, we can get a completely different generated response (there's a short code sketch of this sampling loop below).

Here's a simple example of an image classification model, which takes in an image and outputs a class label. A vision transformer works quite well for this classification task. It processes an image as patches instead of individual pixels, which makes the analysis much more efficient. So let's look at this in more detail. Each patch of the image gets vectorized and passed into the transformer model. The transformer can choose to pay attention to any patch and is optimized to output the correct label.

Now let's see how you can use visual instruction tuning to train an LLM to process images along with text, given an image and a text instruction. In this case, the image is The Starry Night, and the question is "Who drew this painting?" You can train the model to output the correct answer in text, which is, of course, Vincent van Gogh. Now let's look at an example of visual instruction tuning.
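Before we walk through the visual instruction tuning example, here is a minimal, self-contained Python sketch of the next-token sampling loop described above. The toy vocabulary, the hand-picked scores, and the function names are illustrative assumptions, not a real model: an actual LLM computes these scores with a transformer that attends to the full context.

```python
import numpy as np

# Toy vocabulary; index i is the score position for VOCAB[i].
VOCAB = ["hill", "mountain", "stairs", "apple", "llama", "<eos>"]

def toy_next_token_logits(tokens):
    # Purely illustrative, hand-picked scores (logits).
    if tokens[-1] == "the":
        # After "...went up the", nouns like "hill" get the highest scores.
        return np.array([4.0, 2.5, 1.0, -2.0, -3.0, -5.0])
    # Otherwise, strongly favor ending the sentence.
    return np.array([0.0, 0.0, 0.0, -2.0, -3.0, 5.0])

def softmax(logits):
    # Normalize raw scores into probabilities that sum to 1.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_next_token_logits(tokens))
        next_token = rng.choice(VOCAB, p=probs)  # sampling: runs can differ
        if next_token == "<eos>":                # stop at end-of-sentence token
            break
        tokens.append(str(next_token))
    return " ".join(tokens)

print(generate(["Jack", "and", "Jill", "went", "up", "the"]))
```

Because the next token is sampled from the probabilities rather than always taking the highest-scoring one, running this with different seeds can give different completions, which is exactly the behavior described above.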
You are going to start off with your image, The Starry Night painting, and then you're going to cut it up into patches. You also have a text instruction: "Who drew this painting?" So you take the patches of the image and embed them into vectors, as seen here, and you take the tokens from the instruction sentence and embed them into vectors as well. Now, the language model is trained to understand and pay attention to both the image patch tokens and the language tokens, and it has to output the correct tokens for the answer: Vincent van Gogh. This is known as visual instruction tuning because you are given a visual input as well as an instruction, and you know what the right answer is. You optimize the probability that the model generates the right output tokens, and in the process it learns to understand images. After an LLM is trained using visual instruction tuning, it can process images as well as text. You can now think of the model as a large multimodal model, an LMM. You can also ask questions about objects in the images, for example, ask for a detailed, structured description of the image content, like this: "Describe this picture for me."

Let's now see all of this in practice. In this lab you use images and text as input, then you get LMMs to reason over them. All right, let's code.

All right. So let's start by adding a command that will ignore all the unnecessary warnings. Now, let's do a bit of setup. In this lesson we'll be using Gemini Pro Vision, so we need to load our API keys, and, as part of this, we have the genai library where we need to pass in the key. So let's load some helper functions. First, we need a function that will take a piece of text and turn it into readable Markdown. Next, let's construct a function that allows us to call the LMM. That function will take an image path and a prompt. The first thing it needs to do is load the image. The next thing it needs to do is call the generative model for Gemini Pro Vision with the prompt and the loaded image. And then, finally, it needs to return the result, and that's where we're going to use the to_markdown function, which takes the text from the response and turns it into something nice and readable (a rough sketch of these helpers is shown a bit further below).

So now we can start analyzing some images, and we are going to start with this beautiful index historical chart. Let's call our LMM function with the file and a prompt that says "Explain what you see in this image," and try to get it to analyze this chart. Usually that takes a couple of seconds. And now we see a nice description which basically says that the image shows a historical chart of the S&P 500 and gives us a pretty nice analysis of what we really see here. This could be quite helpful.

Now let's try to analyze something harder. We're going to use a graphic that we used in our slides and ask the LMM to give us a hand in explaining what this figure is actually used for. So this is the result. Here you can see that the model recognized that this was an image used for a contrastive pre-training framework, which is actually pretty accurate, and then it explains the different types of encoders for text and images, and so on. Maybe I should have used it for preparing this lesson.

So here's a fun example that I want to go over. If you look at this, it's just a green blob; there's nothing special about it.
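Here is the rough sketch of the setup and helpers mentioned above. It assumes the google-generativeai Python SDK and Pillow; the function names (call_LMM, to_markdown), the environment variable, and the example file name are my assumptions rather than the exact notebook code.

```python
import os
import textwrap

import PIL.Image
import google.generativeai as genai
from IPython.display import Markdown

# Assumed setup: the API key is stored in an environment variable.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def to_markdown(text):
    # Turn the raw response text into notebook-friendly Markdown.
    text = text.replace("•", "  *")
    return Markdown(textwrap.indent(text, "> ", predicate=lambda _: True))

def call_LMM(image_path, prompt):
    # Load the image, send it to Gemini Pro Vision together with the
    # text prompt, and return the answer rendered as Markdown.
    img = PIL.Image.open(image_path)
    model = genai.GenerativeModel("gemini-pro-vision")
    response = model.generate_content([prompt, img])
    return to_markdown(response.text)

# Example call (hypothetical file name), matching the chart analysis:
# call_LMM("sp500_historical_chart.png", "Explain what you see in this image.")
```

The same helper is reused for every example in the lab: only the image path and the prompt change.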
So, coming back to that green blob, I'm kind of curious what the LMM can come up with when it's trying to analyze it. Let's ask the LMM whether it can see something special about this image and execute it. We can see that the LMM recognized that there is a hidden message, which says you can vectorize the whole world with Weaviate. So let's run the function that will show us where this message was hidden. What we are actually doing here is looking for any pixel whose value in the first channel is over 120. By running this, we can see the message that was really hidden there: anything over 120 became white and anything under became black, and that's how the LMM was able to decode the message and tell us about it. So LMMs actually don't see the way we see. They can see a lot more and be more inquisitive about some of the images they're looking at, and that's how they can decode this kind of stuff. I will include a function in the resources for this notebook, so if you want to create an image with a hidden message like this one, you'll be able to do it just like that and send it to your friends (a rough sketch of both the reveal and hide steps follows below).

All right. So in this lesson, you learned how to use vision language models and how to analyze images together with text prompts. In the next lesson, you will build a multimodal RAG app. See you there.
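For reference, here is a rough sketch of the reveal and hide steps mentioned above, assuming Pillow and NumPy. The function names, the threshold default, and the file names are illustrative assumptions, not the notebook's exact resources.

```python
import numpy as np
from PIL import Image, ImageDraw

def reveal_message(image_path, threshold=120):
    # Look only at the first (red) channel: pixels above the threshold
    # become white, everything else black, which exposes the hidden text.
    arr = np.array(Image.open(image_path).convert("RGB"))
    mask = arr[:, :, 0] > threshold
    return Image.fromarray(np.where(mask, 255, 0).astype(np.uint8))

def hide_message(text, path="hidden_message.png", size=(600, 200)):
    # Draw the text with a red value just above the threshold on a green
    # background whose red value is just below it, so the message is nearly
    # invisible to the eye but trivial to recover by thresholding.
    img = Image.new("RGB", size, color=(100, 180, 100))
    ImageDraw.Draw(img).text((20, 80), text, fill=(140, 180, 100))
    img.save(path)
    return path

# Example (hypothetical file name):
# hide_message("You can vectorize the whole world with Weaviate")
# reveal_message("hidden_message.png").show()
```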