Our goal in this course is to build a question answering system. We can actually do that already with just embeddings, but it turns out we can build an even better question answering system if we use the text generation capabilities of large language models. So, let's see how that works. We'll start off by importing our credentials and authenticating so that we can use the Vertex AI service. Then we'll need to set our region, import Vertex AI, and initialize the SDK. Once we've done the required setup, we can get started.

So in this example, we'll start off by loading in the text generation model from Vertex. Before, we were loading an embeddings model; this time we're loading a text generation model, and it has a different name. This model is called text-bison, instead of the embeddings gecko model we were looking at earlier. The text-bison model is fine-tuned for a variety of natural language tasks like sentiment analysis, classification, summarization, and extraction. But note that this model is ideal for tasks that are completed with a single API response, so not for continuous conversations. If you do have a use case that requires back-and-forth interactions, there's a separate model for that called chat-bison that you can check out instead.

Now, when we talk about text generation in the context of large language models, these models take as input some text and produce some output text that's likely to follow. The input text we provide is called a prompt. So, let's start off with an open-ended, generative brainstorming task. We'll define our prompt as: "I'm a high school student. Recommend me a programming activity to improve my skills." This, again, is the input that we will pass to our model. Once we've defined our prompt, we can print out the response from our model. We will call predict, pass in the prompt, and then print out the results. So, let's see what the model produces. The model is suggesting that we write a program to solve a problem we're interested in. Definitely good advice. It also suggests taking a programming course, along with a couple of other ideas.

Now, this is a pretty open-ended response, and it might be useful for brainstorming, but it's pretty variable. With large language models, we can get them to take on different behaviors by writing strategic input text. So, if we wanted a more restrictive answer, we could take this open-ended generative task and turn it into a classification task to reduce the output variability. Let's see what that might look like. Here is a rephrasing of that same prompt; it's just a little different. Now we're saying: "I'm a high school student. Which of these activities do you suggest and why?" So, instead of keeping it completely open-ended, we provide a few different options: A, learn Python; B, learn JavaScript; or C, learn Fortran. Let's see what happens when we pass this prompt to the model. Again, we'll print out the results of calling the predict function on our text generation model, passing in this prompt text. This time the model is suggesting that we should learn Python. Again, it's a pretty reasonable answer to our question, but it's just a little bit more restrictive based on our particular prompt. This sort of art and science of figuring out the best prompt for your use case is called prompt engineering, and there are a lot of different tips and best practices.
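As a quick reference, here is a minimal sketch of the setup and the two predict calls just described, using the Vertex AI Python SDK. The project ID and region are placeholders (the course notebook handles authentication with its own helper), so treat this as an outline rather than the exact notebook code.

```python
import vertexai
from vertexai.language_models import TextGenerationModel

# Placeholders: the course notebook sets these up via its own authentication helper.
PROJECT_ID = "your-project-id"
REGION = "us-central1"
vertexai.init(project=PROJECT_ID, location=REGION)

# Load the text generation model (text-bison) rather than the embeddings model.
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

# Open-ended brainstorming prompt.
prompt = ("I'm a high school student. "
          "Recommend me a programming activity to improve my skills.")
print(generation_model.predict(prompt=prompt).text)

# The same question reframed as a classification task to reduce output variability.
prompt = """I'm a high school student. Which of these activities do you suggest and why:
a) learn Python
b) learn JavaScript
c) learn Fortran"""
print(generation_model.predict(prompt=prompt).text)
```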
So if you're curious, you can definitely check out some of the other deeplearning.ai courses on prompt engineering. But for now, we're just going to try out one other task with this large language model. Something I think is really interesting about large language models is that we can use them to extract information; in other words, take data that's in one format and reformat it into another format. So here, we've got this long chunk of text: a synopsis for an imaginary movie about a wildlife biologist. In this imaginary movie synopsis, we've got the names of the characters, their different jobs, and also the actors who played them. What we're going to do is instruct the model to extract all three of those fields: the characters, their jobs, and the actors who played them. This long piece of text is the prompt that we'll pass to the model. So, let's try it and see what the response is. Here you can see that the model did in fact extract all of the characters in the synopsis, as well as their jobs and the actors who played them. And if we wanted to get this in a different format, we could even say something like, "Extract this information from the above message as a table," and see what happens. We actually get some markdown. So, we can test this and see if it is in fact valid markdown. Let's turn it into a markdown cell. And there we go: we get a nice table. So, we can use large language models to extract information and convert data from one format to another.

Let's review the key syntax that we just ran through. We first imported the text generation model, and then we selected the specific model we wanted to use, which in this case was text-bison, the 001 version. Then we defined a prompt, which is the input text, and we called predict and passed in this prompt.

Now, in addition to adjusting the words and the word order of our prompt, there are some additional hyperparameters that we can set in order to get the model to produce different results. Earlier, I said that these models take as input some text and produce some output text that's likely to follow, but we can actually be a little more precise in this definition. Really, these models take as input some text, maybe "the garden was full of beautiful", and they produce as output an array of probabilities over tokens that could come next. Andrew talked a little bit about tokens in a previous lesson, but tokens are essentially the basic units of text processed by a large language model. Depending on the tokenization method, these might be words, subwords, or other fragments of text. Now, I'm saying tokens, but in the slides here, I'm just going to show individual words, and that's only to make the concepts a little easier to understand. Again, these models return an array of probabilities over tokens that could come next, and from this array, we need to decide which one to choose. This is known as a decoding strategy. A simple strategy might be to select the token with the highest probability at each time step. This is known as greedy decoding, but it can result in some uninteresting and sometimes even repetitive answers. On the flip side, if we were to just randomly sample over the distribution, we might end up with some unusual tokens or some unusual responses.
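To make those two decoding strategies concrete, here is a toy sketch in Python. The token list and probabilities are made-up illustrative values loosely matching the slide, not anything the model actually returns.

```python
import numpy as np

# Made-up next-token distribution for "the garden was full of beautiful ..."
tokens = ["flowers", "trees", "herbs", "people", "cars"]
probs = np.array([0.50, 0.23, 0.05, 0.02, 0.01])
probs = probs / probs.sum()  # renormalize the illustrative values

# Greedy decoding: always pick the single most probable token.
print("greedy:", tokens[int(np.argmax(probs))])  # always "flowers"

# Pure random sampling: draw from the full distribution, so rare tokens
# like "cars" occasionally get picked, which can give unusual responses.
rng = np.random.default_rng(seed=0)
print("sampled:", rng.choice(tokens, p=probs))
```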
And by controlling this degree of randomness, we can control how unusual or how rare the words are that get put together in our response. One of the parameters we can set in order to control this randomness is called temperature. Lower temperature values are better for use cases that require more deterministic or less open-ended responses. So, if you're doing something like classification, or an extraction task like the imaginary movie synopsis we just looked at, you might want to start with a lower temperature value. On the other hand, higher temperature values are better for more open-ended use cases, maybe something like brainstorming, or even summarization, where you might want more unusual responses or unusual words.

Typically, with neural networks, we have some raw output called logits, and we pass these logits to the softmax function in order to get a probability distribution over classes. You can think of the different classes in this case as just being the different tokens that we might return to the user. So, when we apply temperature, we take our softmax function and divide each of the logit values by our temperature value. On the slide here, you can first see our softmax function, and then below that, the softmax function with temperature applied, where each of our logit values z is divided by theta. That's how we actually apply temperature. Now, if this didn't make a whole lot of sense to you, don't worry about it; the actual mechanics here aren't that important. What's really more important is that you get an intuitive understanding of how temperature works. One way to think about temperature is that as you decrease the temperature value, you're increasing the likelihood of selecting the most probable token. If we take that to the extreme and set a temperature value of zero, it becomes deterministic: the most probable token will always be selected. On the other hand, you can think of increasing the temperature as flattening the probability distribution and increasing the likelihood of selecting less probable tokens. With the Vertex AI model we're looking at, you can set a temperature value between zero and one, and for most use cases, starting with a value like 0.2 can be a good starting place; you can adjust it from there.

So, let's see how we actually set this value in the notebook. Let's start off with a temperature value of zero. Again, this is deterministic, and it's going to select the most likely token at each time step. Here's a prompt. Let's say, "Complete the sentence: As I prepared the picture frame, I reached into my toolkit to fetch my," and we'll see what the model responds with. We will call the predict function on our generation model as we've done before, and we'll pass in the prompt, but this time, we're also going to pass in the temperature value. Then we can print out the response. The model says, "As I prepared the picture frame, I reached into my toolkit to fetch my hammer." That seems like a pretty reasonable response, probably the most likely thing someone would fetch from their toolkit in this particular example. And remember, a temperature of 0 is deterministic, so even if we run this again, we will get the exact same answer. So, let's try setting the temperature to 1 this time. Again, we can call the predict function on our model, and we will print out the result with this different temperature value.
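Before we look at the output, here is the temperature-scaled softmax from the slide written out in symbols. This is the standard formulation, with theta as the temperature applied to logits z_1 through z_n.

```latex
% Standard softmax over logits z_1, ..., z_n:
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}

% Softmax with temperature \theta: each logit is divided by \theta first.
% Small \theta sharpens the distribution toward the most probable token;
% large \theta flattens it, making less probable tokens more likely.
\mathrm{softmax}_{\theta}(z)_i = \frac{e^{z_i/\theta}}{\sum_{j=1}^{n} e^{z_j/\theta}}
```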
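And here is a sketch of the two notebook calls just described, assuming the generation_model object loaded earlier; the temperature keyword is the SDK parameter being discussed.

```python
prompt = ("Complete the sentence: As I prepared the picture frame, "
          "I reached into my toolkit to fetch my")

# temperature=0.0 is deterministic (greedy decoding): rerunning gives the same answer.
print(generation_model.predict(prompt=prompt, temperature=0.0).text)

# temperature=1.0 flattens the distribution, so reruns can give different answers.
print(generation_model.predict(prompt=prompt, temperature=1.0).text)
```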
And this time, we reached into the toolkit to fetch my saw. When I ran this earlier, I got sandpaper, which I thought was a pretty interesting response. The model also produced some additional information here as well. So, you can try this out, and you'll get a different response if you run it again. I encourage you to try out some different temperature values and see how that changes the responses from the model.

Now, in addition to temperature, there are two other hyperparameters that you can set to influence the randomness of the model's output. Let's return to our example from earlier, where we had the input sentence "the garden was full of beautiful" and this probability array over tokens. One strategy for selecting the next token is called top-k, where you sample from a shortlist of the top k tokens. In this case, if we set k to two, that's the two most probable tokens: flowers and trees. Now, top-k can work fairly well for examples where you have several words that are all fairly likely, but it can produce some interesting, or sometimes not particularly great, results when you have a probability distribution that's very skewed; in other words, when you have one word that's very likely and a bunch of other words that are not very likely. That's because the top-k value is hard-coded to a fixed number of tokens, so it doesn't dynamically adapt to the shape of the distribution. To address this limitation, another strategy is top-p, where we can dynamically set the number of tokens to sample from. In this case, we sample from the minimum set of tokens whose cumulative probability is greater than or equal to p. So, if we set p to 0.75, we just add the probabilities starting from the most probable token: that's flowers at 0.5, then we add 0.23, then 0.05, and now we've hit the threshold of 0.75. So, we would sample from these three tokens alone. You don't need to set all of these different values, but if you were to set all of them, this is how they work together: first, the tokens are filtered by top-k; from those top-k tokens, they're further filtered by top-p; and finally, the output token is selected using temperature sampling. That's how we arrive at the final output token.

So, let's jump into the notebook and try setting some of these values. First, we'll start off by setting a top-p value of 0.2. Note that by default, the top-p value is set to 0.95, and this parameter can take values between 0 and 1. So, here is a fun prompt: let's ask for an advertisement about jackets that involves blue elephants and avocados, two of my favorite things. We can call the generation model's predict function again, and this time, we'll pass in the prompt, a temperature value, let's try something like 0.9, and then we'll also pass in top-p. And note that a temperature of zero results in a deterministic response; it's greedy decoding, so the most likely token will be selected at each time step. So, if you want to play around with top-p and top-k, just set the temperature value to something other than zero. We can print out the response here and see what we get. And here is an advertisement introducing this new blue elephant avocado jacket. So lastly, let's see what it looks like to set both top-p and top-k. Let's set top-k to 20. By default, top-k is set to 40, and this parameter takes values between 1 and 40. So, we'll halve that default value.
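Before we finish setting up this last example, here is a toy sketch of how top-k and top-p narrow the candidate list, using made-up probabilities like the ones on the slide (only the head of the distribution is shown, so the values don't sum to one). This illustrates the filtering idea, not the service's internals.

```python
import numpy as np

tokens = ["flowers", "trees", "herbs", "people", "cars"]
probs = np.array([0.50, 0.23, 0.05, 0.02, 0.01])  # illustrative values only

# Top-k: keep only the k most probable tokens.
k = 2
top_k_idx = np.argsort(probs)[::-1][:k]
print("top-k shortlist:", [tokens[i] for i in top_k_idx])  # ['flowers', 'trees']

# Top-p (nucleus): keep the smallest set whose cumulative probability >= p.
p = 0.75
order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])              # 0.50, 0.73, 0.78, ...
cutoff = int(np.searchsorted(cumulative, p)) + 1  # first index reaching p
top_p_idx = order[:cutoff]
print("top-p shortlist:", [tokens[i] for i in top_p_idx])  # ['flowers', 'trees', 'herbs']
```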
And then, we'll also set top-p, so we can set all three of the parameters we just learned about. We'll use the exact same prompt as before; we'll just keep it as "Write an advertisement for jackets that involves blue elephants and avocados." This time, when we call the predict function on our generation model, we'll pass in the prompt, the temperature value, the top-k value, and the top-p value. And just as a reminder, this means that the output tokens will first be filtered by the top-k tokens, then further filtered by top-p, and lastly, the output token will be selected with temperature sampling. So here, we've got a response, and we can see that it's a little different from the one we saw earlier. I encourage you to try out some different values for top-p, top-k, and temperature, and also to try out some different prompts and see what kinds of interesting responses, use cases, or behaviors you can get these large language models to take on.

So, just as a quick recap of the syntax we learned: we imported the text generation model, we loaded the text-bison model, and we defined a prompt, which is the input text to our model. And when we call predict, in addition to passing in a prompt, we can also pass in values for temperature, top-k, and top-p. Now that you know a little bit about how to use these models for text generation, I encourage you to jump into the notebook, try out some different temperature, top-p, and top-k values, and experiment with some different prompts. And when you're ready, we'll take what you've learned about text generation and combine it with embeddings in the next lesson to build a question answering system.
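For reference, here is that recap as a single sketched call, assuming the generation_model loaded earlier. The temperature, top_k, and top_p keywords are the SDK parameters covered in this lesson; the specific values below are just example settings to play with.

```python
prompt = "Write an advertisement for jackets that involves blue elephants and avocados."

response = generation_model.predict(
    prompt=prompt,
    temperature=0.9,  # must be above zero for top_k / top_p to matter
    top_k=20,         # first filter to the 20 most probable tokens (default is 40)
    top_p=0.7,        # then keep the smallest set with cumulative probability >= 0.7
)
print(response.text)
```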