You'll explore some of the options for prompting Llama 2 models. One of the things that makes prompting Llama 2 models unique is how you format the input prompt before you send it to the model. You'll apply the recommended formatting method, using something called instruction tags, and you'll get to ask Llama 2 to help you write a birthday card for your friend. Let's get started.

For this course, we have created a helper function that can make an API call to any Llama model. You can treat the Llama model like your personal assistant and ask it to help you write a birthday card for a friend. Great. It's written a nice birthday card, and it also addresses my friend, Andrew.

But how did this actually happen? How did you access the model? The way this happened just now was through a hosted API service: the service fed your prompt into the Llama model, which output this birthday card. A second option, since Llama models are open for commercial use, is to host the model yourself in your own cloud environment, such as Amazon Web Services, Microsoft Azure, or Google Cloud. A third option, at least for a compressed version of a small Llama model, is to download the model and run it on your own personal computer. The point here is that since Llama 2 is open for commercial use, you have many options for how to access the Llama models. That said, using a hosted API service is recommended, in part because it's a much easier way to get started and to switch between multiple models. Quite a few companies host Llama models, including Amazon Bedrock, Anyscale, Google Cloud, Azure, and many more. In this course, you are using Together.ai to access the Llama models. Together.ai currently lets you access all the variations of the Llama 2 models, including the small, medium, and large sizes, as well as the Code Llama models.

Something I would like to draw your attention to is the recommended way to format your prompt when using a Llama model. The prompt is surrounded by instruction tags at the start and end of the prompt. These instruction tags use square brackets, and the end instruction tag also includes a forward slash. The helper function that you use was written to add these instruction tags to your prompt before it gets sent to the model.

Let's take a look at the code to see that more clearly. Our helper function has a parameter you can set that lets us see the actual formatted prompt before it gets sent to the model. Let me copy the prompt and the call to our function, set the parameter verbose equal to true, and run this to see the prompt. It outputs the prompt, and you can now see the start and end instruction tags surrounding the original prompt. It also prints the model that you just used.

Remember from earlier in the course that there are regular, non-chat Llama models as well as Llama chat models. For most use cases, we recommend using the Llama chat models instead of the base foundation models. Let's see what happens when we ask each model what the capital of France is. The helper function lets you explicitly choose which Llama model to use. By default, it uses the small 7-billion-parameter chat model, but let's set it explicitly here for clarity, and run this. Okay. You can see our prompt has instruction tags, and the model answers that the capital of France is Paris, which is good.
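To make the mechanics concrete, here is a minimal sketch of what such a helper might look like. This is an illustrative reconstruction, not the actual course code: the function name llama and the parameter names (model, temperature, max_tokens, add_inst, verbose) follow the transcript, but the endpoint, model string, and response schema are assumptions based on Together.ai's publicly documented completions API, so check their docs for the exact details.

```python
import os
import requests

def llama(prompt,
          model="togethercomputer/llama-2-7b-chat",  # assumed model string
          temperature=0.0,
          max_tokens=1024,
          add_inst=True,
          verbose=False):
    """Sketch of a helper that sends a prompt to a hosted Llama 2 model."""
    # Wrap the prompt in the recommended [INST] ... [/INST] instruction tags.
    if add_inst:
        prompt = f"[INST]{prompt}[/INST]"
    if verbose:
        print(f"Prompt:\n{prompt}\n")
        print(f"model: {model}")
    # Call a hosted inference endpoint (Together.ai's completions API is
    # assumed here; the exact URL and response schema may differ).
    resp = requests.post(
        "https://api.together.xyz/v1/completions",
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json={"model": model,
              "prompt": prompt,
              "temperature": temperature,
              "max_tokens": max_tokens},
    )
    return resp.json()["choices"][0]["text"]

# Example usage, mirroring the lesson:
prompt = "Help me write a birthday card for my dear friend Andrew."
response = llama(prompt, verbose=True)
print(response)
```

With verbose set to true, the helper prints the formatted prompt, so you can see the instruction tags being added before anything is sent to the model.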
Now, let's modify the model name. The new name is similar, just without "chat" in it: llama-2-7b. This time, the model didn't answer our question about the capital of France. Instead, it asked us similar questions about the capitals of other countries. Foundation models learn to predict the next word given the words that came before it, so when one sees "What is the capital of France?", a logical continuation is to ask similar questions about the capitals of other countries. Remember that the foundation model wasn't trained to understand instruction tags, so using them is not recommended with foundation models. If you are curious, you can pause the video here, set the add instruction variable to true, and see what happens. In summary, we recommend using the chat versions of the Llama 2 models, such as llama-2-7b-chat.

By default, our helper function for this course sets the temperature to zero. I'm going to write this in the prompt: "Help me write a birthday card for my friend Andrew. Here are details about my friend. He likes long walks on the beach and reading in the bookstore. His hobbies include researching papers and speaking at conferences. His favorite color is light blue and he likes pandas." Let's call the llama function, get the response, and print it. All right, you got a response, and it looks pretty good. Let's run it again, without making any changes, to see whether we get exactly the same response; with temperature zero, the outputs are highly likely to be identical. Okay, let's see. Yes, it is exactly the same: it ends with "I'm so lucky to have you in my life," and then "speaking of inspiration." That's great. So now we know how to get consistent output.

Next, let's talk about tokens. A token can be a whole word, but is usually part of a word. On average, one token is about three-quarters of a word, so 1,024 tokens is about 768 words. Let's decrease the model's output response length by setting max tokens to 20 and see what happens. Again, I'm going to copy the same prompt we had, remove the temperature parameter (by default, it is set to zero), and print the response. Notice how setting a smaller number of tokens doesn't make the model give its complete answer more succinctly; it just stops partway through its answer.

Llama 2 models, like any other large language model, have a limit on how many tokens they can take in as input as well as produce in their response. To see this, let's send a much longer prompt, asking the model to summarize an entire book, and print the response. Okay, instead of an answer, it returns an error message, which says that the input tokens plus the max new tokens (the number of output response tokens) must be less than or equal to 4,097 tokens. It further notes that there are 3,974 input tokens and 1,024 max new tokens. This means that if you have a really big input prompt, you can only get a smaller output response. Similarly, if you ask for a long output response, you need to be mindful of how long your input prompt is.

Let's see if we can stay within the 4,097-token limit so that we can still summarize that book. We can reduce max_tokens. In our helper function, max_tokens is set to 1,024 by default, but we can choose something else. Recall that the input prompt had 3,974 tokens. Let's calculate 4,097 minus 3,974: that leaves 123. And just a quick note: the error message refers to "max new tokens," but the parameter name in the helper function is simply max_tokens.
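To put the last two experiments into code, here is a short sketch, again using the hypothetical llama helper from earlier; the token counts are the ones reported by the error message above.

```python
# With a small max_tokens, the model stops partway through its answer
# rather than writing a shorter, complete one.
short_response = llama(prompt, max_tokens=20)
print(short_response)

# The context-window budget: input tokens + max_tokens must be <= 4097.
context_window = 4097
input_tokens = 3974                    # reported by the earlier error message
print(context_window - input_tokens)   # 123 tokens left for the response
```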
This time, we are going to add max_tokens to the call, set it to 123, and print the response. All right, this works: we get an output response instead of an error message. Notice that since we set max_tokens to 123, the output response is fairly short; it is limited to 123 tokens. Let's check what happens if we ask for an output longer than 123 tokens. Let's set this to 124 and run it. We got an error message, which is what we expected. Later in the course, you'll see a set of Llama models that can handle over 20 times the token length of these Llama models.

One more thing to notice: if we send a follow-up prompt such as "Can you rewrite it to include that?", we can see the LLM's answer doesn't have any memory of Andrew and his other hobbies and interests. It also doesn't remember that we asked it to write a birthday card. In the next lesson, you'll see how to handle this and give the model the proper context.

For now, try asking the model to help you with some other writing tasks. Maybe you could ask it to help you draft an email to customer service about a product you have a question about, or maybe you would like some help writing a speech for a friend's wedding.
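If you'd like a starting point for those practice tasks, here is one way you might phrase such a prompt with the same hypothetical helper; the product details below are made-up placeholders to replace with your own.

```python
# Practice task: drafting a customer-service email.
# All product details here are placeholders, not real data.
email_prompt = """
Help me draft a short, polite email to customer service.
I bought a pair of wireless headphones two weeks ago, and the left
earbud has stopped charging. Please ask whether a repair or
replacement is covered under the warranty.
"""
print(llama(email_prompt))
```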