In this lesson, we'll evaluate large language models. We'll dive into the details and take a look at individual generated samples. Let's go to the notebook. The notebook will explore three examples, starting with the simplest one. You'll call an LLM API, in this case OpenAI, and examine the results from the API using "wandb" tables. This will show you how to use tables for evaluation, analysis, and gaining insights from your experimentation. Moving on to the second example, you'll create a custom LLM chain and track it with a tool called Tracer. This will demonstrate the value of tracking and debugging more complex chains. Finally, you'll explore how integrations with LangChain allow you to build simple agents and track them through that same integration.

Alright, let's begin with the first example. In this case, you'll follow a straightforward workflow. First, you'll design your system prompt and the user prompt. After that, you'll call an OpenAI endpoint using the chat completion API. The API will respond, and then you'll parse the results and log them in a "wandb" table. Previously, you've been generating sprites. Now let's stay in this virtual game world and generate names for your game assets.

Now let's jump to the notebook. I'll start by running our imports and setting up the API key. Next, we'll define our project name and specify the model we're using, in this case GPT-3.5 Turbo. Next we'll log in so we can track our results, and now we'll initialize a new run to track generation. We have some helper code here. First we'll define "completion_with_backoff". This function retries requests so we avoid hitting rate limits. Then we'll define a function that takes a system prompt, a user prompt, and a "wandb" table. We'll use our "completion_with_backoff" function to collect responses. Additionally, we're tracking the start time and the elapsed time after each response. For each generated response, we're printing out the result. If you run more experiments, printing results out here in the notebook won't be very efficient. That's why we're logging all outputs to a table.

Here we define the system prompt. It's asking the language model to be a creative copywriter, generating names for game assets based on a category. Next, we create a table with all the columns that we want to track. Now let's start with Hero as the user prompt and see what names are generated by the model. Nice! That's awesome. This name, Unity's Valor, really does sound like a hero. Harmonic Champion, Unity's Chorus, Unity's Valor. I like these names. Next, let's do an item from the game. Let's set the user prompt to Jewel. Nice! Here are a few options: Harmony Gems, Laughter's Gem, Gleaming Unity. It seems like Unity is a big theme here. Now let's log this table and go look at it in "wandb".
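To make that helper code concrete, here's a minimal sketch of the flow we just walked through. Treat it as an approximation of the notebook, not the exact code: it assumes the tenacity library for retries and the pre-1.0 openai SDK, and the project name and table columns are placeholders you'd adapt to your own setup.

```python
import time

import openai
import wandb
from tenacity import retry, stop_after_attempt, wait_random_exponential

PROJECT = "dlai_llm"          # placeholder project name
MODEL_NAME = "gpt-3.5-turbo"

# Assumes openai.api_key / OPENAI_API_KEY and wandb.login() were set up earlier.
run = wandb.init(project=PROJECT, job_type="generation")

# Retry with exponential backoff so we don't trip over rate limits.
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)

def generate_and_print(system_prompt, user_prompt, table, n=5):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    start_time = time.time()
    responses = completion_with_backoff(model=MODEL_NAME, messages=messages, n=n)
    elapsed_time = time.time() - start_time

    # Print each generation, and log a row per generation so we can review later.
    for choice in responses.choices:
        generation = choice.message.content
        print(generation)
        table.add_data(
            system_prompt,
            user_prompt,
            generation,
            elapsed_time,
            responses.usage.total_tokens,
        )

system_prompt = (
    "You are a creative copywriter. "
    "You're given a category of game asset and you need to design a name for it."
)
columns = ["system_prompt", "user_prompt", "generation", "elapsed_time", "total_tokens"]
table = wandb.Table(columns=columns)

generate_and_print(system_prompt, user_prompt="hero", table=table)
generate_and_print(system_prompt, user_prompt="jewel", table=table)

# Push the whole table to the run in one call.
wandb.log({"simple_generations": table})
```

Logging a row per generation means every experiment you run is captured in one place, which is exactly what makes the table useful when you come back to review or share the results.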
Next you'll click your run link that's printed out here to see the dashboard. Here you can see the results you just generated in the notebook, but they're saved for you, so you can get back to them later if you want to share them with someone or just review your work. I'll expand out these columns so I can see the text a little better. Here I can see the system prompt, the user prompt, and the generated examples. Now I want an estimate of cost: how much are we paying for the OpenAI API to generate these fun names for our assets? I can create a new column to do that calculation. Here I hover over the row header and click Insert 1 Right. Now I get a new column whose cell expression just says row, and I'll click on that. Now I can update the cell expression: after row, you'll want to type square brackets and then total_tokens, and we want to multiply those tokens by the estimated price per token. In this case, because we're using OpenAI, it's 0.0000015, so the expression becomes row["total_tokens"] * 0.0000015. Great! Now we have a new column that's total tokens times that number. Let's rename it cost so it's easy for other people to understand what's happening. Now this table is something I could share with a colleague to show them what I've tried for generating names.

In this second example, you'll create a simple chain. Although it may seem like a toy example, you'll use it to demonstrate the concept of a tracer, which is very powerful for debugging LLM chains and workflows. Your chain will consist of just two actions. The first action involves selecting a virtual world, and you'll call it World Picker. As you use this tool, you'll track various aspects, such as the inputs, outputs, start and end time, the result, and whether the action was successful. You'll then pass the output, which is the virtual world, to the next step in the chain, which generates the name. This step comes with another set of inputs and outputs for you to track, along with its start and end time and the final result. Both of these steps will be traced as spans. They'll be part of the MyChain trace, which will allow you to understand and analyze this workflow. Now let's see that in the code.

Now we will create and trace a chain to generate names for our assets. Up front, we'll define three sample worlds and randomly pick from this list. Then we'll define the config. We're still using GPT-3.5 Turbo, and you can set the temperature. Here it's 0.7, but you can increase this if you want more model creativity. And the system message is asking the LLM to be a creative copywriter. So we provide not only a category of game asset but also a fantasy world to the LLM, and the goal is to design a name for the asset in that given fantasy world. This next function will execute our creative chain and illustrate how tracing is constructed. We start with the top-level span for our creative chain. Then we define a world picker tool, where we randomly pick a world from our list, track the start and end times, and log the inputs and outputs of this span. We add the tool span to our top-level trace. Next we pass the output of this tool to the LLM chain, which requires a system and a user prompt. We use OpenAI chat completion and save that call as another span. We then add it to our top-level trace, update the metadata, and log all spans to "wandb" by logging that root span. Finally, we print out the response from our chain.
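To see how those spans fit together in code, here's a rough sketch of tracing a chain like this with the Trace class from "wandb". Treat it as an approximation rather than the exact notebook code: the span names, world list, and metadata are illustrative, it assumes the pre-1.0 openai SDK, and the Trace arguments may differ slightly between wandb versions.

```python
import datetime
import random

import openai
from wandb.sdk.data_types.trace_tree import Trace

worlds = [
    "a mystic medieval island inhabited by intelligent and funny frogs",
    "a modern castle sitting on top of a volcano in a faraway galaxy",
    "a digital world inhabited by friendly machine learning engineers",
]

model_name = "gpt-3.5-turbo"
temperature = 0.7
system_message = (
    "You are a creative copywriter. You're given a category of game asset "
    "and a fantasy world, and your goal is to design a name for that asset."
)

def now_ms():
    return round(datetime.datetime.now().timestamp() * 1000)

def run_creative_chain(query):
    # Top-level span that will hold the tool span and the LLM span.
    start_time_ms = now_ms()
    root_span = Trace(
        name="MyCreativeChain",
        kind="chain",
        start_time_ms=start_time_ms,
        metadata={"user": "student"},
    )

    # Tool span: the world picker randomly selects a world, and we record
    # its inputs, outputs, status, and timing.
    world = random.choice(worlds)
    expanded_prompt = f"Game asset category: {query}; fantasy world description: {world}"
    tool_end_ms = now_ms()
    tool_span = Trace(
        name="WorldPicker",
        kind="tool",
        status_code="success",
        start_time_ms=start_time_ms,
        end_time_ms=tool_end_ms,
        inputs={"query": query},
        outputs={"result": expanded_prompt},
    )
    root_span.add_child(tool_span)

    # LLM span: pass the expanded prompt to the chat model and record the call.
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": expanded_prompt},
    ]
    response = openai.ChatCompletion.create(
        model=model_name, messages=messages, temperature=temperature
    )
    llm_end_ms = now_ms()
    response_text = response.choices[0].message.content
    llm_span = Trace(
        name="OpenAI",
        kind="llm",
        status_code="success",
        start_time_ms=tool_end_ms,
        end_time_ms=llm_end_ms,
        inputs={"system_prompt": system_message, "query": expanded_prompt},
        outputs={"response": response_text},
        metadata={"model_name": model_name, "temperature": temperature},
    )
    root_span.add_child(llm_span)

    # Close out the root span and log the whole tree in one call
    # (this needs an active wandb run).
    root_span.add_inputs_and_outputs(
        inputs={"query": query}, outputs={"result": response_text}
    )
    root_span._span.end_time_ms = llm_end_ms  # set the end time on the underlying span
    root_span.log(name="creative_trace")
    print(response_text)
```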
Now we can start a Weights and Biases run. We'll try running this chain with prompts for the game assets Hero and Jewel. Now that we've executed those creative chains, we can see the results. However, we don't know exactly what happened in the background. So how did we get Volcanic Sentinel for our hero, or Gleamstone for the name of a jewel? Let's finish this run, then take a closer look at the results in the UI. Here, I'll click on the link to this run so I can see the table. Now in this table, I can see the result of what we just ran, the trace view, and I'll expand this out. Up top, we have the table, which captured both of the inputs we sent, Hero and Jewel, as well as the outputs, the generated names. But what happened behind the scenes?

When you click Row 1, you can see in the trace timeline below what happened behind the scenes. So for this first row, let's look at the two steps. First, the World Picker tool takes Hero as input and produces a description, a modern castle, randomly chosen from that array of three options. This result is then passed on to the next step, the OpenAI LLM. You can see that the generated output, that modern fantasy castle, is now actually pulled in here as input to this next step. Now, for inputs to that LLM, we have the system prompt as well as the result from the world picker. This trace timeline helps us understand the execution of this chain and how these steps fit together. Now, this trace timeline is pretty simple, there are only two steps, but it gets really useful when your chain is longer and more complex, with a lot of different steps. It will allow you to debug and pinpoint any issues if the results aren't what you expect. So, for example, if the world picker failed, we could see that here. Although defining chains manually can be exhausting, there are libraries like LangChain that can speed up that process. We'll discuss that in the following example.

This final example uses a LangChain agent. In contrast to a chain, where each step is predetermined and fixed, an agent uses an LLM to reason and make decisions about what steps to take or what tools to use. In the demo, you'll see that the agent is more unpredictable. It's less deterministic, so it's harder to debug, and it will be helpful to use the tracer. In addition to the WorldPicker tool, this time we'll have the agent use a new tool, NameValidator, to check if the name looks good. Now let's head back to the notebook to see this in action.

First we'll run the necessary imports. Then we'll start a new run to keep track of our results. Next we'll set an environment variable, "LANGCHAIN_WANDB_TRACING", to true. This sets up the tracking so you'll automatically get those traces logged. Now let me introduce our tools. We'll keep it simple, as this example is just to illustrate the concept of tracing. The first tool is the WorldPicker, which randomly returns a choice from the list of worlds. The second tool is the NameValidator, which checks if the query, or name, is below 20 characters. If it is, it says this is the correct name. Otherwise, it says this name is too long. We'll use the OpenAI API again, with a temperature of 0.7. You can turn this up if you want more creativity. And we'll initialize the agent with a list of tools. Alright, now we're ready to run a query. We'll ask our agent to find a virtual game world and imagine the name of a hero in that world. The agent starts a new chain, and we don't even need to wait for it. We can run the agent a couple of times and check out the results.
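As a reference point, a simplified version of this agent setup might look something like the sketch below. It's an approximation, not the exact notebook code: it assumes an older langchain release where initialize_agent, Tool, and ChatOpenAI are importable as shown, and the project name, tool descriptions, and world list are placeholders.

```python
import os
import random

import wandb
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI

# Start a run to keep track of our results (project name is a placeholder).
wandb.init(project="dlai_llm", job_type="agent")

# With this environment variable set, the LangChain integration automatically
# logs every agent run to Weights & Biases as a trace.
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

worlds = [
    "a mystic medieval island inhabited by intelligent and funny frogs",
    "a modern castle sitting on top of a volcano in a faraway galaxy",
    "a digital world inhabited by friendly machine learning engineers",
]

def pick_world(query: str) -> str:
    """Randomly return one of the virtual game worlds."""
    return random.choice(worlds)

def validate_name(name: str) -> str:
    """Accept any name shorter than 20 characters; reject longer ones."""
    if len(name) < 20:
        return f"This is a correct name: {name}"
    return f"This name is too long: {name}"

tools = [
    Tool(
        name="WorldPicker",
        func=pick_world,
        description="Pick a virtual game world for your character or item naming.",
    ),
    Tool(
        name="NameValidator",
        func=validate_name,
        description="Validate if the name is properly generated.",
    ),
]

llm = ChatOpenAI(temperature=0.7)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

agent.run("Find a virtual game world for me and imagine the name of a hero in that world.")
```

Note that, unlike the chain in the previous example, nothing here dictates the order of the tool calls; the LLM decides what to do at each step, which is exactly why tracing each run is so useful.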
I'll click the run link printed out here to open the dashboard. So here you're in your workspace, and you can see the list of traces. You can click your first row and expand this out so we can see a little better. Here in this first row of the table, the input is to find a virtual game world and imagine the name of a hero. But the output is, I couldn't generate a name for the hero. Now that's a problem. Something went wrong in this trace, so let's dive in and figure out what's going on. Going down to the Trace Timeline section, I can see the steps that the agent took to get to that result.

Let's start with that first step. We put in the prompt, and we can see the agent reasoning: I should pick a virtual game world first and then generate a name for a hero in that world. It then takes the action, pick world. That's great. So what happens next with that world? It's pulled into the next step, where the LLM looks at that virtual world and needs to generate the name for a hero. The model outputs: I've picked a virtual game world inhabited by friendly machine learning engineers; now I need to generate a name for a hero in that world. But this is where things go wrong. It then says action, validate name, and action input, none. So it actually didn't generate a name yet. It skipped right to the validation step, and that's where things are going wrong here. When we get to the validate name step, it takes the input none, because no name was generated, and gives the output, this is a correct name, none, because technically it's less than 20 characters, which is the rule for that tool.

So using this trace timeline view, we were able to identify where the agent went wrong, and now we can debug this to fix the problem for future generations. In summary, this analysis gave us valuable insights into how and where the chain might have gone wrong, and we can use that information to improve the results and make our agent more successful. Now, in the next lesson, we'll talk about fine-tuning LLMs.