We'll walk you through some core concepts on how to evaluate RAG systems. Specifically, we will introduce the RAG triad, a triad of metrics for the three main steps of a RAG's execution: context relevance, groundedness, and answer relevance. These are examples of an extensible framework of feedback functions: programmatic evaluations of LLM apps. We then show you how to synthetically generate an evaluation data set, given any unstructured corpus. Let's get started.

Now I'll use a notebook to walk you through the RAG triad, answer relevance, context relevance, and groundedness, to understand how each can be used with TruLens to detect hallucinations. At this point, you have already pip installed trulens-eval and llama-index, so I'll not show you that step. The first step for you will be to set up an OpenAI API key. The OpenAI key is used for the completion step of the RAG and to implement the evaluations with TruLens. So here's a code snippet that does exactly that, and you're now all set up with the OpenAI key.

In the next section, I will quickly recap the query engine construction with Llama Index. Jerry has already walked you through that in lesson one in some detail, and we will largely build on that lesson. The first step now is to set up a Tru object. From trulens_eval, we import the Tru class, set up a Tru object, an instance of this class, and then use this object to reset the database. This database will be used later on to record the prompts, responses, and intermediate results of the Llama Index app, as well as the results of the various evaluations we will be setting up with TruLens.

Now let's set up the Llama Index reader. This snippet of code reads a PDF document, on how to build a career in AI, written by Andrew Ng, from a directory, and loads this data into a documents object. The next step is to merge all of this content into a single large document, rather than having one document per page, which is the default setup. Next, we set up the sentence index, leveraging some of the Llama Index utilities. You can see here that we are using OpenAI GPT-3.5 Turbo, set at a temperature of 0.1, as the LLM that will be used for completion in the RAG. The embedding model is set to BGE small, version 1.5, and all of this content is indexed into the sentence index object. Next, we set up the sentence window engine. This is the query engine that will be used later on to do retrieval effectively from this advanced RAG application.

Now that we have set up the query engine for sentence-window-based RAG, let's see it in action by asking a specific question: how do you create your AI portfolio? This returns a full object with the final response from the LLM, the intermediate pieces of retrieved context, as well as some additional metadata. Let's take a look at what the final response looks like. Here you can see the final response that came out of this sentence-window-based RAG. It provides a pretty good answer, on the surface, to the question of how to create your AI portfolio. Later on, we will see how to evaluate answers of this form against the RAG triad to build confidence and identify failure modes for RAGs of this form.
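For reference, here is a minimal sketch of the setup described above, assuming the trulens_eval and pre-0.10 llama_index APIs used around the time of this course; the PDF path is illustrative, and import paths may differ in other library versions.

```python
import os
import openai
from trulens_eval import Tru
from llama_index import SimpleDirectoryReader, Document, ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

# OpenAI key, used both for RAG completion and for the TruLens evaluations.
openai.api_key = os.environ["OPENAI_API_KEY"]

# Tru object: resets the local database that will store prompts,
# responses, intermediate results, and evaluation scores.
tru = Tru()
tru.reset_database()

# Read the PDF (illustrative file name) and merge the per-page documents
# into one large document.
documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()
document = Document(text="\n\n".join(doc.text for doc in documents))

# Build the sentence-window index: GPT-3.5 Turbo for completion,
# BGE small v1.5 for embeddings.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
sentence_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser,
)
sentence_index = VectorStoreIndex.from_documents(
    [document], service_context=sentence_context
)

# Sentence-window query engine: swap each retrieved sentence for its
# surrounding window, then rerank the retrieved nodes.
sentence_window_engine = sentence_index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base"),
    ],
)

response = sentence_window_engine.query("How do you create your AI portfolio?")
print(response.response)
```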
Now that we have an example of a response to this question that looks quite good on the surface, we will see how to make use of feedback functions, such as the RAG triad, to evaluate this kind of response more deeply, identify failure modes, and build confidence or iterate to improve the LLM application. Now that we have set up the sentence-window-based RAG application, let's see how we can evaluate it with the RAG triad.

We'll do a little bit of housekeeping in the beginning. The first step is a code snippet that lets us launch a Streamlit dashboard from inside the notebook. You'll see later that we'll make use of that dashboard to see the results of the evaluation, to run experiments, to look at different choices of apps, and to see which one is doing better. Next up, we initialize OpenAI GPT-3.5 Turbo as the default provider for our evaluations. This provider will be used to implement the different feedback functions or evaluations, such as context relevance, answer relevance, and groundedness.

Now, let's go deeper into each of the evaluations of the RAG triad, and we'll go back and forth a bit between slides and the notebook to give you the full context. First up, we'll discuss answer relevance. Recall that answer relevance checks whether the final response is relevant to the query that was asked by the user. To give you a concrete example of what the output of answer relevance might look like, here's an example. The user asked the question, how can altruism be beneficial in building a career? This was the response that came out of the RAG application, and the answer relevance evaluation produces two pieces of output. One is a score: on a scale of 0 to 1, the answer was assessed to be highly relevant, so it got a score of 0.9. The second piece is the supporting evidence, or the rationale, the chain-of-thought reasoning behind why the evaluation produced this score. Here you can see the supporting evidence found in the answer itself, which indicates to the LLM evaluation that it is a meaningful and relevant answer.

I also want to use this opportunity to introduce the abstraction of a feedback function. Answer relevance is a concrete example of a feedback function. More generally, a feedback function provides a score on a scale of 0 to 1 after reviewing an LLM app's inputs, outputs, and intermediate results. Let's now look at the structure of feedback functions, using the answer relevance feedback function as a concrete example. The first component is a provider, and in this case, we can see that we are using an LLM from OpenAI to implement these feedback functions. Note that feedback functions don't necessarily have to be implemented using LLMs; we can also use BERT models and other kinds of mechanisms to implement feedback functions, which I'll talk about in some more detail later in the lesson. The second component is that, leveraging that provider, we implement a feedback function. In this case, that's the relevance feedback function. We give it a name, a human-readable name that'll be shown later in our evaluation dashboard. And for this particular feedback function, we run it on the user input, the user query, and it also takes as input the final output or response from the app.
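A minimal sketch of this housekeeping step, assuming the trulens_eval API of the course era (where the OpenAI feedback provider is importable from the top-level package) and the tru object created earlier:

```python
from trulens_eval import OpenAI as fOpenAI

# Launch the Streamlit dashboard from inside the notebook (opens a local URL);
# we'll come back to it after recording some evaluations.
# tru.run_dashboard()

# OpenAI provider (the course uses GPT-3.5 Turbo) that will implement
# the feedback functions defined below.
provider = fOpenAI()
```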
So given the user question and the final answer from the RAG, this feedback function will make use of an LLM provider, such as OpenAI GPT-3.5, to come up with a score for how relevant the response is to the question that was asked. In addition, it'll also provide supporting evidence, or chain-of-thought reasoning, as justification for that score. Let's now switch back to the notebook and look at the code in some more detail.

Now let's see how to define the question-answer relevance feedback function in code. From trulens_eval, we import the Feedback class. Then we set up the different pieces of the question-answer relevance function that we were just discussing. First up, we have the provider, that is OpenAI GPT-3.5. We set up this particular feedback function so that the relevance score is also augmented with chain-of-thought reasoning, much like I showed in the slides. We give this feedback function a human-understandable name; we call it Answer Relevance. This will show up later in the dashboard, making it easy for users to understand what the feedback function is evaluating. Then we give the feedback function access to the input, that is the prompt, and the output, which is the final response coming out of the RAG application. With this setup, later on in the notebook, we will see how to apply this feedback function to a set of records, getting the evaluation scores for answer relevance as well as the chain-of-thought reasons for why that particular score was judged to be appropriate for that answer.

The next feedback function that we will go deep into is context relevance. Recall that context relevance checks how good the retrieval process is. That is, given a query, we look at each piece of retrieved context from the vector database and assess how relevant that piece of context is to the question that was asked. Let's look at a simple example. The question here, or the prompt from the user, is: how can altruism be beneficial in building a career? These are the two pieces of retrieved context. After the evaluation with context relevance, each of these pieces of retrieved context gets a score between 0 and 1. You can see here that the left context got a relevance score of 0.5 and the right context got a relevance score of 0.7, so it was assessed to be more relevant to this particular query. The mean context relevance score, the average of the relevance scores of the retrieved pieces of context, also gets reported.

Let's now look at the structure of the feedback function for context relevance. Various pieces of this structure are similar to the structure for answer relevance, which we reviewed a few minutes ago. There is a provider, that's OpenAI, and the feedback function makes use of that provider to implement the context relevance feedback function. The differences are in the inputs to this particular feedback function. In addition to the user input or prompt, we also share with this feedback function a pointer to the retrieved contexts, that is, the intermediate results in the execution of the RAG application. We get back a score for each of the retrieved pieces of context, assessing how relevant or good that context is with respect to the query that was asked, and then we aggregate and average those scores across all the retrieved pieces of context to get the final score.
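Here is a short sketch of the answer relevance feedback function, assuming the provider object created in the earlier snippet:

```python
from trulens_eval import Feedback

# Answer relevance: compare the user input (prompt) with the app's final output.
# relevance_with_cot_reasons returns a 0-1 score plus chain-of-thought reasoning.
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance",
).on_input_output()
```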
Now, you will notice that in the answer relevance feedback function, we made use of only the original input, the prompt, and the final response from the RAG. In this feedback function, we are making use of the input or prompt from the user, as well as intermediate results, the set of retrieved contexts, to assess the quality of the retrieval. Between these two examples, the full power of feedback functions is leveraged by making use of inputs, outputs, and intermediate results of a RAG application to assess its quality.

Now that we have the context selection set up, we are in a position to define the context relevance feedback function in code. You'll see that it's pretty much the code segment that I walked through on the slide. We are still using OpenAI as the provider and GPT-3.5 as the evaluation LLM. We are calling the question-statement relevance, or context relevance, feedback function. It gets the input prompt and the set of retrieved pieces of context, runs the evaluation function on each of those retrieved pieces of context separately, gets a score for each of them, and then averages them to report a final aggregate score. One additional variant that you can use, if you like, is to augment the context relevance score for each piece of retrieved context with chain-of-thought reasoning, so that the evaluation LLM provides not only a score but also a justification or explanation for its assessment. That can be done with the qs_relevance_with_cot_reasons method. To give you a concrete example of this in action, here's the question, or user prompt: how can altruism be beneficial in building a career? This is an example of a retrieved piece of context that pulls out a chunk from Andrew's article on this topic. You can see the context relevance feedback function gives a score of 0.7, on a scale of 0 to 1, to this piece of retrieved context. And because we have also invoked chain-of-thought reasoning on the evaluation LLM, it provides this justification for why the score is 0.7.

Let me now show you the code snippet to set up the groundedness feedback function. We kick it off in much the same way as the previous feedback functions, leveraging the LLM provider for evaluation, which, if you recall, is OpenAI GPT-3.5. Then we define the groundedness feedback function. This definition is structurally very similar to the definition for context relevance. The groundedness measure comes with chain-of-thought reasons justifying the scores, much like I discussed on the slides. We give it the name Groundedness, which is easy to understand. It gets access to the set of retrieved contexts in the RAG application, much like for context relevance, as well as the final output or response from the RAG. Each sentence in the final response gets a groundedness score, and those are aggregated, averaged, to produce the final groundedness score for the full response. The context selection here is the same context selection that was used for setting up the context relevance feedback function. If you recall, that just gets the set of retrieved pieces of context from the retrieval step of the RAG; we can then access each node within that list, recover the text of the context from that node, and proceed to work with that to do the context relevance as well as the groundedness evaluation.
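As a sketch of how these two feedback functions might be defined in code, again assuming the trulens_eval API of this era and the provider object from earlier:

```python
import numpy as np
from trulens_eval import Feedback, TruLlama
from trulens_eval.feedback import Groundedness

# Context selection: a pointer to the text of each retrieved source node,
# i.e. the intermediate results of the RAG's retrieval step.
context_selection = TruLlama.select_source_nodes().node.text

# Context relevance: score each retrieved context against the user input,
# with chain-of-thought reasons, then average the per-context scores.
f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

# Groundedness: check that each statement of the final response is supported
# by the retrieved contexts, then aggregate the per-statement scores.
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
```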
With that, we are now in a position to start executing the evaluation of the RAG application. We have set up all three feedback functions: answer relevance, context relevance, and groundedness. All we need is an evaluation set on which we can run the application and the evaluations, see how they're doing, and see if there are opportunities to iterate and improve them further.

Let's now look at the workflow to evaluate and iterate to improve LLM applications. We will start with the basic Llama Index RAG that we introduced in the previous lesson, and which we have already evaluated with the TruLens RAG triad. We'll focus a bit on the failure modes related to the context size. Then we will iterate on that basic RAG with an advanced RAG technique, the Llama Index sentence-window RAG. Next, we will re-evaluate this new advanced RAG with the TruLens RAG triad, focusing on these kinds of questions: do we see improvements, specifically in context relevance? What about the other metrics? The reason we focus on context relevance is that failure modes often arise because the context is too small. Once you increase the context up to a certain point, you might see improvements in context relevance. In addition, when context relevance goes up, we often find improvements in groundedness as well, because the LLM in the completion step has enough relevant context to produce the summary. When it does not have enough relevant context, it tends to leverage its own internal knowledge from the pre-training data set to try to fill those gaps, which results in a loss of groundedness. Finally, we will experiment with different window sizes to figure out what window size results in the best evaluation metrics. Recall that if the window size is too small, there may not be enough relevant context to get a good score on context relevance and groundedness. If the window size becomes too big, on the other hand, irrelevant context can creep into the final response, resulting in not such great scores in groundedness or answer relevance.

We walked through three examples of evaluations, or feedback functions: context relevance, answer relevance, and groundedness. In our notebook, all three were implemented with LLM evaluations. I do want to point out that feedback functions can be implemented in different ways. Often, we see practitioners starting out with ground truth evals, which can be expensive to collect but are nevertheless a good starting point. We also see people leverage humans to do evaluations. That's also helpful and meaningful, but hard to scale in practice. To give you a concrete example of ground truth evals, think of a summarization use case where there's a large passage and the LLM produces a summary. A human expert would then give that summary a score indicating how good it is. This can be used for other kinds of use cases as well, such as chatbot-like use cases or even classification use cases. Human evals are similar in some ways to ground truth evals, in that as the LLM or RAG application produces an output, the human users of that application provide a rating for how good that output is. The difference from ground truth evals is that these human users may not be as much of an expert in the topic as the ones who produce the curated ground truth evals. It's nevertheless a very meaningful evaluation. It'll scale a bit better than ground truth evals, but our degree of confidence in it is lower.
One very interesting result from the research literature is that if you ask a set of humans to rate the same question, there's about 80% agreement between them. And interestingly enough, when you use LLMs for evaluation, the agreement between the LLM evaluation and the human evaluation is also around the 80 to 85% mark. That suggests that LLM evaluations are quite comparable to human evaluations for the benchmark data sets to which they have been applied. So feedback functions provide us a way to scale up evaluations in a programmatic manner. In addition to the LLM evals that you have seen, feedback functions can also implement traditional NLP metrics such as ROUGE and BLEU scores. These can be helpful in certain scenarios, but one weakness they have is that they are quite syntactic: they look for overlap between words in two pieces of text. For example, if one piece of text refers to a river bank and the other to a financial bank, syntactically they might be viewed as similar, and these references might end up being treated as similar by a traditional NLP evaluation, whereas the surrounding context gets used to provide a more meaningful evaluation when you use either large language models such as GPT-4 or medium-sized language models such as BERT to perform your evaluation. While in the course we have given you three examples of feedback functions and evaluations, answer relevance, context relevance, and groundedness, TruLens provides a much broader set of evaluations to ensure that the apps you're building are honest, harmless, and helpful. These are all available in the open-source library, and we encourage you to play with them as you work through the course and build your LLM applications.

Now that we have set up all the feedback functions, we can set up a recorder object, which will be used to record the execution of the application on various records. You'll see here that we are importing the TruLlama class and creating an object, tru_recorder, of this TruLlama class. This is our integration of TruLens with Llama Index. It takes in the sentence window engine from Llama Index that we created earlier, sets the app ID, and makes use of the three feedback functions of the RAG triad that we created earlier. This tru_recorder object will be used in a little bit to run the Llama Index application, as well as the evaluation with these feedback functions, and to record it all in a local database.

Let us now load some evaluation questions. In this setup, the evaluation questions are already set up in a text file, and we just execute this code snippet to load them in. Let's take a quick look at these questions that we will use for evaluation. You can see "What are the keys to building a career in AI?", and so on. You can edit this file yourself and add your own questions that you might want to get answered from Andrew Ng. You can also append directly to the eval questions list in this way. Now let's take a look at the eval questions list, and you can see that this question has been added at the end. Go ahead and add your own questions. And now we have everything set up to get to the most exciting step in this notebook. With this code snippet, we can execute the sentence window engine on each question in the list of eval questions that we just looked at, and then, with tru_recorder, we run each record against the RAG triad.
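A sketch of the recording setup and the evaluation loop, assuming the sentence_window_engine and the three feedback functions defined in the earlier snippets; the file name and the appended question are illustrative:

```python
from trulens_eval import TruLlama

# TruLens <-> Llama Index integration: wrap the query engine with the
# RAG triad feedback functions so every call is recorded and evaluated.
tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="App_1",
    feedbacks=[f_qa_relevance, f_qs_relevance, f_groundedness],
)

# Load the evaluation questions from a text file, then optionally
# append your own question to the list.
eval_questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        eval_questions.append(line.strip())
eval_questions.append("How can I be successful in AI?")

# Run the app on each question; prompts, responses, intermediate results,
# and feedback scores are logged to the local TruLens database.
for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)
```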
We will record the prompts, responses, intermediate results, and evaluation results in the TruLens database. You can see here that as the execution steps happen for each record, there is a hash that is an identifier for the record. As the record gets added, we have an indicator here that the step has executed effectively. In addition, the feedback result for answer relevance is done, and similarly for context relevance and groundedness.

Now that we have the recording done, we can see the logs in the notebook by getting the records and feedback and executing this code snippet. I don't want you to necessarily read through all of the information here. The main point I want to make is that you can see the depth of instrumentation in the application. A lot of information gets logged by the tru_recorder: prompts, responses, evaluation results, and so forth. This can be quite valuable for identifying failure modes in the apps and informing iteration and improvement. All of this information is available in a flexible JSON format, so it can be explored and consumed by downstream processes.

Next up, let's look at a more human-readable format for the prompts, responses, and feedback function evaluations. With these code snippets, you can see that for each input prompt or question, we get the output and the respective scores for context relevance, groundedness, and answer relevance, and this is run for each and every entry in the list of questions in the eval questions text file. You can see here the last question, "How can I be successful in AI?", which is the question that I manually appended to the end of that list. Sometimes in running the evaluations, you might see a None result; that likely happens because of an API key failure. You just want to rerun it to ensure that the execution completes successfully.

I just showed you a record-level view of the prompts, responses, and evaluations. Let's now get an aggregate view in the leaderboard, which aggregates across all of these individual records and produces an average score across the 10 records in the database. You can see here, in the leaderboard, the aggregate view across all 10 records. We had set the app ID to App 1. The average context relevance is 0.56. Similarly, there are average scores for groundedness, answer relevance, and latency across all 10 records of questions that were asked of the RAG application. It's useful to get this aggregate view to see how well your app is performing and at what level of latency and cost.

In addition to the notebook interface, TruLens also provides a local Streamlit dashboard with which you can examine the applications you're building, look at the evaluation results, and drill down into record-level views to get both aggregate and detailed evaluation views into the performance of your app. We can get the dashboard going with the tru.run_dashboard method, which launches the dashboard at a local URL. Now, once I click on this, it might show up in some window that is not within this frame. Let's take a few minutes to walk through this dashboard. You can see here the aggregate view of the app's performance. 11 records were processed by the app and evaluated. The average latency is 3.55 seconds. We have the total cost and the total number of tokens that were processed by the LLMs, and then scores for the RAG triad: for context relevance, it's 0.56; for groundedness, 0.86; and answer relevance is 0.92.
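For reference, a minimal sketch of how these inspection steps might look in code, assuming the tru object and feedback function names from the earlier snippets:

```python
import pandas as pd

# Raw logs: a DataFrame of records plus the list of feedback column names.
records, feedback_names = tru.get_records_and_feedback(app_ids=[])
records.head()

# A more human-readable view: inputs, outputs, and the RAG triad scores per record.
pd.set_option("display.max_colwidth", None)
records[["input", "output"] + feedback_names]

# Aggregate leaderboard view across all records, per app ID.
tru.get_leaderboard(app_ids=[])

# Launch the local Streamlit dashboard at a local URL.
tru.run_dashboard()
```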
We can select the app here to get a more detailed, record-level view of the evaluations. For each of the records, you can see the user input (the prompt), the response, the metadata, the timestamp, and then the scores for answer relevance, context relevance, and groundedness that have been recorded, along with latency, total number of tokens, and total cost.

Let me pick a row in which the evaluation indicates that the RAG application has done well. Let's pick this row. Once we click on a row, we can scroll down and get a more detailed view of the different components of that row from the table. The question here, the prompt, was: what is the first step to becoming good at AI? The final response from the RAG was that it is to learn foundational technical skills. Down here, you can see that the answer relevance was judged to be 1 on a scale of 0 to 1; it's quite a relevant answer to the question that was asked. Up here, you can see the context relevance: the average context relevance score is 0.8, and the two pieces of context that were retrieved both individually got scores of 0.8. We can see the chain-of-thought reason for why the LLM evaluation gave a score of 0.8 to this particular retrieval. And then down here, you can see the groundedness evaluations. This was one of the clauses in the final answer; it got a score of 1, and over here is the reason for that score. You can see the statement sentence, and the supporting evidence backs it up, so it got a full score of 1 on a scale of 0 to 1, or a full score of 10 on a scale of 0 to 10. The kind of reasoning and information we were previously talking through on slides and in the notebook, you can now see quite neatly in this local Streamlit app that runs on your machine. You can also get a detailed view of the timeline, as well as access to the full JSON object.

Now let's look at an example where the RAG did not do so well. As I look through the evaluations, I see this row with a low groundedness score of 0.5, so let's click on that. That brings up this example. The question is: how can altruism be beneficial in building a career? There's a response, and if I scroll down to the groundedness evaluation, both of the sentences in the final response have low groundedness scores. Let's pick one of these and look at why the groundedness score is low. You can see that the overall response got broken down into four statements; the top two were good, but the bottom two did not have good supporting evidence in the retrieved pieces of context. In particular, if you look at this last one, the final output from the LLM says: additionally, practicing altruism can contribute to personal fulfillment and a sense of purpose, which can enhance motivation and overall well-being, ultimately benefiting one's career success. While that might very well be the case, there was no supporting evidence found in the retrieved pieces of context to ground that statement, and that's why our evaluation gives this a low score. You can play around with the dashboard and explore some of these other examples where the final RAG output does not do so well, to get a feeling for the kinds of failure modes that are quite common when you're using RAG applications. Some of these will get addressed as we go into the sessions on more advanced RAG techniques, which can do better at addressing these failure modes.
With that, lesson two is a wrap. In the next lesson, we will walk through the mechanics of sentence-window-based retrieval, an advanced RAG technique, and also show you how to evaluate the advanced technique leveraging the RAG triad and TruLens.