This lesson covers another advanced RAG technique, auto-merging. An issue with the naive approach is that you're retrieving a bunch of fragmented context chunks to put into the LLM context window, and the fragmentation is worse the smaller your chunk size. Here, we use an auto-merging heuristic to merge smaller chunks into a bigger parent chunk to help ensure more coherent context. Let's check out how to set it up.

In this section, we'll talk about auto-merging retrieval. An issue with the standard RAG pipeline is that you're retrieving a bunch of fragmented context chunks to put into the LLM context window, and the fragmentation is worse the smaller your chunk size. For instance, you might get back two or more retrieved context chunks from roughly the same section, but there's actually no guarantee on the ordering of these chunks. This can potentially hamper the LLM's ability to synthesize over the retrieved context within its context window.

So what auto-merging retrieval does is the following. First, define a hierarchy of smaller chunks linking to bigger parent chunks, where each parent chunk can have some number of children. Second, during retrieval, if the set of smaller chunks linking to a parent chunk exceeds some percentage threshold, then we merge the smaller chunks into the bigger parent chunk, and we retrieve the bigger parent chunk instead to help ensure more coherent context. Now let's check out how to set this up.

This notebook will introduce the various components needed to construct an auto-merging retriever with LlamaIndex, and we'll cover each component in detail. Similar to the previous section, at the end, Anupam will show you how to experiment with parameters and evaluation with TruEra.

Similar to before, we'll load in the OpenAI API key using a convenience helper function in our utils file. As with the previous lessons, we'll also use the How to Build a Career in AI PDF, and as before, we encourage you to try out your own PDF files as well. We load in 41 document objects and merge them into a single large document, which makes this more amenable to text splitting with our advanced retrieval methods.

Now we're ready to set up our auto-merging retriever. This consists of a few different components, and the first step is to define what we call a hierarchical node parser. In order to use an auto-merging retriever, we need to parse our nodes in a hierarchical fashion. This means that nodes are parsed in decreasing sizes and contain relationships to their parent node.

Here we demonstrate how the node parser works with a small example. We create a parser to demonstrate, with chunk sizes of 2048, 512, and 128 tokens. You can change the chunk sizes to any decreasing sequence you'd like; here each level is smaller by a factor of four. Now let's get the set of nodes from the document. This returns all nodes: leaf nodes, intermediate nodes, and parent nodes, so there's going to be a decent amount of overlap of information and content between them. If we only want the leaf nodes, we can call a function within LlamaIndex called get_leaf_nodes and take a look at what that returns. In this example, we call get_leaf_nodes on the original set of nodes and look at the text of the 31st node. We see that the text chunk is actually fairly small.
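For concreteness, here is a minimal sketch of that parsing step, assuming the 0.9-era llama_index API in use at the time of this course (import paths differ in newer releases) and an illustrative PDF filename. The parent lookup at the end is used in the discussion that follows.

```python
from llama_index import Document, SimpleDirectoryReader
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes

# Load the PDF (filename illustrative) and merge the per-page documents into one.
documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()
document = Document(text="\n\n".join(doc.text for doc in documents))

# Hierarchy of chunk sizes, each level a factor of four smaller than its parent.
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])

# Returns all nodes: leaf, intermediate, and parent (with overlapping content).
nodes = node_parser.get_nodes_from_documents([document])

# Keep only the 128-token leaf nodes; these are what gets embedded later.
leaf_nodes = get_leaf_nodes(nodes)
print(leaf_nodes[30].text)  # the 31st leaf node

# Follow the stored relationship to inspect that leaf's 512-token parent.
nodes_by_id = {node.node_id: node for node in nodes}
parent_node = nodes_by_id[leaf_nodes[30].parent_node.node_id]
print(parent_node.text)
```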
And this is an example of a leaf node, because a leaf node has the smallest chunk size of 128 tokens. Here, it reads along the lines of how you might go about strengthening your math background, figuring out what's important to know, and so on.

Now that we've shown what a leaf node looks like, we can also explore the relationships. We can print the parent of the above node and observe that it's a larger chunk containing the text of the leaf node, but also more. More concretely, the parent node contains 512 tokens and has four leaf nodes of 128 tokens each. There are four leaf nodes because the chunk sizes are divided by a factor of four each time. This is what the parent node of the 31st leaf node looks like.

Now that we've shown you what the node hierarchy looks like, we can construct our index. We'll use the OpenAI LLM, specifically GPT-3.5 Turbo, and define a service context object containing the LLM, the embedding model, and the hierarchical node parser. As with the previous notebooks, we'll use the bge-small-en embedding model.

The next step is to construct our index. The way the index works is that we construct a vector index on specifically the leaf nodes. All other intermediate and parent nodes are stored in a doc store and are fetched dynamically during retrieval, but what we actually fetch during the initial top-k embedding lookup is specifically the leaf nodes, and that's what we embed. You'll see in this code that we define a storage context object, which by default is initialized with an in-memory document store, and we call storage_context.docstore.add_documents to add all nodes to this in-memory doc store. However, when we create our vector store index, called the auto-merging index, we only pass in the leaf nodes for vector indexing. This means that specifically the leaf nodes are embedded using the embedding model and indexed. But we also pass in the storage context as well as the service context, so the vector index does have knowledge of the underlying doc store that contains all the nodes. Finally, we persist this index. If you've already built this index and want to load it from storage, you can just copy and paste this block of code, which will rebuild the index if it doesn't exist or load it from storage.

The last step, now that we've defined the auto-merging index, is to set up the retriever and run the query engine. The auto-merging retriever is what controls the merging logic: if a majority of child nodes are retrieved for a given parent, they are swapped out for the parent instead. In order for this merging to work well, we set a large top-k for the leaf nodes, and remember, the leaf nodes also have a smaller chunk size of 128 tokens. To reduce token usage, we apply a re-ranker after the merging has taken place. For example, we might retrieve the top 12, merge into a top 10, and then re-rank down to a top 6. The top-n for the re-ranker may seem large, but remember that the base chunk size is only 128 tokens, and the next parent above that is 512 tokens. We import a class called AutoMergingRetriever and then define a SentenceTransformerRerank module. We combine both the auto-merging retriever and the re-rank module into our retriever query engine, which handles both retrieval and synthesis.

Now that we've set this whole thing up end to end, let's actually test it with "What is the importance of networking in AI?" as an example question, and we get back a response.
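The construction just described might look like the following sketch, again assuming the 0.9-era llama_index imports (module paths such as llama_index.indices.postprocessor have since moved), the model names mentioned above, and an illustrative persist directory; node_parser, nodes, and leaf_nodes come from the previous sketch.

```python
import os

from llama_index import (
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.llms import OpenAI
from llama_index.retrievers import AutoMergingRetriever
from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.query_engine import RetrieverQueryEngine

# LLM, embedding model, and the hierarchical node parser defined earlier.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
auto_merging_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser,
)

save_dir = "./merging_index"
if not os.path.exists(save_dir):
    # All nodes (leaf, intermediate, parent) go into the in-memory doc store,
    # but only the leaf nodes are embedded and vector-indexed.
    storage_context = StorageContext.from_defaults()
    storage_context.docstore.add_documents(nodes)
    automerging_index = VectorStoreIndex(
        leaf_nodes,
        storage_context=storage_context,
        service_context=auto_merging_context,
    )
    automerging_index.storage_context.persist(persist_dir=save_dir)
else:
    # Rebuild-or-load pattern: reload the persisted index on later runs.
    automerging_index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=save_dir),
        service_context=auto_merging_context,
    )

# Large leaf-level top-k so enough sibling chunks are retrieved to trigger merging.
base_retriever = automerging_index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever, automerging_index.storage_context, verbose=True
)

# Re-rank the (possibly merged) nodes down to the top 6 to control token usage.
rerank = SentenceTransformerRerank(top_n=6, model="BAAI/bge-reranker-base")
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)

response = auto_merging_engine.query("What is the importance of networking in AI?")
print(str(response))
```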
We see that it says networking is important in AI because it allows individuals to build a strong professional network, and more.

The next step is to put it all together. We'll create two high-level functions, build_automerging_index and get_automerging_query_engine, which capture all the steps we just showed you. The first function, build_automerging_index, uses the hierarchical node parser to parse out the hierarchy of child-to-parent nodes, defines the service context, and creates a vector store index from the leaf nodes while also linking to the document store of all the nodes. The second function, get_automerging_query_engine, leverages our auto-merging retriever, which is able to dynamically merge leaf nodes into parent nodes, uses our re-rank module, and combines both into the overall retriever query engine. So we build the index using the build_automerging_index function with the original source document, the LLM set to GPT-3.5 Turbo, and "merging_index" as the save directory. Then for the query engine, we call get_automerging_query_engine on that index, and we set a similarity top-k equal to six.

As a next step, Anupam will show you how to evaluate the auto-merging retriever and iterate on parameters using TruEra. We encourage you to try out your own questions as well, and to iterate on the parameters of auto-merging retrieval. For instance, what happens when you change the chunk sizes, the top-k, or the top-n for the re-ranker? Play around with it and tell us what the results are.

That was awesome, Jerry. Now that you have set up the auto-merging retriever, let's see how we can evaluate it with the RAG triad and compare its performance to the basic RAG with experiment tracking.

Let's set up this new auto-merging index. You'll notice that it has two layers: the lowest layer, the leaf nodes, has a chunk size of 512, and the next layer up in the hierarchy has a chunk size of 2048, meaning that each parent will have four leaf nodes of 512 tokens each. The other pieces of the setup are exactly the same as what Jerry showed you earlier. One reason you may want to experiment with the two-layer auto-merging structure is that it's simpler: less work is needed to create the index, and there is also less work in the retrieval step because all the third-layer chunks go away. If it performs comparably well, then ideally we want to work with the simpler structure.

Now that we have created the index with this two-layer auto-merging structure, let's set up the auto-merging engine for it. I'm keeping the top-k at the same value as before, which is 12, and the re-ranking step will also use the same top-n of six. This will let us do a more direct head-to-head comparison between this application setup and the three-layer auto-merging hierarchy app that Jerry set up earlier.

Now let's set up the TruRecorder with this auto-merging engine, and we'll give it an app ID of app_0. Let's also load some questions for evaluation from the generated_questions.txt file that we set up earlier. Now we can define how to run these evaluation questions: for each question in the eval set, the TruRecorder object, when invoked via run_evals, will record the prompts, responses, and evaluation results, leveraging the query engine.
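Sketched out, this evaluation setup might look like the following. The helpers build_automerging_index, get_automerging_query_engine, and get_prebuilt_trulens_recorder are assumed to live in the course's utils module (names and signatures here are illustrative, wrapping the steps shown above); Tru and its methods are standard trulens_eval calls, and documents is the list loaded earlier.

```python
from llama_index.llms import OpenAI
from trulens_eval import Tru

# Assumed course helpers wrapping the index/engine construction shown above,
# plus a recorder pre-configured with the RAG-triad feedback functions.
from utils import (
    build_automerging_index,
    get_automerging_query_engine,
    get_prebuilt_trulens_recorder,
)

tru = Tru()
tru.reset_database()

# app_0: two-layer hierarchy with 2048-token parents and 512-token leaves.
auto_merging_index_0 = build_automerging_index(
    documents,  # document objects loaded earlier with SimpleDirectoryReader
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    save_dir="merging_index_0",
    chunk_sizes=[2048, 512],
)
auto_merging_engine_0 = get_automerging_query_engine(
    auto_merging_index_0, similarity_top_k=12, rerank_top_n=6
)

tru_recorder = get_prebuilt_trulens_recorder(auto_merging_engine_0, app_id="app_0")

# One evaluation question per line in the file generated earlier in the course.
with open("generated_questions.txt") as f:
    eval_questions = [line.strip() for line in f if line.strip()]

def run_evals(eval_questions, tru_recorder, query_engine):
    # Queries made inside the recorder context are logged with their prompts,
    # responses, and RAG-triad evaluation scores.
    for question in eval_questions:
        with tru_recorder as recording:
            query_engine.query(question)

run_evals(eval_questions, tru_recorder, auto_merging_engine_0)
```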
Now that our evaluations have completed, let's take a look at the leaderboard. We can see the app_0 metrics here. Context relevance seems low; the other two metrics are better. This is with our two-level hierarchy with 512 tokens as the leaf node chunk size and the parent being 2048 tokens, so four leaf nodes per parent node. Now we can run the TruLens dashboard and take a look at the evaluation results at the record level, at the next layer of detail.

Let's examine the app leaderboard. You can see here that after processing 24 records, the context relevance at an aggregate level is quite low, although the app is doing better on answer relevance and groundedness. I can select the app. Let's now look at the individual records of app_0 and see how the evaluation scores look for the various records. You can scroll to the right here and look at the scores for answer relevance, context relevance, and groundedness.

Let's pick one that has low context relevance. So here's one. If you click on it, you'll see the more detailed view down below. The question is to discuss the importance of budgeting for resources in the successful execution of AI projects. On the right here is the response, and if you scroll down further, you can see a more detailed view for context relevance. There were six pieces of retrieved context, and each of them is scored quite low, between 0 and 0.2. If you pick any of them and click on it, you can see that it's not particularly relevant to the question that was asked. You can also scroll back up and explore some of the other records. You can pick ones, for example, where the scores are good, like this one here, and explore how the application is doing on various questions, where its strengths are, and where its failure modes are, to build some intuition around what's working and what's not.

Let's now compare the previous app to the auto-merging setup that Jerry introduced earlier. We will have three layers in the hierarchy, starting with 128 tokens at the leaf node level, 512 one layer up, and 2048 at the highest layer, so at each layer, each parent has four children. Now let's set up the query engine and the TruRecorder for this app, following the same steps as for the previous app. And finally, we are in a position to run the evaluations.

Now that we have app_1 set up, we can take a quick look. The total cost is also about half, and that's because, recall, this app has three layers in the hierarchy and the leaf chunk size is 128 tokens instead of the 512 tokens, which is the smallest leaf node size for app_0. So that results in a cost reduction. Notice also that context relevance has increased by about 20%, and part of the reason is that the merging is likely working a lot better with this new app setup.

We can also drill down and look at app_1 in greater detail. Like before, we can look at individual records. Let's pick the same one that we looked at earlier with app_0, the question about the importance of budgeting. Now you can see that context relevance is doing better, and groundedness is also considerably higher. If we pick an example here, you'll see that it is, in fact, talking very specifically about budgeting for resources. So there is improvement in this particular instance and also at an aggregate level.
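Continuing the sketch from above, with the same assumed helpers and names, the three-layer variant and the side-by-side comparison might look like this:

```python
# app_1: three-layer hierarchy (2048 / 512 / 128), each parent with four children.
auto_merging_index_1 = build_automerging_index(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    save_dir="merging_index_1",
    chunk_sizes=[2048, 512, 128],
)
auto_merging_engine_1 = get_automerging_query_engine(
    auto_merging_index_1, similarity_top_k=12, rerank_top_n=6
)

tru_recorder_1 = get_prebuilt_trulens_recorder(auto_merging_engine_1, app_id="app_1")
run_evals(eval_questions, tru_recorder_1, auto_merging_engine_1)

# Aggregate metrics (RAG triad, latency, cost) for app_0 vs. app_1,
# plus the dashboard for record-level drill-down.
tru.get_leaderboard(app_ids=[])
tru.run_dashboard()
```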
Let me now summarize some of the key takeaways from Lesson 4. We walked you through an approach to evaluate and iterate with the auto-merging retrieval advanced RAG technique. In particular, we showed you how to iterate with different hierarchical structures: the number of levels, the number of child nodes, and the chunk sizes. For these different app versions, you can evaluate them with the RAG triad and track experiments to pick the best structure for your use case. One thing to notice is that not only are you getting the metrics associated with the RAG triad as part of the evaluation, but drilling down to the record level can help you gain intuition about the hyperparameters that work best with certain document types. For example, depending on the nature of the document, such as employment contracts versus invoices, you might find that different chunk sizes and hierarchical structures work best.

Finally, one other thing to note is that auto-merging is complementary to sentence-window retrieval. One way to think about that is, let's say you have four child nodes of a parent. With auto-merging, you might find that child number one and child number four are very relevant to the query that was asked, and these then get merged under the auto-merging paradigm. In contrast, sentence-window retrieval may not result in this kind of merging, because those chunks are not in a contiguous section of the text.

That brings us to the end of Lesson 4. We have observed that with advanced RAG techniques such as sentence-window retrieval and auto-merging retrieval, augmented with the power of evaluation, experiment tracking, and iteration, you can significantly improve your RAG applications. In addition, while the course has focused on these two techniques and the associated RAG triad for evaluation, there are a number of other evaluations that you can play with to ensure that your LLM applications are honest, harmless, and helpful. This slide has a list of some of the ones that are available out of the box in TruLens. We encourage you to go play with TruLens, explore the notebooks, and take your learning to the next level.