In this lesson, you'll learn how to enrich extracted content with metadata, which helps improve downstream RAG results by supporting hybrid search and by allowing you to chunk content more meaningfully for semantic search. All right, let's go. First, you'll learn what metadata is. Metadata is additional information that we extract while we're pre-processing the document. Metadata can be at the document level or the element level, and it can be something we extract from the document information itself, like the last modified date or the file name, or something we infer while pre-processing the document, for instance the category of element type or hierarchical relationships between the different elements. These metadata fields will come into play when you're creating your RAG application, particularly for applications like hybrid search. What metadata looks like in practice is as follows: you'll have your text, the actual raw content that you extracted from the document, and then you'll have all of this other information as well, such as the page number, the language, the file name, and the type of the element. All of this information will come in handy when you go to build hybrid search systems for your RAG application. Before you learn about hybrid search, however, it's important to first understand some basics about semantic search for LLMs. In a retrieval augmented generation system, the first step is typically retrieving documents from a vector database. The most basic way to do this is through semantic search, that is, looking for content that is similar to a query. After you've loaded your documents into a vector database, you run a query, and then you search for documents based on their similarity score, which is based on a measure of distance. In the example in the bottom right, you can see a tomato dog, which is close in the vector space to both tomato and dog.
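To make the element-plus-metadata structure described above concrete, here is a hypothetical sketch of a single extracted element. The field names are illustrative, modeled on the kind of JSON output shown in this course; the IDs are made up:

```python
# A hypothetical sketch of one extracted element; field names are
# illustrative, modeled on the JSON output discussed in the lesson.
element = {
    "type": "NarrativeText",
    "element_id": "abc123",  # hypothetical ID
    "text": "Ice hockey is played between two teams of six players each.",
    "metadata": {
        "filename": "winter_sports.epub",
        "languages": ["eng"],
        "page_number": 4,
        "parent_id": "def456",  # links this element to its section title
        "last_modified": "2024-02-12T09:30:00",
    },
}

# The "text" field is what gets embedded for semantic search; the
# metadata fields are what hybrid search will later filter on.
print(sorted(element["metadata"].keys()))
```

The split matters: similarity search only ever sees `text`, while everything under `metadata` is available for structured filtering.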
And then you can also see that the animals are clustered together because they're similar conceptually, and the animals are further apart from tomato because tomato is a fruit. The idea here is that if you're searching over a corpus of documents, you can ask the vector database to return information relevant to your query, and you'll get back similar documents that you can then insert into a prompt that gets fed into your LLM. That's typically what you're trying to do in the semantic search part of a RAG application. Okay, so now you've learned the basics of similarity search. So what's the problem? There are some cases where similarity search isn't ideal for returning information for your RAG system. In some cases, there could be too many matches. This occurs a lot when much of the document is about the same topic, and so if you search on that topic, you're going to get a lot of information back. You may also want to bias your results based on some other information, for instance, recency. You might not want only the most similar content; you might want the most similar content that occurred within a certain time frame. Finally, you also lose some important information when you're only searching on semantic similarity. There's important information contained within the document, such as section information, that might inform your search results. That's where hybrid search comes into play. Hybrid search allows you to combine similarity search with structured information that you extract from the document as metadata, and you can use that metadata for filtering. If you want to limit your search to a particular section, you can filter on that metadata field, or if you want to limit your results to more recent information, you can structure your query so that you only return documents after a particular date.
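A minimal sketch of that idea, using toy data and a toy word-overlap score in place of a real vector database and embeddings: apply the structured metadata filter first, then rank the survivors by similarity.

```python
from datetime import date

# Toy corpus: each document carries its text plus a metadata date.
corpus = [
    {"text": "Final score and highlights from the big game.",
     "published": date(2023, 2, 12)},
    {"text": "Preview: who is favored to win the big game?",
     "published": date(2023, 2, 11)},
]

def similarity(query: str, text: str) -> float:
    # Stand-in for a real embedding similarity: simple word overlap.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def hybrid_search(query, docs, after):
    # Hybrid search: apply the structured metadata filter first,
    # then rank whatever survives by (toy) similarity to the query.
    candidates = [d for d in docs if d["published"] >= after]
    return sorted(candidates,
                  key=lambda d: similarity(query, d["text"]),
                  reverse=True)

results = hybrid_search("score from the big game", corpus,
                        after=date(2023, 2, 12))
print(len(results))  # the pre-game preview is filtered out
```

The function names and scoring here are purely illustrative; real systems push the `where`-style filter down into the vector database, as shown later in the lesson with Chroma.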
The first document here shows the results of the Super Bowl on February 12th, whereas the second document is an article about who's favored to win the Super Bowl, from February 11th. If you were to conduct a similarity search on these two documents, say, "tell me the most recent information about the Super Bowl," you could potentially get both of these documents back. However, if the Super Bowl has already occurred, you might not be as interested in the information from before the game. In this case, you can use hybrid search to limit your search results to information that occurred after the completion of the Super Bowl: only give me information that happened after February 12th. By doing that, you'll get your relevant results, the results from the day of the game, and you'll filter out the results from earlier. Okay, so now you've learned about metadata and hybrid search. Let's see what this looks like in practice. Just like in the previous lesson, you can start by importing some helper functions from the unstructured library. You learned about these in the last lesson, so we won't cover them again now. In this lesson, you do have one additional import that's important to highlight: ChromaDB. ChromaDB is an in-memory vector database that you'll use in this lesson to conduct your similarity search. Okay, now that you've learned about metadata, you can put it into practice. In this example, you'll work with an e-publication about winter sports in Switzerland. The goal in this example is for you to identify the chapter for each section in the document and then conduct a hybrid search where you ask a question about a particular chapter within the document. First, you can take a look at the document. The cover looks like this, and more importantly for this application, here's the table of contents. In this application, you're going to look for the titles in the table of contents.
When we pre-process this document, we'll get a metadata field called parent ID that attaches each element to a title, which is the title of a section. You'll then use that metadata to construct a hybrid search where you search for content within a particular section, in this case a chapter, within the document. If you'd like to see the contents of this document, we've provided a PDF version of the e-publication that you can take a look at. The first step in the process is to run the document through the unstructured API. In this case, you're going to use the unstructured API instead of the open source library, because e-publications get converted to HTML before pre-processing, and that requires some extra dependencies. So we'll rely on the API to help us with that. Once you've made your request to the unstructured API, you get your response back. This is a large document, so it may take the API a few seconds to pre-process it. Once you've processed the document, you can use the JSON display function that you learned about in the previous lesson to explore the results. You can see that the first element it captured is the title of the book, that it classified that element as a title, and that there's some metadata down here. In this case, we performed a filtering operation: we filtered for title elements that contain the word hockey. By doing so, you were able to discover the ice hockey title within the table of contents, and also the ice hockey title that begins the ice hockey chapter. You'll look for elements that relate to the ice hockey chapter title through the parent ID, which will correspond to the element ID for the ice hockey title element. Of note, this is a filtering operation, and in this case we filtered down to titles. You can also try filtering down to various other element types yourself.
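The filtering operation described above can be sketched in plain Python over the element dictionaries. These are toy stand-ins with hypothetical IDs; the real API response carries many more fields:

```python
# Toy stand-ins for the API's JSON response; real elements carry many
# more fields (names mirror the output described in the lesson).
elements = [
    {"type": "Title", "element_id": "id-1", "text": "Ice Hockey"},
    {"type": "NarrativeText", "element_id": "id-2",
     "text": "Hockey has been played in Switzerland for decades."},
    {"type": "Title", "element_id": "id-3", "text": "Skiing"},
]

# Filter down to Title elements whose text mentions hockey.
hockey_titles = [
    el for el in elements
    if el["type"] == "Title" and "hockey" in el["text"].lower()
]
print([el["element_id"] for el in hockey_titles])  # → ['id-1']
```

Swapping `"Title"` for `"NarrativeText"` in the condition gives you the alternative filter mentioned next: narrative text elements that refer to hockey.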
So you can try narrative text, and in that case, you'll get back narrative text elements that refer to hockey. But for this use case, we're interested in titles, so we'll switch that back. In your free time, though, you may want to experiment with different filtering operations. Now that you know how to filter, you can set up the names of the chapters, which you can take from the list of chapters in the table of contents. Once you have those chapter names, you can loop through all of the elements that you've extracted and find the element IDs that are associated with those chapter titles. Why is that important? Well, the element ID for those titles will show up as a parent ID for elements that fall within that chapter. By doing that, you're able to identify the elements within a chapter, which enables you to conduct hybrid search on a chapter. That will allow you to, for example, search only for content that appears within the ice hockey chapter. When you do that, what you wind up with is a mapping that maps parent element IDs to chapter names. Once you have that mapping, you can easily convert parent IDs to chapter names. In this case, you're able to take a look at this parent ID here: this is an example of an element that's within the ice hockey chapter, and so it has a parent ID that corresponds to the element ID for the ice hockey title. We know that any element with this parent ID is an element within the hockey chapter. Okay, now we've identified the elements in the document and we've determined which chapter each element belongs to. The next step is to load all of this content into your vector database, which is what will allow us to perform hybrid search. In this case, you're going to use Chroma as your vector database because it runs in memory and is just generally a nice vector database.
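The parent-ID mapping described above can be sketched like this, again with toy elements and hypothetical IDs:

```python
# Chapter names taken from the table of contents (toy subset here).
chapters = ["Ice Hockey", "Skeleton"]

# Toy stand-ins for the extracted elements; IDs are hypothetical.
elements = [
    {"type": "Title", "element_id": "t-1", "text": "Ice Hockey",
     "metadata": {}},
    {"type": "NarrativeText", "element_id": "n-1",
     "text": "Teams field six players.", "metadata": {"parent_id": "t-1"}},
    {"type": "Title", "element_id": "t-2", "text": "Skeleton",
     "metadata": {}},
]

# Map each chapter title's element ID to the chapter name.
chapter_ids = {}
for el in elements:
    if el["type"] == "Title" and el["text"] in chapters:
        chapter_ids[el["element_id"]] = el["text"]

# Any element whose parent_id appears in the mapping belongs to
# that chapter.
parent = elements[1]["metadata"]["parent_id"]
print(chapter_ids[parent])  # → Ice Hockey
```

This is exactly the lookup you'll use when inserting documents into the vector database: parent ID in, chapter name out.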
You can also learn more about the Chroma vector database in another deeplearning.ai short course. First, you set up your Chroma database, and then you'll set up a collection. You pass a couple of pieces of information to the create_collection function: one is the name, and the second is some metadata. Don't get confused, though. This isn't the metadata that we're extracting from the document; this is metadata for the vector database itself. In this case, we're just telling Chroma to use cosine similarity within this particular vector space for the similarity search. Once you've set that up, you can use your chapter mapping to insert documents into the vector database with chapter metadata. If you'll recall from earlier, elements within a particular chapter have a parent ID that corresponds to the element ID for the chapter title. Okay, now the elements are all loaded into the vector database. You can use the collection.peek method to take a look at what got loaded, and here's what some of the documents look like. Now that the elements are loaded into the vector database with their metadata, you're able to perform a hybrid search on that vector database. So let's take a look at what that looks like in Chroma. First, you'll set up a query that includes query texts. In this case, you'll ask how many players are on the team, and you'll perform a hybrid search by conditioning that search on content that occurs in the ice hockey chapter. This is important because it will limit your results to only the information that appears within that chapter. So if there's another team sport covered in the document, you won't get information about it; you will only get information about ice hockey. You can run the query here and then see what the results look like.
And then if you take a look at the results, you can see some information about how many players are on a hockey team. You can try this for yourself as well: try querying the corpus with a different query, or try filtering on another chapter or some other information. Okay, now you've learned about extracting metadata while you're pre-processing and also how you can use that metadata as part of hybrid search. It turns out metadata is useful not only for hybrid search but for other operations as well, such as chunking. What does chunking mean, and why do you need it? Chunking is taking a long piece of text, like a large document, and breaking it down into smaller pieces so that you can pass those smaller pieces into the vector database and then include those snippets in prompt templates to pass to an LLM. So why do you need to do this? There are two primary reasons. First, some LLMs have a limited context window, so there's only so much content you can pass to the LLM. That means you can't pass the full document to the LLM, and so you have to break it up. Second, in a lot of cases, LLMs cost more if you're using a larger context window, so chunking documents into smaller pieces allows you to save money on inference. Your similarity search results will change based on how you chunk your content: if content is split differently across the chunks, you may get different chunks back. Better chunks in general are going to result in better similarity search outputs, which in turn are going to improve your end result when you query your LLM. The simplest way to chunk is into even-sized chunks. There are multiple ways to do this: one is chunking by characters, another is chunking by tokens. But the idea is that you take a big document, you have your threshold, and whenever you hit your threshold, you split off into a new chunk.
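The even-sized, character-based version is a few lines of Python (a deliberately naive sketch, for contrast with the element-based chunking that follows):

```python
def chunk_by_characters(text: str, max_chars: int) -> list[str]:
    """Naive even-sized chunking: split whenever the character
    threshold is hit, regardless of document structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

doc = "Ice hockey is a fast-paced team sport. " * 10  # 390 characters
chunks = chunk_by_characters(doc, max_chars=100)
print(len(chunks), len(chunks[-1]))  # → 4 90
```

Note that the splits fall wherever the threshold happens to land, including mid-sentence, which is exactly the weakness that element-based chunking addresses.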
However, by using the metadata that we extracted in the previous section, you can chunk in a more intelligent way, using information about the document elements within your document. So what does this look like? You first load your documents from your database, you pre-process those documents using a tool like unstructured, and then you chunk. So what does it mean to chunk by elements? Conceptually, you're doing something different here than you would with traditional chunking techniques. With traditional chunking techniques, you're taking a big document and splitting it up. When you're chunking from elements, you're first splitting the document into atomic document elements and then combining those elements into chunks. What does this look like? First, you partition the document like you've already learned. Once you've partitioned the document, you have individual document elements like titles, narrative text, and lists that you can use as the basis for your chunking operations. Once you have those atomic document elements, you can combine them into chunks, and when you do, you can apply break conditions in order to group together content that logically belongs together. For example, a break condition could be: whenever you hit a title, create a new chunk. Why is this important? Because titles often indicate the start of a new section, so when you apply a break condition such as title, what you wind up doing is grouping content from the same section into the same chunk. By applying a break condition such as titles, you're more likely to keep together content that belongs to the same section, which allows you to construct more coherent chunks. When you perform your similarity search in your vector database, this means you're going to get more relevant content back.
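The break-condition idea can be sketched in a few lines. This is a toy illustration of the concept, not the unstructured library's implementation:

```python
# Toy (element_type, text) pairs from a partitioned document.
elements = [
    ("Title", "Ice Hockey"),
    ("NarrativeText", "Teams field six players."),
    ("NarrativeText", "Games run three periods."),
    ("Title", "Skeleton"),
    ("NarrativeText", "Riders descend the track head-first."),
]

# Break condition: whenever a Title element appears, start a new chunk,
# so each chunk holds exactly one section's content.
chunks: list[list[str]] = []
for el_type, text in elements:
    if el_type == "Title" or not chunks:
        chunks.append([])
    chunks[-1].append(text)

chunks = ["\n".join(parts) for parts in chunks]
print(len(chunks))  # → 2, one chunk per section
```

Real implementations layer size limits on top of this (split a section that grows too large, merge tiny ones), but the section-aware grouping is the core idea.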
In so doing, we take advantage of information that we already have about the structure of the document, which allows us to construct chunks in more intelligent ways. You may also notice another property of this chunking strategy: we apply the chunking to the list of standardized document elements. Since we're applying the chunking strategy to that standardized list, you can apply different chunking strategies to the same initial output, which allows you to rapidly experiment with different chunking techniques and see what the results look like in your LLM end user application. Let's take an example of what this looks like in practice. In the top example, you'll see what things look like when you chunk using a traditional character splitting technique. In this case, we've applied a threshold, and when we hit a certain character count, we split off into a new chunk. You can see that in this case, you get information from the first section bleeding over into the second chunk. So if you were to search on open domain question answering, you would get information from the first chunk, which is all about that topic, and you would get information from the second chunk, which includes information that you're interested in but also includes information about abstractive question answering, which in this case you're not interested in. Wouldn't it be better if you could just chunk based on these sections? When you apply chunking with document elements, you wind up with results that look more like this: you find your titles, and you split off into new chunks when you hit a new title. When you're doing this, you're able to keep together content that belongs to the same section. This time, when you query for information about open domain question answering, you only get content about open domain question answering. You don't get any of the content that you're not interested in about abstractive question answering.
And so when you query your vector database, you get better results back, which is going to result in a better prompt for your LLM, which is going to result in a better answer from your LLM. Now that you've learned about chunking using document elements, you can see what that looks like in practice using the same document that you already worked with. In this case, you can start by deserializing your serialized JSON content with the document element information. To do that, you can use the dict_to_elements function. Once you have that, it's easy to chunk this content using the chunk_by_title function, which operates on the elements that you deserialized. You can also take a look at a few of the settings here. The first one instructs the chunking strategy to combine chunks if an element has fewer than 100 characters in it, which prevents having very, very small chunks. The second places a character limit on the size of the chunks, so if a chunk exceeds that limit, we'll split off into a new chunk. After you've set these parameters, you can chunk the content, and you can see what this looks like by using the JSON display function. You can see here, if you remember from the beginning of the lesson, that this was the title of the book, and then you can see that information about the e-book got chunked in with that title. You can experiment with different settings for combine_under_n_chars and max_characters to see what the results look like. You can also verify that the content was indeed chunked, because we have 752 elements in our original output but only 255 chunks. By doing that, you're able to verify that the chunks indeed combined different elements, and now we have fewer but larger chunks. So now that you've learned how to use these functions, I'd encourage you to do two things. One, partition a document of your own, try chunking it, and see what the outputs look like.
You can also try loading the chunk document into the vector database and performing a query, just like you learned earlier in this lesson. In the next lesson, you'll learn more about complex preprocessing techniques that run on PDFs and images. All right, see you in the next lesson.