the elements in an embeddings-based retrieval system, and how that fits together with an LLM in a retrieval augmented generation loop. So, let's go! The first thing I'd like to show you is the overall system diagram of how this works in practice. The way retrieval augmented generation works is: you have some user query that comes in, and you have a set of documents which you've previously embedded and stored in your retrieval system, in this case Chroma. You run your query through the same embedding model you used to embed your documents, which generates a query embedding. The retrieval system then finds the most relevant documents by finding the document embeddings that are nearest neighbors of the query embedding. We then pass both the query and the relevant documents to the LLM, and the LLM synthesizes information from the retrieved documents to generate an answer.

Let's show how this works in practice. To start with, we're going to pull in some helper functions from our utilities. This is just a basic word wrap function which lets us look at the documents in a nicely pretty-printed way. The example we're going to use reads from a PDF, so we're going to pull in PdfReader from pypdf. This is a really simple, open-source Python package that you can easily import, and we're going to read from Microsoft's 2022 annual report. To do that, we extract the text from the report using the PDF reader: for every page the reader finds, we extract the text from that page and strip the whitespace characters. The other important thing we need to do is make sure we're not sending any empty strings, that is, any empty pages, to our retrieval system. So we filter those out as well; this little loop just checks whether a string is empty and, if it is, doesn't add it to the final list of PDF texts. Just to show the sort of output we get, we'll print the extracted text of the first page. And here we are; this is what the PDF reader has extracted as text from the first page of the document.

In our next step, we need to chunk up these pages, first by character and then by token. To do that, we're going to grab some useful utilities from LangChain: the recursive character text splitter and the sentence transformers token text splitter. It's important that we use the sentence transformers token text splitter, and I'll explain why in just a moment. But first, let's start with the character splitter. The character splitter allows us to divide text recursively according to certain divider characters. What that practically means is: for each piece of text it's given, the recursive character text splitter first splits on double newlines, and then, if the resulting chunks are still larger than our target chunk size of 1000 characters, it moves on to the next separator, then the next, then just a space, and finally it splits on character boundaries themselves. We've also selected a chunk overlap of 0. This is a hyperparameter that you can play with to decide what optimal chunking looks like for you.
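As a rough sketch of the steps described so far, the extraction and character splitting look something like this, assuming the pypdf package and LangChain's text splitters; the file name and the exact separator list are illustrative:

```python
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract the text of every page and strip surrounding whitespace
reader = PdfReader("microsoft_annual_report_2022.pdf")  # assumed file name
pdf_texts = [page.extract_text().strip() for page in reader.pages]

# Filter out empty pages so no empty strings reach the retrieval system
pdf_texts = [text for text in pdf_texts if text]

# Split recursively: double newline first, then progressively smaller
# separators, aiming for chunks of at most ~1000 characters, with no overlap
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # assumed separator order
    chunk_size=1000,
    chunk_overlap=0,
)
character_split_texts = character_splitter.split_text("\n\n".join(pdf_texts))
```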
So let's go ahead and run these. We're going to look at the 10th chunk that the character text splitter gives us, and we're also going to output the total number of chunks. So let's run this cell and take a look at the output. We see the 10th chunk is all of this text, according to the recursive character text splitter, and there are 347 chunks in total from this annual report PDF.

So now we've split by character, but character text splitting isn't quite enough. The reason is that the embedding model we use, a sentence transformer, has a limited context window width: 256 tokens is the maximum context window length of our embedding model. This is a minor pitfall. If you're not used to working with embeddings, you may not consider the embedding model's context window itself, but it's very important, because typically an embedding model will simply truncate any tokens beyond its context window. So to make sure we're actually capturing all the meaning in each chunk when we go to embed it, it's very important that we also chunk according to token count. What we're doing here is using the sentence transformers token text splitter, again with a chunk overlap of zero, and 256 tokens per chunk, which is the context window length of the sentence transformer embedding model; I'll go into more detail about that embedding model in a little bit. We're essentially taking all of the chunks that were generated by the character text splitter and re-splitting them using the token text splitter. Let's produce similar output to the last cell and see what we observe. We see a similar chunk; it's a little different from what we got before, and obviously fewer characters, because we have only 256 tokens. This is, again, the 10th chunk. And we notice that we have a couple more chunks than before: in the previous output we had 347 chunks, and in this output we have 349, so it's divided a couple of the existing chunks into more pieces.

So we have our text chunks. That's the first step in any retrieval augmented generation system. The next step is to load the chunks into our retrieval system, and in this case we'll be using Chroma. To use Chroma, we need to import Chroma itself, and we're going to use the sentence transformer embedding model, as promised. Now, let's talk a little bit about the sentence transformer embedding model, what it actually means, and what an embedding model really even is. The sentence transformer embedding model is essentially an extension of the BERT transformer architecture. The BERT architecture embeds each token individually: here we have the [CLS] classifier token followed by the tokens for "I like dogs", and each token receives its own dense vector embedding. What a sentence transformer does is allow you to embed entire sentences, or even small documents like we have here, by pooling the output of all the token embeddings to produce a single dense vector per document, or in our case, per chunk. Sentence transformers are great as an embedding model. They're open source, all the weights are available online, and they're really easy to run locally. They come built into Chroma, and you can learn more about them by looking up the sentence transformers website or taking a look at the linked paper.
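Here's a minimal sketch of the token-based re-splitting step described above, assuming the character_split_texts list from the previous step; the printout of the 10th chunk and the total count mirrors what we inspect in the notebook:

```python
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Re-split every character-based chunk so nothing exceeds the embedding
# model's 256-token context window
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=0,
    tokens_per_chunk=256,
)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(token_split_texts[10])                    # inspect the 10th chunk
print(f"Total chunks: {len(token_split_texts)}")
```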
So that's why we're using sentence transformers, and now hopefully it makes sense why we use the sentence transformers token text splitter. What we're going to do is create a sentence transformer embedding function for use with Chroma, and we're going to demonstrate what happens when this embedding function gets called, so let's take a look at the output. Now, you may get a warning about HuggingFace tokenizers; this is a minor bug in HuggingFace and nothing to worry about. And here's the output that we get. You can see this is one very long dense vector: every entry in the vector has a number associated with it, and this is the representation of the 10th text chunk that we showed you before. This vector has 384 dimensions, which sounds like a lot until you consider the full dimensionality of all English text, which is much, much higher.

The next step is to set up Chroma. We're going to use the default Chroma client, which is great if you're just experimenting in a notebook, and we're going to make a new Chroma collection called Microsoft Annual Report 2022. We also pass in the embedding function we defined before, which, as you remember, is the sentence transformer embedding function. We create IDs for each of the text chunks, which are just the string of each chunk's position in the token split texts, and then we add those documents to our Chroma collection. To make sure that everything is the way we expect, let's output the count after everything's been added, and let's run this cell.

So now that we have everything loaded into Chroma, let's connect an LLM and build a full-fledged RAG system. We're going to demonstrate how querying, retrieval, and LLMs all work together. Let's start with a pretty simple query. I think if you're reading an annual financial report, one of the top questions you have in mind is: what was the total revenue for this year? What we're going to do is get some results from Chroma by querying it. We see here that we call query on our collection, we pass our query texts, and we ask for five results. Under the hood, Chroma will use the embedding function you've defined on your collection to automatically embed the query for you, so you don't have to call that embedding function again yourself. Then we pull the retrieved documents out of the results; the zero on the end is saying, give me the results for the zeroth query, since we only have the one query. What we're going to output now is the retrieved documents themselves, so let's run the cell and take a look. We see that the documents we get are fairly relevant to our query, "What was the total revenue?": we have revenue classified by different product and service offerings, there's a mention of unearned revenue, and there's more information in a similar vein. The next step is to use these results together with an LLM to answer our query. We're going to use GPT for this, and we need to do a little bit of setup so that we have an OpenAI client. We're going to load our OpenAI API key from the environment so that we can authenticate, and we're going to create an OpenAI client.
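Here's a sketch of the Chroma setup and the query described above, assuming the token_split_texts list from earlier; the collection name is illustrative:

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# The embedding function wraps the sentence transformer model discussed above
embedding_function = SentenceTransformerEmbeddingFunction()

# The default client keeps everything in memory, which is fine for a notebook
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection(
    "microsoft_annual_report_2022",            # assumed collection name
    embedding_function=embedding_function,
)

# IDs are just the string form of each chunk's position in the list
ids = [str(i) for i in range(len(token_split_texts))]
chroma_collection.add(ids=ids, documents=token_split_texts)
print(chroma_collection.count())

# Chroma embeds the query with the collection's embedding function for us
query = "What was the total revenue?"
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]  # results for the zeroth (only) query
```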
This is using their new version one API, where they've wrapped everything in one nice client object for us. Running the cell, there won't be any output here, but everything's ready to go. Now we're going to define a function that allows us to call out to the model using our retrieved results along with our query. We're going to use GPT-3.5 Turbo, which does a reasonably good job in RAG loops and is fairly quick. The first thing is that we pass in our query and retrieved documents. We join the retrieved documents into a single string called information, using a double newline to do so. Then we set up some messages. The first is the system prompt. The system prompt essentially defines how the model should behave in response to your input, and here we're saying: you are a helpful expert financial research assistant; your users are asking questions about information contained in an annual report; you'll be shown the user's question and the relevant information from the annual report; answer the user's question using only this information. What this is doing, and this is really the core of the entire RAG loop, is turning GPT from a model that remembers facts into a model that processes information. That's the system prompt. Next we add another message for our user content: we're in the role of the user, and the content is essentially a formatted string that says, here's our question (which is just our original query), and here's the information you're supposed to use. Then we send the request to the OpenAI client, which just uses the client's normal API; there's nothing special here at all. We're calling the chat completions endpoint on the OpenAI client, specifying the model and the messages we'd like to send, and getting the response back. And then we do a little bit more to unpack the response from what the client returns.

So we've defined our RAG function, and now let's actually use it and put everything together. We're going to say output is equal to calling rag with our query and retrieved documents, then print the word-wrapped output and fire away. There we go: the total revenue for the year ended June 30, 2022 was $198,270 million for Microsoft. Microsoft is doing pretty well. Now's a good time to take a moment and try some of your own queries. Remember, we specified the query a little further up: "What was the total revenue?" Try some of your own and see what the model outputs based on the retrieved results from the annual report. I think it's really important to play with your retrieval system to gain intuition about what the model and the retriever can and can't do together before we dive into analyzing how the system works. In the next lab, we're going to talk about some of the pitfalls and common failure modes of using retrieval in a retrieval augmented generation loop.
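Before moving on, here's a minimal sketch of the OpenAI client setup and the RAG function walked through above, assuming the openai v1 Python client and the query and retrieved_documents variables from the previous cell; the prompt wording is paraphrased from the lesson and the helper is printed directly rather than through the word wrap utility:

```python
import os
from openai import OpenAI

# The v1 API wraps everything in a single client object;
# the API key is read from the environment
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    # Join the retrieved chunks into one block of supporting information
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful expert financial research assistant. "
                "Your users are asking questions about information contained in an annual report. "
                "You will be shown the user's question and the relevant information from the annual report. "
                "Answer the user's question using only this information."
            ),
        },
        {
            "role": "user",
            "content": f"Question: {query}\nInformation: {information}",
        },
    ]

    # Call the chat completions endpoint and unpack the answer text
    response = openai_client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

output = rag(query=query, retrieved_documents=retrieved_documents)
print(output)
```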