We just went over how to load documents into a standard format. Now, we're going to talk about how to split them up into smaller chunks. This may sound really easy, but there are a lot of subtleties here that make a big impact down the line. Let's jump into it!

Document splitting happens after you load your data into the document format, but before it goes into the vector store, and it may seem really simple: you could just split the text into chunks of equal character length, or something like that. But as an example of why this is both trickier and more important than it looks, take a look at this example here. We've got a sentence about the Toyota Camry and some of its specifications. If we did a simple split, we could end up with part of the sentence in one chunk and the rest of it in another. Then, when we try to answer a question down the line about the specifications of the Camry, the right information isn't in either chunk, because it's been split apart, and we wouldn't be able to answer the question correctly. So, there's a lot of nuance and importance in how you split the chunks, so that semantically related content ends up together.

The basis of all the text splitters in LangChain is splitting text into chunks of some chunk size, with some chunk overlap between them. We have a little diagram below to show what that looks like. The chunk size corresponds to the size of a chunk, and that size can be measured in a few different ways; we'll talk about a few of those in this lesson. We allow passing in a length function to measure the size of a chunk, and this is often characters or tokens. The chunk overlap is a small overlap kept between two chunks, like a sliding window as we move from one to the next. It allows the same piece of context to appear at the end of one chunk and at the start of the next, and helps create some notion of consistency.

The text splitters in LangChain all have a create documents and a split documents method. These involve the same logic under the hood; they just expose slightly different interfaces, one that takes in a list of texts and another that takes in a list of documents.

There are a lot of different types of splitters in LangChain, and we'll cover a few of them in this lesson, but I'd encourage you to check out the rest in your spare time. These text splitters vary across a bunch of dimensions. They can vary in how they split the chunks, that is, which characters they split on. They can vary in how they measure the length of the chunks: is it by characters? Is it by tokens? There are even some that use other, smaller models to determine where the end of a sentence might be and use that as a way of splitting chunks.

Another important part of splitting into chunks is the metadata: maintaining the same metadata across all chunks, but also adding new pieces of metadata when relevant, and there are some text splitters that are really focused on that.

The splitting of chunks can also be specific to the type of document we're working with, and this is really apparent when splitting code. We have a language text splitter with a bunch of different separators for a variety of languages like Python, Ruby, and C, and when splitting these documents, it takes the relevant separators for each language into account.
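As a minimal sketch of that shared interface, assuming the classic langchain package layout (the chunk sizes, example text, and metadata here are made up for illustration):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

# Every splitter is configured with a chunk size, a chunk overlap, and a
# length function that measures chunk size (characters here, via len).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum size of a chunk
    chunk_overlap=150,    # sliding-window overlap between consecutive chunks
    length_function=len,  # how the size of a chunk is measured
)

# Same logic under the hood, two interfaces:
# create_documents takes a list of raw strings...
docs_from_texts = splitter.create_documents(["some long text ..."])

# ...while split_documents takes a list of already-loaded Documents.
docs_from_docs = splitter.split_documents(
    [Document(page_content="some long text ...", metadata={"source": "example.txt"})]
)
```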
First, we're going to set up the environment as before by loading the OpenAI API key. Next, we're going to import two of the most common types of text splitters in LangChain: the recursive character text splitter and the character text splitter. We're going to first play around with a few toy use cases just to get a sense of what exactly these do. We'll set a relatively small chunk size of 26, and an even smaller chunk overlap of 4, just so we can see what they do. Let's initialize these two text splitters as r_splitter and c_splitter, and then take a look at a few different use cases.

Let's load in the first string: a, b, c, d, all the way down to z, and look at what happens when we use the various splitters. When we split it with the recursive character text splitter, it still ends up as one string. That's because the string is 26 characters long and we've specified a chunk size of 26, so there's actually no need to do any splitting. Now, let's try a slightly longer string, one that's longer than the 26 characters we've specified as the chunk size. Here we can see that two different chunks are created. The first one ends at z, so that's 26 characters. The next one starts with w, x, y, z: those four characters are the chunk overlap, and then it continues with the rest of the string.

Let's take a look at a slightly more complex string, one with spaces between the characters. We can now see that it's split into three chunks, because the spaces count toward the chunk size. And if we look at the overlap, we can see that l and m appear at the end of the first chunk and again at the start of the second. That seems like only two characters, but the surrounding spaces count too, and together they make up the four characters of chunk overlap.

Let's now try the character text splitter. When we run it, it doesn't actually split the text at all. So, what's going on here? The issue is that the character text splitter splits on a single separator, and by default that separator is a double newline, but there are no newlines in this string. If we set the separator to be a space, we can see what happens then: it's split in the same way as before.

This is a good point to pause the video and try some new examples, both with different strings that you've made up and with different separators swapped in. It's also interesting to experiment with the chunk size and chunk overlap, and just generally get a sense of what's happening in a few toy examples, so that when we move on to more real-world examples, you'll have good intuition for what's happening behind the scenes.
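Here's that toy walkthrough as code, following the setup just described; the commented outputs reflect the behavior discussed above:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text1 = "abcdefghijklmnopqrstuvwxyz"
r_splitter.split_text(text1)
# ['abcdefghijklmnopqrstuvwxyz'] -- exactly 26 characters, so no split is needed

text2 = "abcdefghijklmnopqrstuvwxyzabcdefg"
r_splitter.split_text(text2)
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg'] -- 'wxyz' is the 4-character overlap

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
# 'l m' plus the surrounding spaces makes up the 4-character overlap

c_splitter.split_text(text3)
# ['a b c d e f g h i j k l m n o p q r s t u v w x y z'] -- no split, because
# the default separator (a double newline) never appears in this string

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=" "
)
c_splitter.split_text(text3)
# now it splits the same way the recursive splitter did
```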
Now, let's try it out on some more real-world examples. We've got this long paragraph here, and right about here we can see a double newline, which is a typical separator between paragraphs. Let's check the length of this piece of text: it's just about 500 characters. Now, let's define our two text splitters. We'll work with the character text splitter as before, with a space as the separator, and then we'll initialize the recursive character text splitter. Here, we pass in a list of separators. These are the default separators, but we're writing them out in this notebook to better show what's going on.

We can see that the list is a double newline, a single newline, a space, and finally an empty string. What this means is that when you're splitting a piece of text, it will first try to split on double newlines. Then, if it still needs to split the individual chunks further, it moves on to single newlines, then to spaces, and finally, if it really needs to, it goes character by character.

Looking at how these perform on the text above, the character text splitter splits on spaces, so we end up with a weird break in the middle of a sentence. The recursive text splitter first tries to split on double newlines, so here it splits the text into two paragraphs. Even though the first one is shorter than the 450 characters we specified, this is probably a better split, because each chunk now contains a whole paragraph rather than a sentence cut in half.

Let's now split into even smaller chunks just to build better intuition, and add a period as a separator, aimed at splitting between sentences. If we run this text splitter, we can see that the text is split on sentences, but the periods are in the wrong places. This is because of the regex that's used behind the scenes. To fix it, we can specify a slightly more complicated regex with a lookbehind. Now, if we run this, the text is split into sentences, and it's split properly, with the periods in the right places.

Let's now do this on an even more real-world example, with one of the PDFs we worked with in the first document loading section. We load it in and then define our text splitter. Here we pass the length function. This is using len, the Python built-in. It's the default, but we're specifying it to make clearer what's going on behind the scenes: it's counting the length in characters. Because we now want to use documents, we call the split documents method and pass in a list of documents. If we compare the number of those documents to the number of original pages, we can see that a bunch more documents have been created as a result of the splitting. We can do a similar thing with the Notion DB that we used in the first lecture, and once again, comparing the number of original documents to the number of new split documents, we've got a lot more documents now that we've done all the splitting. This is a good point to pause the video and try some new examples.

So far, we've done all the splitting based on characters. But there's another way to split, based on tokens, and for this let's import the token text splitter. The reason this is useful is that LLMs often have context windows designated by token count. So, it's important to know what the tokens are and where they appear, so that we can split on them and get a slightly more representative idea of how the LLM would see the text. To really get a sense of the difference between tokens and characters, let's initialize the token text splitter with a chunk size of 1 and a chunk overlap of 0. This will split any text into a list of the individual tokens.
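Sketched in code; the 450-character chunk size is the one mentioned above, while the other sizes and the PDF path are illustrative stand-ins:

```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)
from langchain.document_loaders import PyPDFLoader

c_splitter = CharacterTextSplitter(chunk_size=450, chunk_overlap=0, separator=" ")

# The default separator hierarchy, written out explicitly:
# paragraphs, then lines, then spaces, then individual characters.
# Apply either splitter to the paragraph text above with split_text.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
)

# Splitting between sentences: a regex lookbehind keeps each period
# attached to the sentence it ends. (Chunk size 150 is illustrative.)
sentence_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
)

# Splitting whole documents rather than raw strings; the path is a placeholder
# for the PDF loaded in the earlier section. length_function=len is the
# default, spelled out here for clarity: it counts characters.
pages = PyPDFLoader("docs/example.pdf").load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
)
docs = text_splitter.split_documents(pages)
print(len(docs), len(pages))  # typically many more chunks than original pages

# Token-based splitting (uses tiktoken under the hood): chunk_size=1 with
# no overlap yields one token per chunk.
token_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
token_splitter.split_text("foo bar bazzyfoo")
# ['foo', ' bar', ' b', 'az', 'zy', 'foo']
```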
Let's create a fun made-up text, and when we split it, we can see that it's split into a bunch of different tokens, each a little different in length and in the number of characters it contains. The first one is just foo, then we've got a space followed by bar, then a space and just b, then az, then zy, and then foo again. This shows a little bit of the difference between splitting on characters and splitting on tokens.

Let's apply this to the documents we loaded above. In a similar way, we can call split documents on the pages, and if we take a look at the first document, we have our new split document, with the page content being roughly the title, and metadata carrying the source and the page it came from. You can see here that the source and page metadata in the chunk are the same as in the original document, and if we take a look at page 0's metadata just to make sure, we can see that it lines up. This is good: the metadata is being carried through to each chunk appropriately.

But there can also be cases where you actually want to add more metadata to the chunks as you split them. This can be information like where in the document the chunk came from, or where it sits relative to other things or concepts in the document, and generally this information can be used when answering questions to provide more context about what this chunk is exactly. This is a good point to pause and try a few examples that you come up with.

To see a concrete example, let's look at another type of text splitter that actually adds information into the metadata of each chunk: the markdown header text splitter. What it does is split a markdown file based on its headers and subheaders, and then add those headers into the metadata fields, which get passed along to any chunks that originate from those splits.

Let's do a toy example first, and play around with a document where we've got a title, then a subheader of chapter 1, some sentences there, then a section under an even smaller subheader, and then we jump back out to chapter 2 with some sentences there. Let's define a list of the headers we want to split on, along with their names. First, a single hashtag, which we'll call header 1; then two hashtags, header 2; then three hashtags, header 3. We can then initialize the markdown header text splitter with those headers and split the toy example above.

If we take a look at a few of these splits, we can see the first one has the content "Hi, this is Jim." "Hi, this is Joe.", and in the metadata we have header 1 with the value Title and header 2 with the value Chapter 1, coming from right here in the example document above. Looking at the next one, we've jumped down into an even smaller subsection, so we've got the content "Hi, this is Lance," and now not only header 1 but also header 2 and header 3, again coming from the content and the names in the markdown document above.
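A sketch of that toy example; the markdown text here is a reconstruction of the document described above, and the printed outputs are approximate:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = """# Title

## Chapter 1

Hi, this is Jim.

Hi, this is Joe.

### Section

Hi, this is Lance.

## Chapter 2

Some sentences for chapter 2."""

headers_to_split_on = [
    ("#", "Header 1"),    # a single hashtag
    ("##", "Header 2"),   # two hashtags
    ("###", "Header 3"),  # three hashtags
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

md_header_splits[0]
# page_content='Hi, this is Jim.  \nHi, this is Joe.'
# metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}

md_header_splits[1]
# page_content='Hi, this is Lance.'
# metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}
```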
Let's try this out on a real-world example. Earlier, we loaded the Notion directory using the Notion directory loader, which loaded the files as markdown, and that's relevant for the markdown header splitter. So, let's load those documents, and then define the markdown splitter with header 1 as a single hashtag and header 2 as a double hashtag. We split the text, and we get our splits. If we take a look at them, we can see that the first one has the content of a page, and if we scroll down to its metadata, we can see that header 1 is Blendle's Employee Handbook.
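Put together, that real-world pass looks roughly like this; the docs/Notion_DB path and the step of joining the pages into one string are assumptions based on the earlier loading section:

```python
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Load the markdown files exported from Notion (path assumed from the loading lesson).
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = " ".join([d.page_content for d in docs])  # join the pages into one string

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(txt)

md_header_splits[0].metadata
# {'Header 1': "Blendle's Employee Handbook"}
```

We've now gone over how to get semantically relevant chunks with appropriate metadata. The next step is moving those chunks of data into a vector store, and we'll cover that in the next section.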