is a very popular application of LLMs. You'll learn a bit about that in this lesson, as well as some LangChain modules which make them easier to build, document loaders, and splitters. Let's get started. So now that we have some fundamentals down for chaining modules together, we'll continue on our journey to construct a chat-with-your-data application by going over some techniques to store our own documents for later retrieval, to ground the LLM's generation with that context. This is generally called Retrieval Augmented Generation, or RAG for short. The basic flow is as follows. First, we'll load documents from a source. These could be PDFs, databases, or the web. Then we'll split those docs into chunks small enough to fit into an LLM's context window, so as to avoid distraction. Then we'll embed those chunks in a vector store to allow for later retrieval based on input queries. And then when a user wants to access some data, we'll retrieve those relevant, previously split chunks and generate a final output with those chunks as context. In this lesson, we'll cover the first two steps in this flow.

To perform the first step, we'll use some of LangChain's document loaders. LangChain has a variety of document loaders that bring in data from various sources across the web or from proprietary sources. And then we'll get into splitting, where we'll use some of LangChain's text splitting abstractions to format our text in a way legible to the LLM.

So we'll start with loading. An example of one of the many LangChain loaders is GitHub. So we can paste in an import here. And this particular loader requires a peer dependency called "ignore" that's used to support .gitignore syntax. Once we've loaded it, we'll instantiate it with a GitHub repo, in this case, of course, LangChain.js. We won't go too deep into the packages for demo purposes, so we'll turn off the recursive option. And then we'll ignore certain paths, like documentation in the form of markdown files and the very long yarn.lock file. Next, after instantiating it, we'll load it. So define the output and then run loader.load. All the arguments we need are in the constructor above. And then let's log some of the outputs. So docs.slice, zero to three, and run it. And then we can see that we are indeed returning a few of the files from the LangChain.js repo at the top level. So there's an .editorconfig file, a .gitattributes file, these are mostly meta, and then a .gitignore. So these are getting pulled directly from LangChain's GitHub repo. And we can see the contents here as well.

So the definition of data can be pretty loose. It doesn't have to be only structured PDFs. It can be GitHub files, code, it could be SQL rows, very broad. But PDFs are a pretty big use case, so let's see what it would look like to load a PDF. And of course, given that this is a DeepLearning.AI course, why don't we use a transcript of Andrew Ng's famous CS229 course on machine learning? So much like in the GitHub example above, we will import it with a peer dependency, in this case, pdf-parse. And then I've already prepared a copy of this transcript on file in the notebook's file system under this file path. So I'll run that. I have to initialize it, and then run that same loader.load command: cs229Docs equals await loader.load. And then let's log just the first three pages that we'll get. And this time we see that we get the title of the transcript, Machine Learning Lecture 01, with our famous instructor Andrew Ng. And then pages after that.
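To make the loading steps concrete, here's a minimal sketch of what running these two loaders might look like. The import paths assume an older single-package LangChain.js layout (in newer versions these loaders live under @langchain/community), the PDF file path is a placeholder for wherever the transcript lives in your filesystem, and the variable names are just illustrative.

```js
// Peer dependencies assumed installed: "ignore" (for .gitignore syntax)
// and "pdf-parse" (for the PDF loader).
import { GithubRepoLoader } from "langchain/document_loaders/web/github";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";

// Load top-level files from the LangChain.js repo, skipping markdown docs
// and the very long yarn.lock file.
const githubLoader = new GithubRepoLoader(
  "https://github.com/langchain-ai/langchainjs",
  {
    recursive: false,
    ignorePaths: ["*.md", "yarn.lock"],
  }
);
const githubDocs = await githubLoader.load();
console.log(githubDocs.slice(0, 3));

// Load the CS229 transcript PDF; each page becomes its own document
// with page metadata attached. The path here is a placeholder.
const pdfLoader = new PDFLoader("./data/MachineLearning-Lecture01.pdf");
const cs229Docs = await pdfLoader.load();
console.log(cs229Docs.slice(0, 3));
```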
So this loader splits by pages: page one, page two, page three, page four, and finally page five. There are many pages after this; we're just showing the first five loaded documents. There's also some metadata, which is useful for more advanced querying and more advanced filtering, which will be a bit beyond the scope of this lesson.

Now that we've loaded some data, let's get into splitting. And the idea here, and the goal as a reminder, is to try to keep semantically related ideas together in the same chunk so that the LLM gets an entire self-contained idea without distraction. You'll notice in the previous example with the CS229 transcript, there are many, many pages with a lot of text, and our LLMs only have a fixed amount of attention to give to each chunk. So there are many different strategies for splitting data, and it's really going to depend on what you're loading. For the GitHub JavaScript example, we may want to split on code-specific delimiters, because those tend to group input documents into function definitions or classes that the LLM can work with, rather than splitting somewhere in the middle.

So to show what this looks like, we'll import a text splitter and then we'll initialize it like this. And you'll notice that we've set a few parameters here. We're using the fromLanguage initializer and passing it the JavaScript language, so it knows to use some of the common JS language features as separators between chunks. We set a small max chunk size for demo purposes. This is 32 characters, which is much smaller than you'd probably want to use in practice, but we'll get the point across here. And we set a chunk overlap of zero. Overlap can be useful in some cases to sort of let your chunks flow into each other and get cut off at more natural points, but again, for demo purposes, we'll set this to zero. So after initializing it, let's give it some code. And we'll use a simple hello world function here, the declaration, and then a call with a comment. And let's try splitting on that code. And you can see the result here is four chunks with pretty natural splits. So we get the function definition for hello world, the log statement all in one chunk, the comment on its own, and then the call on its own as well.

And to show kind of like the alternative here, if we split naively, for example using something like spaces as a separator, we may get chunks containing, for example, half a log statement, which makes the LLM's job more difficult on final generation. To show what this looks like, let's import a more naive character splitter. We'll set similar parameters here, a small chunk size and no overlap, and then a pretty naive separator, just spaces, and we'll leave this on screen for comparison's sake. We'll give it the same code, and then let's just see what happens when we call split on it. And as you can see, we did get the function definition on one line, which is something, but we split the arguments to the function. So you can imagine if the LLM got a chunk with half of a function call, it wouldn't really know what was on the other half and would lose some context. And then we've also split up the comment block, and, yeah, just made this a little bit harder for the LLM to deal with.
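As a rough sketch of the comparison above (the exact import path varies by LangChain.js version, and the hello-world snippet and variable names are just illustrative):

```js
import {
  RecursiveCharacterTextSplitter,
  CharacterTextSplitter,
} from "langchain/text_splitter";

const code = `function helloWorld() {
console.log("Hello world!");
}
// Call the function
helloWorld();`;

// Code-aware splitting: treats common JS constructs as separators,
// so chunks tend to break at declarations, comments, and calls.
const jsSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 32,
  chunkOverlap: 0,
});
console.log(await jsSplitter.splitText(code));

// Naive splitting on spaces for comparison: chunks can cut straight
// through a function call or a comment.
const naiveSplitter = new CharacterTextSplitter({
  chunkSize: 32,
  chunkOverlap: 0,
  separator: " ",
});
console.log(await naiveSplitter.splitText(code));
```

With the same 32-character limit, the code-aware splitter keeps each statement intact, while the space-separated splitter can end a chunk mid-call.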
And just quickly, to show off some of the ways you can tune this to improve performance here, you know, let's say we want to get some of the function body in with the declaration, because it's nice that we have the function declaration on its own, and that's useful, but maybe we also want some of the function body as context as well. So we can try something pretty similar to the above, where we use our recursive text splitter, but let's turn up the chunk size a little bit, and let's also give it some overlap. So let's try it with some tuned parameters, where we increase the chunk size and make the chunks overlap a little bit, so that the content gets split a little more naturally. And this time you can see that even though it's a little bit less efficient, in that we're sort of putting redundant information into these chunks, we do get the entire function definition and function body in the same chunk, which means an LLM receiving this data would have full context into the entirety of the function.

So, LangChain includes several different splitter options for different types of content, including Markdown, HTML, JavaScript, which you just saw, Python, and more. But for generic written text, the Recursive Character Text Splitter is a great place to start, since it splits chunks on paragraphs, which are natural boundaries for people to split up their thoughts and points. So let's initialize one that we'll use on our previously loaded CS229 class. And it'll be a text splitter this time with a little bit bigger chunks. So we're going to use 512 characters maximum and then a 64-character overlap. And there we go. Let's split up our previously ingested transcript of Andrew Ng's course and see what happens (a sketch of both splitter configurations follows below). And let's log the result. We'll again take the first five so that our output doesn't get too long here. And this time we can see that we get a chunk on page one and then another chunk on page one as well. So, you know, these are smaller than one page and should still ideally be self-contained ideas that we're going to pass to the LLM. So we get some info on what Paul Baumstark is doing with machine learning. Some other folks too: Daniel Ramage applies learning algorithms to problems in natural language, and there are some thoughts on Andrew Ng's daily applications of deep learning. So yeah, I think that looks like a pretty good start. In the next section, we'll show how to embed and add these chunks to a vector store for easier search and querying.
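Here's a minimal sketch of both the tuned code splitter and the transcript splitter described above. The specific tuned values (64-character chunks with a 32-character overlap) are illustrative rather than prescribed, and `code` and `cs229Docs` refer to the variables from the earlier sketches.

```js
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Tuned code splitter: a bigger chunk size plus some overlap so the
// whole declaration and function body can land in a single chunk,
// at the cost of some redundant content across chunks.
const tunedJsSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 64,
  chunkOverlap: 32,
});
console.log(await tunedJsSplitter.splitText(code));

// Generic splitter for the CS229 transcript: 512-character chunks with
// a 64-character overlap, applied to the documents loaded earlier.
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 64,
});
const splitDocs = await textSplitter.splitDocuments(cs229Docs);
console.log(splitDocs.slice(0, 5));
```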