In order to create an application where you can chat with your data, you first have to load your data into a format where it can be worked with. That's where LangChain document loaders come into play. We have over 80 different types of document loaders, and in this lesson we'll cover a few of the most important ones and get you comfortable with the concept in general. Let's jump in! Document loaders deal with the specifics of accessing and converting data from a variety of different formats and sources into a standardized format. There can be different places that we want to load data from, like websites, different databases, YouTube, and these documents can come in different data types, like PDFs, HTML, JSON. And so the whole purpose of document loaders is to take this variety of data sources and load them into a standard document object. Which consists of content and then associated metadata. There are a lot of different type of document loaders in LangChain, and we won't have time to cover them all, but here is a rough categorization of the 80 plus that we have. There are a lot that deal with loading unstructured data, like text files, from public data sources, like YouTube, Twitter, Hacker News, and there are also even more that deal with loading unstructured data from the proprietary data sources that you or your company might have, like Figma, Notion. Document loaders can also be used to load structured data. Data that's in a tabular format and may just have some text data in one of those cells or rows that you still want to do question answering or semantic search over. And so the sources here include things like Airbyte, Stripe, Airtable. Alright, now let's jump into the fun stuff, actually using document loaders. First, we're going to load some environment variables that we need, like the OpenAI API key. The first type of documents that we're going to be working with are PDFs. So let's import the relevant document loader from Langchain. We're going to use the PyPDF loader. We've loaded a bunch of PDFs into the documents folder in the workspace, and so let's choose one and put that in the loader. Now let's load the documents by just calling the load method. Let's take a look at what exactly we've loaded. So this, by default, will load a list of documents. In this case, there are 22 different pages in this PDF. Each one is its own unique document. Let's take a look at the first one and see what it consists of. The first thing the document consists of is some page content, which is the content of the page. This can be a bit long, so let's just print out the first few hundred characters. The other piece of information that's really important is the metadata associated with each document. This can be accessed with the metadata element. You can see here that there's two different pieces. One is the source information. This is the PDF, the name of the file that we loaded it from. The other is the page field. This corresponds to the page of the PDF that it was loaded from. The next type of document loader that we're going to look at is one that loads from YouTube. There's a lot of fun content on YouTube, and so a lot of people use this document loader to be able to ask questions of their favorite videos or lectures or anything like that. We're going to import a few different things here. The key parts are the YouTube audio loader, which loads an audio file from a YouTube video. The other key part is the OpenAI Whisper parser. This will use OpenAI's Whisper model, a speech-to-text model, to convert the YouTube audio into a text format that we can work with. We can now specify a URL, specify a directory in which to save the audio files, and then create the generic loader as a combination of this YouTube audio loader combined with the OpenAI Whisper parser. And then we can call "loader.load" to load the documents corresponding to this YouTube. This may take a few minutes, so we're going to speed this up and post. Now that it's finished loading, we can take a look at the page content of what we've loaded. And this is the first part of the transcript from the YouTube video. This is a good time to pause, go choose your favorite YouTube video, and see if this transcription works for you. The next set of documents that we're going to go over how to load are URLs from the Internet. There's a lot of really awesome educational content on the Internet, and wouldn't it be cool if you could just chat with it? We're going to enable that by importing the web-based loader from LangChain. Then we can choose any URL, our favorite URL. Here, we're going to choose a markdown file from this GitHub page and create a loader for it. And then next, we can call loader.load, and then we can take a look at the content of the page. Here, you'll notice there's a lot of white space, followed by some initial text, and then some more text. This is a good example of why you actually need to do some post-processing on the information to get it into a workable format. Finally, we'll cover how to load data from Notion. Notion is a really popular store of both personal and company data, and a lot of people have created chatbots talking to their Notion databases. In your notebook, you'll see instructions on how to export data from your Notion database into a format through which we can load it into LangChain. Once we have it in that format, we can use the Notion directory loader to load that data and get documents that we can work with. If we take a look at the content here, we can see that it's in markdown format, and this Notion document is from Blendle's Employee Handbook. I'm sure a lot of people listening have used Notion and have some Notion databases that they would like to chat with, and so this is a great opportunity to go export that data, bring it in here, and start working with it in this format. That's it for document loading. Here, we've covered how to load data from a variety of sources and get it into a standardized document interface. However, these documents are still rather large, and so in the next section, we're going to go over how to split them up into smaller chunks. This is relevant and important because when you're doing this retrieval augmented generation, you need to retrieve only the pieces of content that are most relevant, and so you don't want to select the whole documents that we've loaded here, but rather only the paragraph or few sentences that are most topical to what you're talking about. This is also an even better opportunity to think about what sources of data we don't currently have loaders for, but you might still want to explore. Who knows? Maybe you can even make a PR to LangChain.