In this lesson, you'll learn how to extract and normalize content from a diverse range of document types so your LLM can reference information from PDFs, PowerPoints, Word docs, HTML, and more. Alright, let's dive in.

First, you'll learn why you might want to normalize diverse document types. Documents come in a variety of formats. Just think of all the documents you might have seen at your organization or at your university: PDFs, Word documents, even HTML documents, and all of them are structured differently. When you're building an LLM application, you really don't want to have to care about where your documents came from. You want them all in a common format so that your application can treat them the same way. The first step toward that is to break documents down into common elements, like titles and narrative text.

There are a few benefits to this. First, it allows you to process any document in the same way regardless of the source format. That lets you perform operations like filtering out unwanted elements, such as headers and footers. If you wanted to do that for HTML and PDF documents separately, you'd have to write separate logic for each format; by normalizing them, you can do it all with the same snippet of code. Second, it allows you to apply downstream operations like chunking in the same way across all the different document types, so you don't have to create a different chunking strategy for each one. This also reduces processing cost. Typically, the most expensive part of pre-processing documents is extracting the initial content, while downstream operations like chunking are inexpensive. By normalizing all of the content to the same format, you can experiment with different chunking strategies in a computationally cheap way.
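As a quick sketch of what normalization buys you: once every document is reduced to a shared element schema, filtering out headers and footers is one generic function, whatever the source format. The field names and texts below are illustrative, not the library's actual output.

```python
# Hypothetical normalized elements; the schema here is an illustrative
# stand-in, not the exact output format of any particular library.
elements = [
    {"type": "Header", "text": "ACME Corp -- Internal"},
    {"type": "Title", "text": "Q3 Planning"},
    {"type": "NarrativeText", "text": "This quarter we will focus on reliability."},
    {"type": "Footer", "text": "Page 1 of 12"},
]

def drop_headers_and_footers(elements):
    """Keep only substantive elements, whatever format the document came from."""
    return [el for el in elements if el["type"] not in {"Header", "Footer"}]

kept = drop_headers_and_footers(elements)
# Only the Title and NarrativeText elements remain.
```

The same function would run unchanged on elements extracted from HTML, PDF, or PowerPoint, which is the whole point of normalizing first.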
Once you've normalized the content, the next step is typically data serialization, which lets you reuse your outputs later without having to pre-process the documents again. In this course, you'll serialize your data as JSON. There are other options, but JSON is convenient for a variety of reasons. First, it's a common and well-understood structure. Second, it's the standard format for HTTP API responses; this will come into play when you're processing documents like PDFs and images that require model-based workloads you'll run over an API. Third, JSON can be read from multiple programming languages. This matters because if you're building an application in JavaScript, you can pre-process your documents in Python, serialize the outputs to JSON, and then read those outputs into your JavaScript application. And finally, JSON works for streaming use cases, for instance by storing the documents as JSONL, with one JSON object per line. Here's an example of what the serialized outputs will look like when you pre-process the documents in this course.

Now you'll learn about pre-processing a few specific document types, starting with HTML. This is relevant for LLMs because in a lot of cases you want to read content from the internet into your LLM application. When you're processing HTML documents, typically you're looking at HTML tags. For instance, an h1 tag indicates that content is likely a title, while a paragraph tag indicates that content is likely narrative text. In addition to using these tags, it's also helpful to apply natural language processing to the text itself to better categorize the content. For example, long content with multiple sentences inside a paragraph tag is likely narrative text, whereas short, all-capitalized content in a paragraph tag may be more likely to be a title.
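As a rough sketch of the tag-plus-text heuristics just described (a toy illustration of the idea, not the library's actual logic):

```python
import re

def classify(tag: str, text: str) -> str:
    """Toy element classifier combining the HTML tag with simple text cues."""
    text = text.strip()
    if tag == "h1":
        # Heading tags are strong structural evidence of a title.
        return "Title"
    if tag == "p":
        # Short, all-caps paragraph text often behaves like a heading.
        if text.isupper() and len(text.split()) <= 8:
            return "Title"
        # Longer text with sentence punctuation reads as narrative.
        if re.search(r"[.!?]", text) and len(text.split()) > 8:
            return "NarrativeText"
    return "UncategorizedText"

classify("h1", "Chunking for RAG")  # "Title"
classify("p", "SECTION TWO")        # "Title" despite the paragraph tag
classify("p", "Long paragraphs with full sentences read as narrative. They flow.")  # "NarrativeText"
```

Real extraction logic is far more involved, but the shape is the same: structural evidence from the markup, refined by what the text itself looks like.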
So you can use both the structured information (the tags) and the unstructured information (the text itself) within the document to categorize the elements. Here's an example of what that might look like on an actual page: the title is represented by an h1 tag, and the content in the second box, under narrative text, is represented by a paragraph tag.

Now that you've learned about pre-processing HTML, let's put it into practice in the notebook. First, you'll import a filter that suppresses warnings we don't need for classroom purposes. Next, you'll import a few helper functions from the unstructured open-source library. In this case, you'll use the partition_html function to pre-process the HTML document; later, you'll use partition_pptx as well. After importing these helpers, you'll also set up what you need to use the unstructured API. You'll use the API in the cases where you need to process PDFs or images in this course, because those are expensive model-based workloads that require a bit more setup. This code sets up your credentials and instantiates the unstructured API client.

Now let's take a look at the HTML file that you saw earlier in the lesson. Here's a screenshot of what the document looks like: it's a post from the unstructured blog. You'll read in the HTML file and then partition it using the unstructured open-source library. First, look at the file name; the file is contained within the directory of your notebook. Second, partitioning it with the open-source library is really easy: all you need to do is call the partition_html function that you imported earlier and pass in the file name. Now that you have your outputs, you can convert them to a dictionary.
Then, as you learned earlier in the lesson, you can convert that to JSON for serialization. Here you convert all of the elements to a dictionary, and from the dictionary it's easy to write a JSON file. For this initial output, you'll take a look at the string representation of the JSON object; later, you'll use a JSON display object that makes it easier to navigate. Now that you've seen what the raw JSON looks like, you can explore it more easily using the JSON display function from IPython. If you execute that, you'll see each element as an item in the list, and you can navigate through it easily. In this case, you can see that narrative text was detected here: it's the first paragraph of the blog post, correctly extracted as narrative text within the normalized output.

So now you've learned about pre-processing HTML content, which, again, is very important for web-scraping use cases, for inserting knowledge that's available on the internet into LLM applications. In a corporate setting, it's also important to be able to pre-process other document types, such as PowerPoint. These decks are widely used across business areas, such as consulting, and for corporate use cases this is crucial for expanding the LLM's knowledge of your organization. The extraction process for Microsoft PowerPoint is actually very similar to HTML: under the hood, PPTX files (the more modern PowerPoint format) are a bunch of XML that you can pre-process using rules-based logic. You may also be familiar with older .ppt PowerPoint files; when you're dealing with those, it's fairly easy to just convert them to .pptx first. Take a look at the file below, and then you'll see how to pre-process this document in your Jupyter notebook. Okay, so now you've learned how to pre-process PowerPoint documents.
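The dictionary-to-JSON step above needs nothing beyond the standard library. A minimal sketch, using illustrative element dictionaries as stand-ins for real pre-processing output:

```python
import json

# Illustrative element dictionaries, mimicking the shape of the converted output.
element_dicts = [
    {"type": "Title", "text": "A Guide to Chunking"},
    {"type": "NarrativeText", "text": "Chunking splits long documents into pieces."},
]

# From dictionaries, writing a JSON file is one call.
with open("elements.json", "w") as f:
    json.dump(element_dicts, f, indent=2)

# JSONL (one object per line) is the variant suited to streaming use cases.
with open("elements.jsonl", "w") as f:
    for el in element_dicts:
        f.write(json.dumps(el) + "\n")

# Reloading later skips the expensive extraction step entirely.
with open("elements.json") as f:
    reloaded = json.load(f)
```

This is what makes experimentation cheap: the extraction runs once, and every later chunking experiment just reloads the JSON.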
Let's put this into practice using the PowerPoint deck you just looked at about OpenAI. You can see what the document looks like here. Just like you did with the HTML file, you can pass it into the partition_pptx function from the unstructured open-source library. So grab the name of the file, and then, instead of partition_html this time, call partition_pptx with the file name, and you'll get your elements out. Just like before, you can convert these outputs into JSON and navigate the elements. You can see that the first item in the bulleted list was correctly identified as a list item, and its text was extracted. What's also important here is that the text "ChatGPT" was identified as a title. Again, this matters because once you've normalized these documents, your application can treat them all in the same way.

Okay, so now you've learned about HTML and PowerPoint. In both of those cases, you pre-processed documents using rules-based logic. Now you'll learn about a more complex case: PDFs. PDFs are a little different from HTML or PowerPoint. In those formats, you looked at semi-structured information for clues about how to divide the document into element types; in PDFs or images, you're going to look for visual cues instead. In PDFs, that means formatting within the document. A piece of text that's bold and underlined may be more likely to be a title, while something longer and blockier that contains multiple sentences and lacks emphasis like bolding or underlining may be more likely to be narrative text. That's what allows you to pre-process these documents.
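To make the "treat them all in the same way" point concrete: once the HTML and PowerPoint outputs share the normalized schema, one helper works on both. The dictionaries below are hypothetical stand-ins; the PowerPoint texts echo the deck in this lesson.

```python
# Hypothetical normalized outputs from two different source formats.
html_elements = [
    {"type": "Title", "text": "Introducing a New Feature"},
    {"type": "NarrativeText", "text": "Today we are announcing a new capability."},
]
pptx_elements = [
    {"type": "Title", "text": "ChatGPT"},
    {"type": "ListItem", "text": "Trained on large-scale text data"},
]

def titles(elements):
    """Works on any document's elements once they share the normalized schema."""
    return [el["text"] for el in elements if el["type"] == "Title"]

# One function, no per-format logic: HTML and PPTX elements mix freely.
all_titles = titles(html_elements + pptx_elements)
```

The same idea extends to filtering, chunking, or any other downstream step: write it once against the schema, not once per format.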
Again, you're going to use visual information within the document rather than the semi-structured information you used for HTML and PowerPoint. You can see an example here. Looking at this document, the text at the top, "All Experimental Results," is bold, so visually it looks like a title, whereas the text below looks like narrative text. A table within the document has been identified as well; you'll learn more about processing tables later in the course.

So now you've learned a bit about pre-processing PDFs. Let's put that into practice on a PDF, in this case an academic paper on chain-of-thought reasoning. First, take a look at what the document looks like in the images folder within your notebook directory. Again, you're going to find titles, sections of narrative text, and tables. You start by grabbing the file name just like in the previous cases. This time, however, instead of passing the file name to the unstructured open-source library, you're going to pass it to the unstructured API. The reason is that PDF processing is a model-based workload, which is computationally expensive and a little harder to run locally; with the API, the model is already set up and ready to go. So now that you have your file name, you can use the unstructured API client that you set up earlier to prepare your request, and then you're ready to make the call. Again, this is a model-based workload, so it will take a little time to process; if it takes a few seconds, don't worry. Okay, now your PDF is done processing.
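The API round trip can also be sketched with a plain HTTP request against the hosted partition endpoint. This is a hedged sketch, not the SDK client the notebook uses: the file path is a hypothetical stand-in for the chain-of-thought paper, and the call only runs when an API key is configured.

```python
import os
import requests

# Hosted Unstructured partition endpoint; requires an API key.
API_URL = "https://api.unstructured.io/general/v0/general"
api_key = os.environ.get("UNSTRUCTURED_API_KEY")

def partition_pdf_via_api(path: str):
    """Send a PDF to the API and get back the same normalized element
    schema that the open-source library produces locally."""
    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"unstructured-api-key": api_key},
            files={"files": (os.path.basename(path), f)},
        )
    resp.raise_for_status()
    return resp.json()

# Model-based workload: only make the call when credentials are configured.
# "example_files/paper.pdf" is a placeholder path for illustration.
if api_key:
    elements = partition_pdf_via_api("example_files/paper.pdf")
```

Because the response is the same element schema, everything downstream (serialization, filtering, chunking) is unchanged regardless of whether extraction ran locally or over the API.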
You can see that the JSON response from the API looks a lot like the JSON you serialized previously with the unstructured open-source library, and you can see the document elements here. Just like before, you can explore this JSON using the JSON display function from IPython. The key thing is that "All Experimental Results" was visually identified as a title, and it gets the same normalized type as the titles from your PowerPoint file and your HTML file.

Now you can try this on your own with another file using the file upload widget. Simply run this section of code and then instantiate the widget, and you'll be able to choose a file to add to your directory so that you can try it yourself. You can check the example files directory to see if it appears, and it did: "el_ninu.html." You'll see this file again later in the course. Now that the file is here, you can change the file name in the partition_html section to partition a file of your own. Scroll back up to that section, update the file name, and re-run the partition. As you can see, the content changed. So if you have a file of your own that you'd like to partition, you can use the widget to upload it and then partition it.

This concludes the first lesson. I encourage you to try this on your own documents: upload a few, experiment with the library, and see what the outputs look like. In the next lesson, you'll learn about more important pre-processing techniques, including metadata extraction and chunking.
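The upload step boils down to writing the widget's bytes into the notebook's files directory so the partition functions can find them. A minimal sketch, with the widget wiring left as an assumption (ipywidgets' FileUpload would supply the name and bytes; here an upload is simulated with inline bytes, and the directory name is illustrative):

```python
from pathlib import Path

def save_upload(name: str, content: bytes, dest_dir: str = "example_files") -> Path:
    """Write an uploaded file's bytes into the notebook's files directory,
    where the partition functions can pick it up by file name."""
    dest = Path(dest_dir)
    dest.mkdir(exist_ok=True)
    path = dest / name
    path.write_bytes(content)
    return path

# Simulated upload; in the notebook this content would come from the widget.
saved = save_upload("my_page.html", b"<h1>My Page</h1><p>Hello there.</p>")
```

After saving, pointing the earlier partition call at the returned path is all it takes to process a document of your own.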