In this lesson, you will learn the foundations of LLM data pre-processing. Let's dive in.

Before you learn how to pre-process specific data types, it's important to understand retrieval augmented generation, or RAG. This is a technique for grounding LLM responses in validated external information, such as your corporate information. This information can be contained in various document types, such as emails, PDFs, or PowerPoint slides. Conceptually, RAG applications load this external context into a database, retrieve the relevant content, insert it into a prompt, and then pass that prompt to an LLM. That way, by the time a question or a prompt gets to the LLM, it contains external information that the LLM can use to construct its response. Within industry applications, this is extremely relevant because organizational data exists in a wide variety of formats.

Pre-processing documents includes a few steps. First, you need to extract the document content. This is the text content from the documents that can be used to construct your prompts. There's other information that's important to extract as well, including document elements. These are the basic building blocks of a document, such as titles, narrative text, lists, and tables. You can use these document elements for a variety of tasks that are important to RAG applications, such as chunking and filtering. You will learn more about these operations later in the course. In addition, it is important to extract element metadata, such as page number or file type. Later in the course, you will learn how to use this information when performing hybrid search, which allows you to filter the information that you retrieve from a vector database when constructing your prompt.

So why is data pre-processing hard? Well, there are a few reasons. First, different document types have different content cues.
So an HTML file, for example, may use tag names to indicate whether a piece of text is a title or a list. PDFs rely on visual cues instead, which are an entirely different kind of signal. And so being able to pre-process all of these different document types in a common manner requires you to understand how each document type signals what kind of element a piece of text is.

Second, documents come in a variety of formats, and you need to standardize these so that your application can process them in the same way. Ideally, your application just needs to know what the external information is. It doesn't necessarily care whether the source document is an HTML file or a PDF. By standardizing these documents, you can treat all of them in the same way in your end user application. However, standardizing is hard because, again, these documents have different formats and different cues as to what the element types are.

Additionally, different kinds of documents are structured differently. A journal article and a form, for example, are organized quite differently. And so data pre-processing techniques need to account for those differences.

Finally, you need to understand information about document structure in order to extract meaningful metadata that you can use for various operations in RAG applications, such as filtering, which you'll learn about later in this course.

All in all, there are a lot of moving parts when it comes to data pre-processing. It isn't simple, but it's an important part of getting up and running with a RAG application. In the next lesson, we'll jump in and learn how to pre-process a few different document types, including PowerPoints, HTML, and PDFs.
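
To make the idea of content cues concrete before moving on, here is a minimal sketch of how HTML tag names could be turned into standardized document elements with metadata. It uses only the Python standard library. The names `Element`, `TAG_TO_TYPE`, `ElementExtractor`, and `partition_html` are hypothetical and invented for this illustration, as is the tag-to-type mapping. This is a toy under those assumptions, not the tooling used in this course, and it ignores nested tags, tables, and every other document type.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed, deliberately tiny mapping from HTML tag names (the "content cue")
# to standardized element types.
TAG_TO_TYPE = {
    "h1": "Title",
    "h2": "Title",
    "p": "NarrativeText",
    "li": "ListItem",
}


@dataclass
class Element:
    """A standardized document element: a type, its text, and metadata."""
    type: str
    text: str
    metadata: dict = field(default_factory=dict)


class ElementExtractor(HTMLParser):
    """Collects text inside known tags and labels it by tag name."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename
        self.elements = []
        self._current_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start buffering text when a tag we know how to classify opens.
        if tag in TAG_TO_TYPE:
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Text outside a known tag is ignored in this sketch.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Emit one element when the tag we were buffering closes.
        if tag == self._current_tag:
            text = "".join(self._buffer).strip()
            if text:
                metadata = {"filename": self.filename, "filetype": "text/html"}
                self.elements.append(Element(TAG_TO_TYPE[tag], text, metadata))
            self._current_tag = None


def partition_html(html, filename):
    """Return a list of Element objects extracted from an HTML string."""
    parser = ElementExtractor(filename)
    parser.feed(html)
    parser.close()
    return parser.elements


SAMPLE_HTML = """
<html><body>
<h1>Quarterly Report</h1>
<p>Revenue grew in the third quarter.</p>
<ul><li>Hire two engineers</li><li>Launch the beta</li></ul>
</body></html>
"""

elements = partition_html(SAMPLE_HTML, "report.html")

# Filtering by element type, one of the operations mentioned in this lesson.
titles = [e.text for e in elements if e.type == "Title"]

for e in elements:
    print(e.type, "|", e.text, "|", e.metadata)
```

Once every source has been reduced to the same kind of element list, the downstream application no longer needs to know whether the text came from HTML or something else. A PDF would need a completely different extractor driven by visual cues, but it could emit the same `Element` shape.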