Now that you have learned all these techniques, in this lesson you will put them all together and build a RAG bot. Let's code and innovate.

Okay, now you're ready to build your own RAG bot. What you want to do is take a corpus of documents, with a bunch of different document types, that talk about the Donut model that you learned about earlier in the course. You'll preprocess all of these document types, load them into a vector database, and set up a chain so you can ask questions about the content. Once you've asked a question to the LLM, you'll get a response.

Now you're ready to begin. You can start by, again, importing a few helper functions. Many of these functions you've already seen throughout the course. For this application, you'll also need a new function, partition_md, because this corpus will include Markdown files. If you wanted to include other document types, such as Word documents, in the corpus, you could include another function, such as partition_docx, as well. You can also set up your Unstructured client, because this corpus will contain PDF documents in addition to documents that require rules-based parsing.

Now you're ready to build your RAG bot on top of the Donut model documentation. The first document you'll include is a PDF of the Donut paper from arXiv. This is a PDF that includes complex tables. You'll also include a PowerPoint deck that contains information about the Donut model, along with the README from the GitHub repo for the Donut model, which is in Markdown format.

You can get started by preprocessing the PDF. You can apply what you learned earlier in the course and use the Unstructured API to call YOLOX, a document layout detection model, to preprocess this document. As you learned earlier in the course, this is an expensive model-based workload, so if it takes a few minutes to process, don't worry.

Now that you've preprocessed your PDF, you can take a look at the results. Here's an example of one of the PDF elements that you just processed, in this case a header. You may also want to take a look at some of the tables you'll be able to query over once you've assembled your RAG application. Again, the HTML representation of the table is in the text_as_html field within the metadata object. You can take a look at one of the tables in the document here.

This PDF also contains some information that you may not be interested in querying over. You can use some of the metadata that you extracted while preprocessing to filter out unwanted content. First, you can filter out the references section. The references section does not contain narrative content, and so for this application you may not want to query over it. To find elements that belong to the references section, you can apply what you learned in the metadata lesson. In particular, you can find elements that are nested under the references element by using the parent_id metadata field. First, you can get the references title. When you convert the references title element to a dictionary, you can see its element_id. That's what you can use to filter out elements that belong to the references section. First, you save that ID. Then, once you have that ID, you look for elements that have that element_id as their parent_id. You can take a look at a few of those elements here. As you can see, when you filter based on that ID, you get the items within the references section. And so, to get rid of those, because you don't want to search over them within your application, you can just use a filtering operation.
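To make this step concrete, here's a minimal sketch of the PDF preprocessing and references filtering, assuming the same setup used earlier in the course: an UnstructuredClient named s pointed at the Unstructured API, the dict_to_elements helper, and a hypothetical file name donut_paper.pdf. Exact parameter names can differ between unstructured_client SDK versions.

```python
from unstructured_client.models import shared
from unstructured.staging.base import dict_to_elements

# Preprocess the PDF with the hi_res strategy, which runs the YOLOX
# document layout detection model (the slow, model-based step).
with open("donut_paper.pdf", "rb") as f:  # hypothetical file name
    files = shared.Files(content=f.read(), file_name="donut_paper.pdf")

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
    pdf_infer_table_structure=True,
)
resp = s.general.partition(req)           # s is the UnstructuredClient set up earlier
pdf_elements = dict_to_elements(resp.elements)

# Inspect one element, and one extracted table (its HTML lives in text_as_html).
print(pdf_elements[0].to_dict())
tables = [el for el in pdf_elements if el.category == "Table"]
print(tables[0].metadata.text_as_html)

# Find the "References" title, grab its element_id, and drop every element
# nested under it via the parent_id metadata field.
references_title = [
    el for el in pdf_elements
    if el.text == "References" and el.category == "Title"
][0]
references_id = references_title.id

pdf_elements = [
    el for el in pdf_elements
    if el.metadata.parent_id != references_id
]
```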
Now your element set does not contain any elements from the references section. When you previewed the document, you may have noticed the header information on some of the pages. The header information looks like this: it has the title of the document at the top of each page, along with the page number. That information breaks up the narrative structure of the document, and you may want to exclude it when you're searching over this document in your application. To do this, you can simply filter on the category metadata field so that you remove the headers from the output. Once you've filtered out these elements, your element set no longer contains headers, which, again, contain information that you're not interested in querying over.

Now you're ready to preprocess your PowerPoint deck. Just like you learned in an earlier lesson, you simply need to apply the partition_pptx function from the unstructured open source library. Similarly, you can use the partition_md function from the unstructured open source library to partition your Markdown file.

Now that you've preprocessed all of your documents, you can combine them into a single corpus and chunk them using the chunk_by_title function that you learned about during the metadata and chunking lesson. After chunking the documents, you can load them into a vector database. In this case, you can use utilities from LangChain. For this application, you may want to search for content that belongs to a specific file type within the corpus, and so when you load the documents into the vector database using the LangChain utility, you can include the source as a metadata field. Once you've created documents to load into the vector database, the next step is to embed those documents, in this case using OpenAI embeddings. Then you can use the from_documents method on the Chroma vector store object from LangChain to load the documents. When the documents load, they'll run through the embedding process, and once they're embedded, they'll be uploaded into the Chroma vector database.

Now you can set up a retriever to search over the database. In this case, you'll search on similarity, and you'll retrieve six results before building your prompt to pass to the LLM. Okay, now your vector database is set up. The next step is to set up a prompt template. In this case, you can use LangChain to help manage your prompt template. Here, you've created a prompt that instructs the LLM to say "I don't know" if it doesn't know the answer to the question you're asking. I'd encourage you to play around with the template on your own and see how it affects the results.

Now that you have your template set up, you're ready to query your LLM. To do this, you'll use the ConversationalRetrievalChain from LangChain. If you're unfamiliar with that, don't worry. You can learn more about it through other courses at deeplearning.ai. Now that you've instantiated your chain, you can ask a question. In this case, you can ask, how does Donut compare to other document understanding models? The model has given a response, and it has correctly indicated that Donut is a document understanding model that does not rely on OCR, in contrast to other document understanding models.
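Here's a compact sketch of the rest of the pipeline: removing the headers, partitioning the PowerPoint deck and the Markdown README, chunking by title, loading the chunks into Chroma with a source metadata field, and querying through a ConversationalRetrievalChain. The file names are hypothetical, and the imports follow the legacy langchain package layout used in this course, so they may need adjusting for newer LangChain releases.

```python
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.md import partition_md
from unstructured.chunking.title import chunk_by_title

from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain

# Drop the repeated page headers, then partition the other document types
# with rules-based parsers.
pdf_elements = [el for el in pdf_elements if el.category != "Header"]
pptx_elements = partition_pptx(filename="donut_slides.pptx")   # hypothetical name
md_elements = partition_md(filename="donut_readme.md")         # hypothetical name

# Combine the corpus and chunk it by title.
elements = chunk_by_title(pdf_elements + pptx_elements + md_elements)

# Build LangChain documents, keeping the file name as "source" metadata
# so answers can be attributed (and later filtered) by origin.
documents = [
    Document(page_content=el.text, metadata={"source": el.metadata.filename})
    for el in elements
]

# Embed with OpenAI and load into a Chroma vector database.
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

# Similarity search, returning six chunks per query.
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6},
)

# Prompt that tells the model to admit when it doesn't know.
template = """You are an assistant answering questions about the Donut
document understanding model. Use only the context below. If you don't
know the answer, just say "I don't know."

{context}

Question: {question}
Helpful Answer:"""
prompt = PromptTemplate.from_template(template)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=OpenAI(temperature=0),
    retriever=retriever,
    combine_docs_chain_kwargs={"prompt": prompt},
)

result = qa_chain.invoke({
    "question": "How does Donut compare to other document understanding models?",
    "chat_history": [],
})
print(result["answer"])
```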
In addition, because you included metadata about the file name when you uploaded the documents to the vector database, the model can cite the sources of this information, in this case the Donut paper and the slide deck describing Donut.

You may also be interested in finding information from a specific source within the corpus. You can apply what you learned about hybrid search to accomplish this. In this case, you may want information that you know is contained within the donut_readme. You can set up a retriever that filters on one of the metadata fields you extracted while preprocessing the documents, the file name. With that retriever, you can set up a new chain, and then you can run a new query. In this case, you can ask, how do I classify documents with Donut? Now the model has responded with content that was contained within the README for the Donut model. You can try this on your own, play around with asking the model a few other questions, and maybe try including other metadata fields to filter on.

Now that you've completed your RAG bot, try to improve it on your own. Maybe include a few of your own files, try asking some other questions, or include some other metadata fields in your hybrid search. Now you're prepared to create your own RAG bot based on information that's important to you. Happy coding!
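As a sketch of that filtered query, here's what the hybrid-style search might look like, approximated as a metadata filter on the Chroma retriever. It assumes the vectorstore, prompt, and imports from the previous sketch, and the hypothetical file name donut_readme.md stored in the source metadata field.

```python
# Retriever that only considers chunks whose "source" metadata matches
# the README file (hypothetical file name).
filtered_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "filter": {"source": "donut_readme.md"}},
)

filtered_chain = ConversationalRetrievalChain.from_llm(
    llm=OpenAI(temperature=0),
    retriever=filtered_retriever,
    combine_docs_chain_kwargs={"prompt": prompt},
)

result = filtered_chain.invoke({
    "question": "How do I classify documents with Donut?",
    "chat_history": [],
})
print(result["answer"])
```

You could swap the filter value for any other metadata field you chose to store when building the documents.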