Tables are an important element of many documents. In this lesson, you will learn how to extract tables from documents and infer their structure. Let's have some fun.

Now that you've learned how to extract information from PDFs and images using visual cues, in this lesson you will learn more advanced pre-processing techniques for handling tables. In the bottom right of the slide, you can see some financial information contained within a table inside an unstructured document. Table extraction enables RAG applications to use information that's contained within tables in unstructured documents.

Some document types, such as HTML and Word documents, contain table structure information within the document itself, for example in the table tag in HTML. For these document types, you can use rules-based parsers to extract table information. For other document types, however, such as PDFs and images, you need to use visual cues to first identify the table within the document and then process it to extract its contents. There are a few techniques you will learn about to accomplish this: table transformers, vision transformers, and OCR followed by rules-based post-processing.

First, you'll learn about table transformers. A table transformer is a model that identifies bounding boxes for table cells and then converts the output to HTML. There are two steps in this process. First, identify tables using the document layout detection model that you learned about in the previous lesson. Then, once you've identified tables, you can route them to the table transformer. One advantage of this technique is the ability to trace cells back to the original bounding boxes: you have bounding box information for each individual cell within the table. The disadvantage is that it requires multiple expensive model calls, including multiple document layout detection calls and then multiple OCR calls. You can see below what the architecture looks like for the table transformer model, and if you're interested, check out the arXiv paper linked below.

You can also extract table content from PDFs and images using vision transformers like the ones you learned about in the previous lesson. However, unlike in the previous lesson, where the target output was JSON, in this case the target output will be HTML. An advantage of this method is that it allows for prompting, is more flexible, and only involves a single model call. The disadvantage is that it's generative, so it's prone to hallucination, and you don't retain any bounding box information. For an example of what this looks like, take a look below: you start with the image of a document, you run it through the vision transformer, and on the other side you get the text representation of the HTML.

A final technique that you can use for pre-processing documents is OCR post-processing: you OCR the table and then process the OCR output using rules-based methods. An advantage is that it's fast and, for well-behaved tables, it works well. You detect the table using a document layout detection model, OCR it, and then post-process the OCR output to construct an HTML representation of the same table.

Okay, now that you've learned a few techniques for extracting tables from unstructured documents such as PDFs and images, you can put that into practice. First, you can import a few of the helper functions that we've used in the previous lessons.
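The lesson runs the table transformer pipeline for you behind the Unstructured API, but if you'd like to see what the table transformer step looks like in isolation, here is a minimal sketch using the openly available microsoft/table-transformer-structure-recognition checkpoint from Hugging Face. The file name cropped_table.png is a placeholder for a table image you've already located with a layout detection model, the 0.7 threshold is arbitrary, and the later steps (OCR of each cell and assembling HTML) are omitted.

```python
# Sketch only: recognize table structure (rows, columns, headers) in a
# pre-cropped table image and print labeled bounding boxes.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

checkpoint = "microsoft/table-transformer-structure-recognition"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

table_image = Image.open("cropped_table.png").convert("RGB")  # placeholder path
inputs = processor(images=table_image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw model outputs into labeled bounding boxes.
target_sizes = torch.tensor([table_image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]

for label, box in zip(results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], [round(v, 1) for v in box.tolist()])
```

Each detected row, column, and header comes back with its own bounding box, which is exactly the traceability advantage described above; the trade-off is that you still need OCR over each cell region to recover the text.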
Since table extraction is model-driven, you'll again use the Unstructured API, because it takes care of all of the model setup for you. In the previous lesson, you saw examples of fairly simple PDFs. Now you'll pre-process a more complex document that includes tables and images. You can see what this document looks like below. In particular, you'll be extracting the content from Table 1 within this page of the document.

Now that you've checked out the document, you can pre-process it by passing it to the Unstructured API. In this case, you should take note of a few parameters that we're using. The infer_table_structure=True parameter tells the API that we want to extract table structure information. The skip_infer_table_types parameter instructs the API that we don't want to skip table extraction for any document types; in this case, we're interested in the tables. Now you're ready to run your API call. Remember, this process requires multiple model calls and may take a few minutes.

Once the API has completed your request, you can filter down to just the tables in the document using a filter operation similar to the ones you've seen earlier in the course. In this case, we're going to look for elements that have the category Table. Now that you have your table element, you can see the text of the table using the text attribute on the element. As you learned earlier, however, it's also helpful to have an HTML representation of the table so that you can pass the information to an LLM while maintaining the table structure. This information is available in the text_as_html field within the element metadata for tables. You can use this code block to view what the HTML in the metadata field looks like. As you can see, this is an HTML representation of the table in the document. If you're interested in displaying this table, you can also do that using the HTML display function within IPython.

Once you've extracted the table content from the document and converted it to HTML, it can be helpful to summarize these tables so that you can search over them when you perform similarity search within a RAG architecture. To do this, we will use a few utilities from LangChain. Once you have imported these helper functions, you can instantiate the summarization chain and then summarize the table HTML content. As you can see, the model successfully summarized the table. You now have an HTML representation of the table as well as a summary that you can use to search over when you perform similarity search within your RAG system. You'll find self-contained sketches of the extraction and summarization steps at the end of this lesson.

Now that you know how to extract table content from PDFs and images, try it on a few of your own files. If you're interested in playing around with a few parameters, also try changing the hi_res_model_name parameter to chipper so that you can compare vision transformer outputs to table transformer outputs. In the next lesson, we'll put all of your skills together to build a RAG bot of your own that works on a variety of data types.
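To tie the extraction steps together, here is a minimal, self-contained sketch of the workflow described above. The course notebook sends these parameters to the hosted Unstructured API through helper functions; this sketch instead uses the open-source unstructured library's partition_pdf function, which accepts the same table-related options and returns the same kind of element objects, so treat it as a local approximation rather than the notebook's exact code. The file name example.pdf is a placeholder.

```python
# Sketch, assuming the open-source `unstructured` library (with hi_res
# dependencies) is installed; the hosted API accepts equivalent parameters.
from unstructured.partition.pdf import partition_pdf
from IPython.display import HTML, display

elements = partition_pdf(
    filename="example.pdf",        # placeholder document
    strategy="hi_res",             # use layout detection and table models
    infer_table_structure=True,    # extract table structure, not just text
)

# Filter down to just the table elements, as in the lesson.
tables = [el for el in elements if el.category == "Table"]

first_table = tables[0]
print(first_table.text)                           # plain-text table content
table_html = first_table.metadata.text_as_html    # HTML preserving structure
display(HTML(table_html))                         # render it in the notebook
```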
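For the summarization step, the transcript mentions "a few utilities from LangChain" without naming them, so the sketch below is one reasonable way to build such a chain rather than the notebook's exact code: the prompt wording, the gpt-3.5-turbo model choice, and the prompt | llm | parser composition are illustrative assumptions, and an OpenAI API key is required. It takes the table_html string produced above and returns a short summary you can index for similarity search alongside the HTML.

```python
# Sketch of a table-summarization chain using LangChain's expression language.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Summarize the following table, given as HTML, in a few sentences "
    "so that it can be matched against search queries:\n\n{table_html}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)  # illustrative model choice
summarize_chain = prompt | llm | StrOutputParser()

summary = summarize_chain.invoke({"table_html": table_html})
print(summary)  # store this summary for retrieval; keep the HTML for the LLM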