Some document types, like PDFs and images, require models to pre-process them. In this lesson, you'll learn about document image analysis techniques, such as document layout detection and vision transformers, and how to use these techniques to process PDFs and images. Let's get coding.

So far, you have learned how to pre-process documents using the structured information available within the document itself. However, some documents, such as PDFs and images, don't have that structured information. For these documents, you need to use visual information to understand the structure of the document. In this lesson, you will learn about document image analysis, which allows us to extract formatting information and text from the raw image of a document. You'll look at two document image analysis methods: document layout detection and vision transformers. Document layout detection uses an object detection model to draw and label bounding boxes on a document. Once those bounding boxes are drawn and labeled, the text gets extracted from within each box. By contrast, vision transformers take a document image as input and produce text as output. These models can be trained to produce a structured output, like JSON, as the output text. As a note, vision transformers can optionally include a text prompt, just like an LLM transformer can.

Now let's learn about document layout detection. Document layout detection requires two steps: first, identifying and categorizing bounding boxes around elements within the document, and then extracting text from within those bounding boxes. So for the first step, you'll draw bounding boxes around each element within the document, such as narrative text, titles, or bulleted lists. Once you've identified and labeled those bounding boxes, you need to get the text out. Depending on the document type, there are two methods for doing this. In some cases, the text is not available from within the document itself. In those cases, you'll need to apply techniques such as optical character recognition, or OCR, to extract the text. In other cases, such as in some PDFs, the text is available within the document itself. You can use the bounding box information to trace the bounding box back to the original document and extract the text content that falls within it. You can take a look below at what the architecture looks like for the YOLOX model, which is one of the models frequently used for document layout detection. And if you're interested, check out the arXiv paper below. For an example of what document layout detection looks like in practice, take a look at the example below. We've drawn bounding boxes around the various document elements, like the title, narrative text, and tables. Once you've identified those bounding boxes and labeled them, you extract the text.

Now let's learn about another technique for document image analysis: vision transformers. In contrast to document layout detection models, which draw bounding boxes and then apply an OCR model when necessary, vision transformers extract content from PDFs and images in a single step. In this case, OCR is not required to extract the text from the image. One common architecture for vision transformers is the Donut architecture, or Document Understanding Transformer.
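To make this concrete, here is a minimal sketch of running a Donut-style vision transformer with the Hugging Face transformers library. The checkpoint (a publicly available receipt-parsing Donut model) and the image path are stand-ins for illustration, not the model used in this course; the point is the overall flow: an image goes in, the model generates a token sequence, and that sequence decodes into JSON.

```python
# A minimal sketch of Donut-style inference. The checkpoint and image path are
# assumptions for illustration; a checkpoint trained on document element types
# would follow the same generate-then-decode-to-JSON flow.
import re

from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

# Encode the document image into pixel values for the vision encoder.
image = Image.open("document_page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut checkpoints are prompted with a task-specific start token,
# analogous to the optional text prompt mentioned above.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Decode the generated tokens, strip special tokens, and convert to JSON.
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
print(processor.token2json(sequence))
```

A checkpoint trained for document pre-processing would use its own start token and output schema, but the inference pattern would look much the same.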
When you're applying a model such as the Donut model, you can train the model to produce a valid JSON string as output, and that JSON string can contain the structured document output that we're interested in. You can see below what the architecture looks like for the Donut model. And again, if you're interested in learning more, check out the arXiv paper.

Now let's take a look at what vision transformers look like in practice. You'll take an image representation of the document and pass it to the vision transformer. Once the vision transformer has processed the document, it will output a string. The string will be valid JSON, and in this case, each element in the JSON will contain the text of the element and the category of the element. Once we have this string, we can convert it to the normalized document elements that we expect from all of our various document types.

So when should I use a vision transformer, and when should I use a document layout detection model? Well, each model type has its advantages and disadvantages. For document layout detection models, one advantage is that the model is trained on a fixed set of element types, and so it can become very good at recognizing those. Second, you get bounding box information, which allows you to trace the results back to the original document and, in some cases, to extract text without running OCR. The disadvantages include that, in some cases, document layout detection requires two model calls: first for the object detection model, and second for the OCR model. Second, these models are less flexible, since you work from a fixed set of element types. Some of the advantages of vision transformers are that they are flexible for non-standard document types like forms, and so they're able to extract information like key-value pairs relatively easily. They're also more adaptable to new ontologies: whereas adding a new element type is difficult for document layout detection models, you can add a new element type to a vision transformer, potentially through prompting. Some of the disadvantages are that the model is generative, and so it is potentially prone to hallucination or repetition, just like a generative model is in natural language use cases. Second, these models tend to be much more computationally expensive than document layout detection models, and so they either require a lot of computing power or they run more slowly.

Okay, now that we've learned about some techniques for pre-processing PDFs and images, let's put them into practice. In this exercise, you'll pre-process the same document, first in an HTML representation and then in a PDF representation. You'll see how you can extract a similar set of document elements, whether you're processing the document using a rules-based technique or extracting that information based on visual cues. So let's get started. First, you'll need to import some of the same dependencies that you imported in the previous lesson. We won't spend a lot of time talking about these, except to say that since processing PDFs is a model-based workload, we're going to do that again through the API, which takes care of the model setup. Now, let's take a look at the same document represented as a PDF and as HTML. You can see what these documents look like by viewing the links below. In this case, the document is a news article about the El Niño weather pattern that was published by CNN. First, you can process the document as HTML.
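Before walking through the steps, here is a rough sketch of the calls this exercise uses, based on the unstructured open-source library. The file paths are assumptions, and the lesson itself routes the model-based step through the hosted Unstructured API rather than the local partition_pdf call shown here; parameter names such as hi_res_model_name can also differ across library versions.

```python
# A rough sketch of this exercise's flow with the unstructured open-source
# library. File paths are assumptions; the lesson sends the hi_res step to the
# hosted Unstructured API instead of running the model locally.
from collections import Counter

from unstructured.partition.html import partition_html
from unstructured.partition.pdf import partition_pdf

# 1. Rules-based processing of the HTML representation.
html_elements = partition_html(filename="example_files/el_nino.html")

# 2. The "fast" strategy pulls text directly out of a simple PDF.
fast_elements = partition_pdf(filename="example_files/el_nino.pdf", strategy="fast")

# 3. The "hi_res" strategy runs a document layout detection model (YOLOX here)
#    to draw and label bounding boxes, then extracts the text inside them.
dld_elements = partition_pdf(
    filename="example_files/el_nino.pdf",
    strategy="hi_res",
    hi_res_model_name="yolox",
)

# Compare the outputs: total counts and counts by element category.
print(len(html_elements), Counter(el.category for el in html_elements))
print(len(dld_elements), Counter(el.category for el in dld_elements))
```

When the hi_res step is routed through the hosted API instead, as in this lesson, the same strategy and model choice are supplied as request parameters, which is where the hi_res_model_name parameter mentioned below comes in.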
To do that, you take the file name and pass it into the same partition_html function from the unstructured open-source library that you used in previous lessons. If we pass the file name into that function, we get the HTML elements. Now you can take a look at those elements. As you can see, it's identified some narrative text and some titles, in this case using the HTML tags and some natural language information about the text itself.

Now you can pre-process the PDF representation of the same document. First, try the fast strategy, which extracts text directly from the document and can be used for simple PDFs like this news article. If you pre-process this PDF using the fast strategy, you'll get document elements pulled directly from the text within the PDF. You can also pre-process the document using a document layout detection model. In this case, you'll use the unstructured API to preprocess the document using the YOLOX model, which will draw bounding boxes around each of the elements and extract the text within them. You supply this information to the unstructured API using the hi_res_model_name parameter. Please be patient; since this is a model-based workload, processing could take a few minutes. Once your API call has completed, you can see what the outputs look like. Again, you can see that the model identifies titles and narrative text, and the output is very similar to the HTML and fast strategy outputs.

Once you have these outputs, you can compare them and see how similar they are. The HTML outputs have 35 elements, and you can also count the elements by type. Here you can see that there are 23 narrative text elements and 10 title elements. The document layout detection outputs have 39 elements, including 28 narrative text elements and 10 title-like elements of various kinds, including Header and Title in this case. So you can see the outputs are not exactly the same, but they're pretty close. It doesn't matter whether this document is represented in PDF form or HTML form; you'll get almost the same output and can treat the documents the same in your application.

Now that you know how to preprocess PDF documents, you can try preprocessing a PDF document of your own. In the next lesson, you'll learn about preprocessing PDFs with more complex layouts, including tables. See you there!