In this lesson you will learn how multimodality is used in industry by implementing real-life examples. You will analyze image content like invoices and flowcharts to generate structured output in different formats. All right. Let's get building.

In this lesson, you are going to work on three different applications of multimodality in industry. In the first example, the input is an image of structured data, like a receipt or an invoice, and you extract structured fields and values from this image into a JSON format. In the second example, you start with a table from a company's investor deck and extract a markdown tabular representation that can then be processed and used. In the third example, you'll get a vision language model to reason over a logical flowchart and get it to output text, or even Python code, that implements that logical flow. All right, let's code.

Here you start with pretty much the same setup as in the previous lessons. So we need to ignore the warnings, load the vision API keys, and then set up the genai library, and that's good. And again, you're going to use pretty much the same helper functions: one that converts an output to markdown, but there is a slight modification to the call_LLM function, where you have a plain_text boolean. The purpose of this is that, depending on what we want, we may want to return the output as just plain text, or render it as markdown, which will come in handy later on.

As the first example, you're going to use the vision model to analyze this invoice. So let's ask a question: given the invoice file, let's try to identify the items on the invoice and output the results as JSON, looking for quantity, description, unit price, and amount. So let's run this and see what the vision model can extract. And, oh wow, okay. For the second item, "new set of pedal arms", we see a unit price of 15 and a total amount of 30, which matches exactly what we have on the table. And it's pretty accurate. So, yeah, this is pretty awesome.

How about asking a reasoning question based on the input? Maybe you could check how much it would cost for four sets of pedal arms and also six hours of labor, which is actually quite interesting because the vision model needs to extract those from the description and then calculate the price. So let's run this and see how well it performs on a calculation like this. And, well, look at this. It was able to deduce the price for the pedal arms, and the thing I thought would be the trickiest, the six hours of labor at $5 per hour, it was able to calculate correctly as a total of 30. So the full thing would cost us $90: four sets of pedal arms at $15 each is $60, plus $30 of labor. Pretty awesome.

For the second example, you're going to analyze this table, which has financial information about different business units. You can see that there is some info around revenue, margins, and year-on-year growth. So let's give it a prompt. The first task is to print the contents of what we see here as a markdown table, so it should be able to analyze every single part that we see, but also format it in a nice structure. If you look at just the first item, food delivery, it matches: 17%, 12pp, and 15%. So this was pretty accurate. And now let's try something more. Let's run a similar query as before, but this time a reasoning question: we want to find which of the business units has the highest revenue growth. A rough sketch of the calls used so far follows below.
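Here is a minimal sketch of what the setup, the helper, and the prompts used so far might look like. This is an illustration assuming the google.generativeai SDK and a Gemini vision model; the call_LLM reconstruction, the plain_text flag's behavior, the model name, and the file names (invoice.png, table.png) are assumptions based on this walkthrough, not the course's exact code.

```python
import google.generativeai as genai
from PIL import Image
from IPython.display import Markdown

genai.configure(api_key="YOUR_API_KEY")  # assumed: the lesson loads this from the environment

def to_markdown(text: str) -> Markdown:
    # Render model output as markdown in the notebook.
    return Markdown(text)

def call_LLM(image_path: str, prompt: str, plain_text: bool = False):
    # Hypothetical reconstruction of the lesson's call_LLM helper:
    # send an image plus a text prompt to a vision model.
    model = genai.GenerativeModel("gemini-pro-vision")  # assumed model name
    image = Image.open(image_path)
    response = model.generate_content([prompt, image])
    # plain_text toggles raw text vs. rendered markdown, as described above.
    return response.text if plain_text else to_markdown(response.text)

# Example 1: extract invoice line items as structured JSON.
call_LLM(
    "invoice.png",
    "Identify the items on this invoice and output them as JSON with "
    "the fields quantity, description, unit_price, and amount.",
    plain_text=True,
)

# Example 1b: a reasoning question over the same image.
call_LLM(
    "invoice.png",
    "How much would it cost for four sets of pedal arms and six hours of labor?",
    plain_text=True,
)

# Example 2: reproduce the table as markdown, then reason over it.
call_LLM("table.png", "Print the contents of this table as a markdown table.")
call_LLM("table.png", "Which business unit has the highest revenue growth?")
```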
And this should take just a moment. Now that we see the results, we can see that classifieds, with 32%, is the one the model picked. And if we scan the table quickly, we can see that this is, in fact, the correct answer. It's actually on par with Payments and Fintech, but it's still the right answer. So two thumbs up from me.

As the final example, you're going to use the vision model to analyze this flowchart. So let's try to ask it some questions. To analyze the flowchart, let's provide a prompt like this one, where you ask the vision model to provide a summarized breakdown of the flowchart in the image in the format of a numbered list. And here we get, one by one, each of the items that we can see, going from the start to the end. So if we look at this, we start with a customer that places an order, which then goes into a payment, which is part of the invoice, and so on and so on.

And you can push the vision model even further: you can ask it to analyze the flowchart in the image and even output the code that would implement it as a single function (a sketch of these two prompts appears at the end of this lesson). And the thing is, because there is some randomness to vision models, if we rerun the same call, we may get a different function each time. On my second attempt, I got very different code, and this one is a lot more detailed. And if I scroll down, you can see that the function was implemented in its entirety. So, technically, we should be able to execute it in a separate cell. So let's see if this function even runs. I ran it, and it worked, though it would probably break later on because it expects things like a client object. But at the very least, this is a good start that we could take as the basis to implement our own code, and this is actually quite powerful. It could be quite helpful for many of us.

In summary, in this lesson you learned how to use a vision model on different types of images, asking it to provide explanations and extract very different kinds of information, but also adding additional reasoning commands, to see if you could push these vision models further, like calculating new values based on the original ones, which is very powerful. And I encourage you to play with this yourself and try all these things. In the next lesson, you'll learn about a multimodal recommender system, in which you'll use two different models side by side to perform search across different modalities. Great. I'll see you there.
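As referenced above, here is a minimal sketch of the two flowchart prompts from the final example. It reuses the hypothetical call_LLM helper from the earlier sketch; flowchart.png and the exact prompt wording are assumptions, and the model's generated function will vary from run to run.

```python
# Step 1: summarize the flowchart as a numbered list of steps.
call_LLM(
    "flowchart.png",
    "Provide a summarized breakdown of the flowchart in this image "
    "in the format of a numbered list.",
)

# Step 2: ask for code instead of a description.
generated_code = call_LLM(
    "flowchart.png",
    "Analyze the flowchart in this image and output Python code that "
    "implements this logical flow as a single function.",
    plain_text=True,
)
print(generated_code)
# Sampling is non-deterministic, so rerunning this cell can return a
# different implementation each time; review the generated code before
# pasting it into a separate cell and executing it.
```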