Hi! In this lesson, you will explore computer vision models and play with some of them to build a cool application. In particular, you will build an assistant that can help a visually impaired person understand what is in a picture. Let's get started!

A quick heads-up before starting the lesson: in this classroom, the libraries have already been installed. But if you want to run the notebook on your own machine, make sure to install the required libraries, which you can install with the command shown below.

First of all, we want to import the utility methods that we're going to use for this lesson. Make sure to import from the helper module the methods called load_image_from_url and render_results_in_image. We're going to quickly do that before starting.

For this lesson, we're going to focus on a specific task in computer vision called object detection. First, we're going to load a pipeline object, as we've been doing in our previous lessons. We'll import pipeline from transformers and load the object detection pipeline, which we'll call od_pipe, using a model from Facebook called detr-resnet-50. Just run the following to load the pipeline.

Before moving forward, I wanted to give some insight into the specific task we're going to focus on. So what is object detection? The task of object detection simply consists of detecting objects of interest in a specific image. For example, as you can see in this image, the object detection model is able to detect all the relevant objects in the picture. One thing to notice is that object detection combines two subtasks: classification and localization. For each object detected in an image, you have to provide both the label of the instance and the localization of the detected object.

You may be wondering how we chose this model. Well, we picked the detr-resnet-50 model from Facebook, but today there are many state-of-the-art object detection models available in the AI ecosystem. You can simply browse the Hugging Face Hub, use the object detection filter that you can see on the left, and get all the available object detection models that you can freely download from the Hub. You can then use a few metrics to select your model: for example, the number of downloads or the number of likes, and sometimes the authors provide the evaluation metrics of their models on specific datasets directly on the model cards. Those are, I would say, the important signals to use when selecting a model for your task. For this lab, we're going to use the Facebook detr-resnet-50 model.

Now that we have loaded the pipeline, let's start using it directly by loading an image that we have prepared for you. This is an image that we took all together in a restaurant while filming this course. We're going to see which objects the model detects in this image. I invite you to take a few seconds to make yourself familiar with the pipeline.
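Here is a minimal setup sketch of the steps described above. The helper module name and its functions (load_image_from_url, render_results_in_image), the pip packages in the comment, and the local image file name are assumptions based on this lesson's classroom environment and may differ in your own setup.

```python
# Setup sketch. On your own machine, install the dependencies first, for example:
#   pip install transformers timm gradio
from transformers import pipeline
from PIL import Image
from helper import load_image_from_url, render_results_in_image  # classroom helpers (assumed names)

# Load the object detection pipeline with Facebook's DETR (ResNet-50 backbone).
od_pipe = pipeline("object-detection", model="facebook/detr-resnet-50")

# Load an image to test with: any local picture works (the file name is a placeholder).
raw_image = Image.open("my_restaurant_photo.jpg")
```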
Make sure to pass the image into the pipeline, get the output from the pipeline, and then use the render_results_in_image function that we have provided in the helpers to render the results directly on the image. To get the results from the pipeline, simply call od_pipe on the image. So we're going to do that right now: pipeline_output equals od_pipe applied to the raw image. Once we have the predictions, I'll call render_results_in_image, passing the image and the pipeline output, and let's see the final results.

As you can see, the model has accurately predicted all the persons in the image, each with the corresponding bounding box, and you can also see the confidence score of the prediction for each detected instance. The model has also predicted the bottles, the cups, and the forks. It looks like the model was not able to detect the falafels, but maybe with some fine-tuning we'll be able to do that. Now I invite you to pause the video and try this on your own with your own images. They can be local images that you load using PIL, but you can also pass image URLs if you want to use an image from the web. Once we get the results, we call the render_results_in_image function and look at the rendered image.

Okay, so let's see how we can make this a bit more user-friendly. Let's say tomorrow you want to show it to a friend and make a nice demo using the model. We're going to use a library called Gradio to expose a simple interface. The final demo will look something like this: a simple interface where you pass an image (in earlier demos we passed a text prompt, but here the input is an image), and you get the results of the pipeline right next to the original image. That way you can easily share the demo with anyone, and anyone will be able to try the model out of the box using their own image.

Let's get started. First, we'll import Gradio. The Gradio interface expects a method that takes the input and returns the output, so we'll create a method that does everything under the hood for the user given an image. We'll define a method called get_pipeline_prediction that takes a PIL image as input and splits the work into two steps. The first step is to get the pipeline output given the PIL image; we will use the globally defined od_pipe that we loaded beforehand, so that we won't have to reload the pipeline each time the method is called, which makes it much more efficient. Then we will use the helper function we used before, render_results_in_image, on the original image with the pipeline output, and return the final processed image.

The code to make the Gradio interface is pretty straightforward. Once you have that method defined, you just have to create a gr.Interface object that takes get_pipeline_prediction as its function, properly define what the input and the output are, and call demo.launch to launch the demo. If you run this code snippet, you'll define the demo, and we will call demo.launch with share=True so that you can also share a link to the demo with anyone.
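As a rough sketch of the two steps above, calling the pipeline directly and then wrapping it in a Gradio interface, something like the following should work, assuming od_pipe, raw_image, and render_results_in_image are already defined as in the setup sketch:

```python
import gradio as gr

# Direct use of the pipeline on the image loaded earlier, then rendering
# the detected boxes and labels back onto the image.
pipeline_output = od_pipe(raw_image)
processed_image = render_results_in_image(raw_image, pipeline_output)

def get_pipeline_prediction(pil_image):
    # Step 1: run the globally loaded object detection pipeline on the PIL image.
    pipeline_output = od_pipe(pil_image)
    # Step 2: draw the predictions onto the original image and return it.
    return render_results_in_image(pil_image, pipeline_output)

demo = gr.Interface(
    fn=get_pipeline_prediction,
    inputs=gr.Image(label="Input image", type="pil"),
    outputs=gr.Image(label="Output image with predicted instances", type="pil"),
)

# share=True generates a public link, but the app keeps running on your machine.
demo.launch(share=True)
```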
So once you have the demo app, you can load any image from your local computer and pass it to the model. Let's try it out with this image and click on Submit. As you can see in this really cute picture, the model was able to detect both cats and the remotes, and also the couch, although we can't really see that label because of this icon. Feel free to try it out with your own local images. As I said, if you want to use the raw pipeline, you can also pass a URL to the pipeline directly. You can also share this link with anyone so that they can try the demo on their own computer, but please keep in mind to leave your computer open and running, because the demo will be running on your machine.

For the last part of this lab, we'll see how to make an AI-powered assistant using two different models. Let's say you have an image, like the one here. You just pass it to the object detection pipeline to get all the relevant objects in the image with their labels, and then we can perform some post-processing of the pipeline output so that we have a more natural text describing what's in the image. We can then pass that text to a text-to-speech model, which will be another pipeline, and it will generate audio that narrates the text we created in the previous step. At the end, we'll have audio saying "in this image we have this, this, and this." That's what we're going to do right now.

So we can combine the object detector model we have just used with a text-to-speech model to help us indicate and dictate what's in the image. Once we have post-processed the detections into a short sentence summarizing what's in the image, we can pass that summarized string to a text-to-speech model to dictate what's in the image. Let's try that out. We will use the same image and the same pipeline as before; we're not going to rerun everything, because we will reuse the od_pipe that we have already loaded. We just have to import the helper method called summarize_predictions_natural_language from the helpers we have provided to you, and we will see what the text looks like when we pass the pipeline output to that method. If you run it and print the text, we get "In this image, there are two forks, three bottles, two cups, four persons, one bowl, and one dining table", which is a quite accurate representation of what's in the image.
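To make this concrete, here is a small sketch of the summarization step, assuming the helper is named summarize_predictions_natural_language and that od_pipe and raw_image are the pipeline and image from earlier:

```python
from helper import summarize_predictions_natural_language  # classroom helper (assumed name)

# Reuse the already-loaded object detection pipeline on the same image.
pipeline_output = od_pipe(raw_image)

# Turn the list of detections into one natural-language sentence, e.g.
# "In this image, there are two forks, three bottles, ..."
text = summarize_predictions_natural_language(pipeline_output)
print(text)
```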
You can also take your time and inspect what's inside this method, but the logic behind it is pretty straightforward: it just combines all the outputs in the predictions array and tries to make a sentence out of them. All right. And what model are we going to use to generate the audio narration of the image? We're still going to use a pipeline, but as you saw in the previous lesson, we're going to use the text-to-speech pipeline with this specific model from Kakao Enterprise. You can read more about the model directly on its model card. So let's get the narrated audio from the text-to-speech pipeline given the text. Perfect.

In order to listen to the narration, we'll need to run a small snippet that uses IPython's Audio. Let's try it right now and listen to the result: "In this image, there are two forks, three bottles, two cups, four persons, one bowl, and one dining table." That's quite good and quite accurate. We can wrap the whole pipeline in a single method so that you just pass an image and it directly returns the narrated audio. You can also wrap that in a Gradio interface so that you can let your friends try it out on their own. I invite you to pause the video and try this with other images, maybe also with other models that you can find on the Hub, so that you can compare performance across different models, and maybe try to wrap everything in a Gradio demo so that you can share it with your friends.
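Below is a hedged sketch of the narration step and of the single wrapper method mentioned above. The model id (kakao-enterprise/vits-ljs) and the structure of the pipeline output (an "audio" array plus a "sampling_rate") are assumptions; check the model card shown in the lesson for the exact details.

```python
from transformers import pipeline
from IPython.display import Audio

# Text-to-speech pipeline with the VITS model from Kakao Enterprise (assumed model id).
tts_pipe = pipeline("text-to-speech", model="kakao-enterprise/vits-ljs")

# Narrate the summary text produced in the previous step.
narrated_text = tts_pipe(text)

# Listen to the generated waveform in the notebook.
Audio(narrated_text["audio"][0], rate=narrated_text["sampling_rate"])

def narrate_image(pil_image):
    # One method that goes from image -> detections -> sentence -> audio,
    # reusing the globally loaded od_pipe and tts_pipe.
    detections = od_pipe(pil_image)
    summary = summarize_predictions_natural_language(detections)
    speech = tts_pipe(summary)
    # Return (sample_rate, waveform), a format a Gradio Audio output can accept.
    return speech["sampling_rate"], speech["audio"][0]
```

If you pair narrate_image with a gr.Interface that has an image input and a gr.Audio output, you get a shareable demo where the description of the picture is spoken aloud instead of drawn on the image.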