In this lesson, you'll use natural language to prompt a zero-shot object detection model, OWL-ViT, where ViT stands for Vision Transformer. You'll then use the output of this model as an input to SAM, which you used in the last lesson. You'll create an image editing pipeline by chaining these two vision models, so that a user can give a natural language prompt and the pipeline outputs a segmentation mask. Let's get started.

In the last lesson, we saw how to create masks based on point prompts or bounding boxes. In this lesson, we're going to see how we can use natural text to generate these masks instead. In order to do this, we're going to be using a pipeline of models, which simply means that the output of the first model will be fed into the second model. The first model in this pipeline will be a zero-shot object detection model, followed by a SAM model that takes the bounding boxes generated by the zero-shot object detection model and uses them to generate the masks. The zero-shot object detection model we will be using is called OWL-ViT. OWL-ViT is a zero-shot object detection model, meaning it can detect objects within an image based on simple text prompts. The fact that it is a zero-shot model means that you don't need to train it in any way for it to detect an object within an image. The way we will be using OWL-ViT in this lesson is with a text prompt, i.e. a string of text, to generate a bounding box. We will not be covering in detail how the OWL-ViT model works, but we can cover the basics of how it was trained. The OWL-ViT model was trained in two phases: a pre-training phase and a fine-tuning phase. In the pre-training phase, the model learns to associate an image with a piece of text using a technique that leverages contrastive loss, and this process allows the OWL-ViT model to develop a strong understanding of both an image and its corresponding text. In order for this model to achieve good performance, it also required a fine-tuning stage. During this stage, the model is trained specifically for object detection. While in the pre-training phase the model was just learning how to associate a piece of text with an image, during the fine-tuning stage the model learns to identify objects and associate them with a particular word or string.

Now that we have covered at a high level how the OWL-ViT model works, let's jump into some code and see how we can use it in practice. Before we start leveraging the two models we will be using in this pipeline, let's create a Comet experiment, which will allow us to compare the generated masks that will be produced at the end of this pipeline. We initialize Comet here in anonymous mode, meaning you don't need to create a Comet account to get access to the Comet functionality. Now that we have initialized Comet ML, we can go ahead and create our first experiment, which will be used to track all of the work within this code walkthrough. For this lesson, we will be using the same image as we used in the previous lesson: the image of the two dogs sitting side by side. Let's first load in this image and display it within the notebook so we can remind ourselves what it looks like. In order to view this image, we will be using the Image module from the PIL library. Instead of downloading the image directly from the internet, we will be leveraging Comet artifacts to download all of the images required for this lesson.
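As a rough sketch, this setup might look like the following in code. The project name and the image filename are only placeholders I'm assuming here; in the course notebook the image is actually downloaded through a Comet artifact.

```python
import comet_ml
from PIL import Image

# Initialize Comet in anonymous mode: no Comet account is required.
comet_ml.init(anonymous=True)

# Create an experiment to track everything produced in this walkthrough.
# The project name below is just an illustrative placeholder.
exp = comet_ml.Experiment(project_name="text-to-mask-pipeline")

# Load the image of the two dogs and display it in the notebook.
# The filename is assumed; the course pulls it down via a Comet artifact instead.
raw_image = Image.open("dogs.jpg")
raw_image
```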
As you can see, we now have the same image as we used in the previous notebook. Let's go ahead and load the OWL-ViT model from the Hugging Face Hub. In order to access the OWL-ViT model, we will be using the Transformers library, and specifically the pipeline function. There are many different versions of the OWL-ViT model available on Hugging Face. For this particular lesson, you will be using the base model, OWL-ViT Base. Now that the model has been loaded, we can go ahead and use it to identify the dogs in the image. The first thing you will do is specify what you want to identify in the image, in this case a dog. You can then pass the text prompt to the detector model that you have just loaded to generate the bounding boxes of the dogs in the image. You will be passing the text prompt as a list for the candidate labels parameter of the detector model, as it is possible to pass in multiple text prompts. For example, if you had an image with both a cat and a dog, you could pass the text prompts cat and dog to identify both animals in the image. Let's take a look at the output variable. As you can see here, the model has detected two different bounding boxes for the label dog in two different positions. This is quite promising: it means the model has most likely identified the two dogs in the image. But let's use a couple of util functions to plot the bounding boxes on top of the image that we have seen above. The preprocess outputs function simply takes the output above and reformats it to make it easier to plot the bounding boxes on the image. We can then use the show boxes and labels on image function to view the bounding boxes on top of the image of the two dogs. As you can see here, the OWL-ViT model has been able to take the text input dog and generate two bounding boxes highlighting each dog in the image. It was even able to handle the overlapping bounding boxes of the two dogs. You have now successfully identified the two dogs in the image based on a simple text prompt.

You can now move on to the next step of the pipeline, which is to use these bounding boxes to create segmentation masks for each dog. The approach we will be taking is very similar to the previous lesson. However, instead of using the FastSAM model, you will be using the MobileSAM model. As you saw in the previous lesson, the SAM model, or Segment Anything Model, can be used to generate masks based on bounding boxes. The SAM model is quite large and requires a lot of computational resources in order to generate these masks. In this lesson, we will be using the MobileSAM model, which has been optimized to run on devices that might not have access to GPUs. In order to perform more efficiently, the MobileSAM model uses a process known as model distillation. Model distillation allows you to take a very large model and transfer its knowledge to a smaller model, and this allows you to run the model a lot more efficiently. Model distillation is a different technique compared to model compression techniques such as quantization, in the sense that it doesn't actually change the model format, but trains an entirely new and much smaller model. In this lesson, you will see that the MobileSAM model performs just as well as the larger SAM model. You can now load the MobileSAM model using the Ultralytics library.
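Here is a minimal sketch of these two steps: running the zero-shot detection on the dogs image loaded above, and loading MobileSAM. The checkpoint name google/owlvit-base-patch32 is one publicly available base OWL-ViT model and mobile_sam.pt is the weight file Ultralytics expects; treat both as assumptions rather than the exact values used in the course notebook.

```python
from transformers import pipeline
from ultralytics import SAM

# Load OWL-ViT through the transformers zero-shot object detection pipeline.
detector = pipeline(
    model="google/owlvit-base-patch32",
    task="zero-shot-object-detection",
)

# Detect the dogs with a natural-language prompt. Multiple candidate labels
# can be passed at once, e.g. ["cat", "dog"].
output = detector(raw_image, candidate_labels=["dog"])
print(output)
# Each detection is a dict with a score, a label, and a box
# given as xmin, ymin, xmax, ymax pixel coordinates.

# Load MobileSAM through the Ultralytics library; the weights are downloaded
# automatically if they are not already present locally.
sam_model = SAM("mobile_sam.pt")
```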
You will now use the SAM model we have just loaded, and the bounding boxes generated by the OWL-ViT model, to generate our first set of segmentation masks for each dog. Similarly to the previous lesson, we will be defining labels to specify whether each bounding box belongs to the object we would like to segment or to the background. In our case, all of the bounding boxes belong to the object we would like to segment, and therefore the labels will all be set to one instead of zero. Given we are implementing a pipeline where the output of the first model is used as an input to the following model, we will be using the numpy repeat function to simply generate an array filled with ones, where the length of the array is equal to the number of bounding boxes generated by the first model. We can then use the raw image, the bounding boxes, and the labels to generate segmentation masks using the MobileSAM model. As you can see, we have been able to create segmentation masks in under a second using the highly optimized MobileSAM model. The model's predict function returns a results object that contains not just the masks, but also the original image and additional metadata. We can then extract the masks from this object using the following code. The masks are simply arrays of True and False booleans, which indicate whether a particular pixel belongs to the background of the image or to the mask that we have generated. In order to visualize the masks you have just created, we can use the show masks on image util function that we have created for you. As you can see, the MobileSAM model has been able to create two very accurate masks, one for the dog on the right and one for the dog on the left. You have now successfully used the OWL-ViT model to go from a text prompt to a bounding box, and the SAM model to go from the bounding box to the generated masks. Putting these two models together means that you now have a pipeline to go directly from text to a masked image.

You can now go one step further and use the generated masks to edit the image. Let's try a different use case, where we would like to blur out people's faces in an image. You will start by loading in the image to see which faces you have to blur. As you can see, we have an image here with five different people, and we will be blurring out each of their faces. Before using the OWL-ViT and MobileSAM models to perform the image editing, we will first resize the image to be just 600 pixels wide. This will allow us to run the entire pipeline in a more time-efficient manner. As you can see, the resized image is very similar to the high-resolution image and has simply been downsampled. This process of resizing images is very useful if you're working with high-definition images. Let's use a pipeline similar to the one we used for isolating the dogs in the previous image to isolate the faces in this image. The first step will be to define what the text prompt should be. In this case, we will be using the text prompt human face. As we are starting a new model pipeline, let's create a new Comet experiment to keep track of all of the different steps of this pipeline. You will start by creating a new experiment and logging the raw image to this experiment. You can now use the OWL-ViT model to create the bounding boxes of the faces in the image. Let's take a quick look at the bounding boxes generated to see if the OWL-ViT model has been able to identify any faces in the image. As you can see here, five different bounding boxes have been found by the model.
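As a rough sketch, the face-detection setup just described might look like the following, reusing the detector loaded earlier. The filename, the variable names, the project name, and the use of PIL for resizing are assumptions on my part; the 600-pixel width and the human face prompt come from the walkthrough.

```python
import comet_ml
from PIL import Image

# Load the photo of the five people. The filename is assumed.
face_image = Image.open("people.jpg")

# Resize to 600 pixels wide, keeping the aspect ratio, so the pipeline runs faster.
width, height = face_image.size
face_image = face_image.resize((600, int(height * 600 / width)))

# Start a fresh experiment for this new pipeline and log the raw (resized) image.
# (The earlier experiment can be closed first with exp.end().)
face_exp = comet_ml.Experiment(project_name="face-blurring-pipeline")
face_exp.log_image(face_image, name="raw_image")

# Detect the faces with a natural-language prompt.
face_detections = detector(face_image, candidate_labels=["human face"])
print(face_detections)
```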
This makes sense, as there are indeed five different faces in the image. In order to keep track of each step in the model pipeline, let's log the image and the bounding boxes to the Comet platform. We have created the make bounding boxes annotation util function to make it easier for you to log these bounding boxes to the Comet platform. You can now use the MobileSAM model with the bounding box prompts as an input to generate the segmentation masks. Now that we have generated the segmentation masks, we can go ahead and blur out the faces in the original image. In order to do this, we will start by creating a version of the image that is entirely blurred. We will then use the segmentation masks created by the SAM model to replace only parts of the original image with this blurred image. You will start by combining the masks generated by the MobileSAM model into a single mask, which we will then use to replace parts of the image. Now that you have the single mask, you can go ahead and use it to replace parts of the image with the blurred-out image. As you can see, the faces have been successfully blurred out in the image, and you can go ahead and log this image to the Comet platform. You have now successfully used the text prompt human face to blur out all of the human faces within the image.

What if you wanted to blur out not all of the faces within this image, but only the faces of the individuals who are not wearing sunglasses? Let's go ahead and give that a try. You can go back to the part of the code where we defined what the candidate label should be, and replace the text prompt human face with a person without sunglasses, for example. Let's go ahead and rerun all of the different parts of the pipeline to see if the model is able to blur out only the faces of the individuals who are not wearing sunglasses. As you can see, the model has generated a single bounding box, whereas there are four individuals who are not wearing sunglasses. This might indicate to us that something is not performing as expected with this model. Let's continue with the model pipeline and then switch over to the Comet platform to better understand what is happening. After running all of the different steps of the pipeline, you can see that the models have blurred out only the sunglasses in the image, rather than what we had originally intended. The model pipeline is clearly not performing as expected, so let's switch over to the Comet platform to better understand where it is failing.

You are now in the Comet platform, which you can use even if you do not have a Comet account. Let's start by comparing all of the different images that we have generated side by side. To do this, you can add a new panel, select the image panel, and choose the images you would like to compare: in this case, the raw image, the OWL-ViT image, and the final blurred image. Let's dig into one of the experiments to better understand what is happening. In the original code walkthrough, you used the text prompt human face, and as you can see, the OWL-ViT model successfully identified all of the different faces in the image. As a result, the MobileSAM model was able to segment each face and replace it with its blurred version, as you can see here.
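Before digging into the second prompt, here is a rough sketch of the masking and blurring steps described above, following the same pattern as the dogs example. It assumes the face_detections, face_image, face_exp, and sam_model variables from the earlier sketches; the blur radius and the manual compositing are illustrative choices rather than what the course util functions actually do.

```python
import numpy as np
from PIL import Image, ImageFilter

# Reformat the detections into [xmin, ymin, xmax, ymax] boxes for SAM.
bboxes = [
    [d["box"]["xmin"], d["box"]["ymin"], d["box"]["xmax"], d["box"]["ymax"]]
    for d in face_detections
]

# Every box marks an object we want to segment, so the labels are all ones.
labels = np.repeat(1, len(bboxes))

# Generate one mask per bounding box with MobileSAM.
results = sam_model.predict(face_image, bboxes=bboxes, labels=labels)
masks = results[0].masks.data.cpu().numpy().astype(bool)  # shape: (num_boxes, H, W)

# Create a fully blurred copy of the image, combine the per-face masks into a
# single boolean mask, and copy the blurred pixels only where the mask is True.
blurred = face_image.filter(ImageFilter.GaussianBlur(radius=15))
combined_mask = np.any(masks, axis=0)

image_array = np.array(face_image)
blurred_array = np.array(blurred)
image_array[combined_mask] = blurred_array[combined_mask]

# Build the edited image and log it to the Comet experiment.
edited_image = Image.fromarray(image_array)
face_exp.log_image(edited_image, name="blurred_faces")
```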
For the second prompt that you used, a person without sunglasses, the OWL-ViT model created a bounding box around the sunglasses rather than around the faces of the individuals in the image. As a result, the SAM model then blurred out the sunglasses themselves rather than the faces of the individuals not wearing sunglasses. Comet allows you to visualize a number of images. As you can see here, we did not log an image with the bounding boxes already drawn on it to the Comet platform. Instead, we logged the image and the bounding boxes separately, and the Comet platform combines the two for us here. Using Comet, you can even turn off the bounding boxes to view the image on its own. When logging data to the Comet platform, you do not have to log a static image; logging the image and the bounding boxes separately allows you, in the Comet platform, to filter the bounding boxes based on the confidence score of the model, for example. You can change the confidence threshold here from 74.3 to 80 to better see which bounding boxes the model is most confident in. And as you can see, the model has different confidence scores depending on the bounding box that has been generated. You can even turn off the bounding boxes altogether.

Now that you have used the model pipeline a couple of times, why not tweak the text prompt to get the model to blur out the faces of all of the individuals in this image who are not wearing sunglasses? Or, maybe even better, why not try it with one of your own images? We have also added a couple of images to this notebook for you to try out the techniques you have learned on new images. You could use the image of people in a cafe and blur out their faces. Or why not blur out the faces of people walking on the street or in the metro? Even better, you could blur out the faces of the people in this photo. In this lesson, you have created your first pipeline, using the output of one model, OWL-ViT, as the input to a second model, in this case MobileSAM. In the next lesson, you will be looking at how to use Stable Diffusion to perform inpainting. Let's go on to the next lesson.