You can prompt Stable Diffusion with text, but also with images and masks. Additionally, you can tune some hyperparameters, such as the guidance scale, the strength, and the number of inference steps, to better control the diffusion process. Let's generate some images.
In previous lessons, you learned how to use computer vision models to extract information from your images in the form of bounding boxes and segmentation masks. In these next lessons, you're going to learn how to use computer vision models to add information to your images using diffusion models. Before we dive in, let's first review a little bit about how diffusion models work under the hood. If you're interested in going deeper, I recommend going through DeepLearning.AI's Introduction to Diffusion Models course. Note that this course is not going to get into the mathematics of diffusion models. We're aiming to give a high-level overview of how they work, so that you can do these exercises productively.
Diffusion models, like the one you'll be using in this lesson, are really incredible. They're capable of generating unbelievably detailed images from text prompts. They do this via a process called diffusion, which, at a very high level, works like this: the model starts with an image of pure Gaussian noise, then denoises the image gradually, one step at a time. Eventually the model removes enough noise to leave us with an attractive image. In a way, we're kind of lying to the model. We give it an image of pure noise, but we tell it that underneath the noise there's actually an image of a dog. Intuitively, you can think of the model as chiseling the image out of that noise. Mathematically, the model is transforming a sample from a simple distribution, like Gaussian noise, into a complex learned distribution, like images. Our text prompt shifts the target distribution, guiding the model toward chiseling out an image that satisfies the prompt. The model that does the actual denoising is a U-Net, and we use a text encoder to generate the prompt embeddings that guide our generation.
The model you'll be using, however, can take more than just text as a prompt. You can also provide it with images, masks, and various hyperparameters to better control the diffusion process. You can even direct the model to take an initial image and use diffusion to edit just a specific portion of it via a technique called inpainting. In this lesson, you're going to learn to do exactly that. Let's dive into the first exercise.
To motivate this exercise, imagine that one of your less computer-savvy friends asked you to edit a photo of them. They gave you a photo of themselves leaning over a cat and asked you to replace the cat with a dragon. You could spend days using Photoshop to edit the image, or you could use a diffusion model to do it in minutes. To get started, we're going to download our masks and images from Comet as artifacts. You've already learned about artifacts previously, but to recap, artifacts are just version-controlled assets. They can be anything from models to datasets, or, in this case, your friend's photo and a segmentation mask of the cat. Once our artifact has been downloaded, we can extract the images from it and view them using matplotlib. As you can see, we now have an image of our friend leaning over a cat, and we have a segmentation mask of the cat. This is all we need to get started with our diffusion editing process.
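If you want to follow along outside the course notebook, a minimal sketch of that download-and-display step might look like the following, assuming the comet_ml SDK is already configured with your API key; the artifact name and file names here are placeholders, so swap in the ones from your own Comet workspace.

```python
import comet_ml
import matplotlib.pyplot as plt
from PIL import Image

# Download the artifact that holds the photo and the cat mask.
exp = comet_ml.Experiment()                      # reads COMET_API_KEY from the environment
artifact = exp.get_artifact("inpainting-data")   # hypothetical artifact name
artifact.download("./artifacts")
exp.end()

# Load the two assets we just downloaded (hypothetical file names).
image = Image.open("./artifacts/friend-and-cat.jpg")
mask = Image.open("./artifacts/cat-mask.png")

# Display the photo and the segmentation mask side by side.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(image)
axes[0].set_title("Original photo")
axes[0].axis("off")
axes[1].imshow(mask, cmap="gray")
axes[1].set_title("Cat segmentation mask")
axes[1].axis("off")
plt.show()
```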
Before we initialize our model, we want to import torch and set our device correctly. If we have access to GPUs, we want to use those, but if not, we can use CPUs for this. Once our device is set, we can initialize a Stable Diffusion inpainting pipeline from the diffusers library. You should notice that we're using the Stable Diffusion 2 inpainting model, which is a diffusion model that was trained specifically for these inpainting tasks. We're also using different numeric precisions depending on our hardware situation, which should save some memory.
Once our pipeline is initialized, we want to start thinking about the parameters we're going to use to generate our image. The most important, and maybe most overlooked, parameter is the seed. We want to set our seed manually so that our results are reproducible later. In addition, we'll need our prompt, our image, our mask image, and a hyperparameter that we haven't discussed yet. In this experiment specifically, we're going to be looking at the number of inference steps. With diffusion pipelines, you can control how many inference steps the pipeline takes in denoising the image. When you allow the pipeline to take more inference steps, the denoising process is more gradual, and often this can lead to a higher quality image. However, there's a point of diminishing returns where you're simply wasting your compute power. You can also end up with an image that looks a little overly processed, where the features are a little too smooth or the colors aren't exactly realistic.
Returning to our code, we're ready to generate some images. First, we'll initialize a new Comet experiment that we can use to log our images, our metadata, and our parameters. Then we'll pass our parameters into our Stable Diffusion pipeline and generate the output. From there we can extract the image, and then we can log the image, the prompt, the seed, and the number of inference steps, all to Comet. When we're done, we end our experiment and analyze our results. Now, we're not actually going to run this code in this environment, because it would take a while on CPUs. Instead, we went ahead and ran this code in another environment and logged the results to a Comet experiment, which we can view now. This image was generated using the exact same code as above, just run on a system with GPUs so it could finish a bit faster. As you can see, we're not quite there yet. We've definitely painted over the cat, but instead of a dragon, we have a frankly terrifying, sort of skeletal, lizard-looking thing.
The first thing we might try in this situation is simply increasing the inference steps to a much larger number, say 100. If you look, this is the exact same code as before; we've just changed the inference steps from 3 to 100. As before, we're not going to run this code, but we do have another Comet experiment that was run using this exact same code, and we can view its outputs right now. Okay, this is looking much better. However, we still have some features that look a little odd, and I think we could do better.
The next thing we might try is tinkering with another hyperparameter called the guidance scale. The guidance scale is a numeric value that determines how closely the model should follow the prompt. Basically, you can think of it as scaling the effect that the text input has on the target distribution during the denoising process. A higher guidance scale means higher prompt fidelity, but potentially lower image quality.
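Before we get to a concrete guidance-scale example, here is a minimal sketch of the pipeline setup and the first inference-steps run described above. It assumes the stabilityai/stable-diffusion-2-inpainting checkpoint and reuses the image and mask variables from the earlier sketch; the prompt, seed, and image name are placeholders.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline

# Pick the device and a matching precision: half precision on GPU saves memory.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Assumed checkpoint; the lesson uses a Stable Diffusion 2 inpainting model.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=dtype,
).to(device)

# Fix the seed so the run is reproducible.
seed = 42
generator = torch.Generator(device=device).manual_seed(seed)

prompt = "a realistic green dragon"
num_inference_steps = 3   # deliberately small for this first experiment

exp = comet_ml.Experiment()
output = pipe(
    prompt=prompt,
    image=image,          # the photo downloaded earlier
    mask_image=mask,      # the cat segmentation mask
    num_inference_steps=num_inference_steps,
    generator=generator,
)
result = output.images[0]

# Log the image along with the settings that produced it.
exp.log_image(result, name="dragon-3-steps")
exp.log_parameters({
    "prompt": prompt,
    "seed": seed,
    "num_inference_steps": num_inference_steps,
})
exp.end()
```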
You can see that in this example. I gave a diffusion model the prompt "an oil painting of a tabby cat dressed like a character in The Matrix." When the guidance scale is set lower, we get an image of a cat that's a pretty good painting, but there's not a lot that's Matrix-like about it. When the guidance scale is set higher, in this case twice as high, we get an image of a cat that has some Matrix sunglasses going on, but the image of the cat itself is not very good. One thing to note about the guidance scale is that the default is going to be different for each diffusion model; there's no standard across the field. For Stable Diffusion 1.5, which is a smaller model you might use on CPUs, 7.5 is the default value. For other models it's 5, and for others it's 1. You really just need to check the model specifications to know what's right.
Returning to our code, what we want to do next is run an experiment where we try several different guidance scale values. To start, we'll make a NumPy array of the different guidance scale values we'll use in our pipeline. With the guidance scales set, we'll run some code that's very similar to the code we wrote previously. The difference is that this time we're running multiple generations and passing the guidance scale parameter into our pipeline as well. Again, we're not going to run this code here; it would take quite a long time on CPUs. But we do have another Comet experiment where we ran the same code and logged its outputs already. By calling display on our experiment and passing in the tab="images" argument, we're able to see the images dashboard of the real Comet application. Here you can see all the image metadata, and here we can see our different outputs. These two images seem to be our best; they were generated at a guidance scale of 10 and of 20. So we know that our best outputs seem to occur when the guidance scale is somewhere between 10 and 20. We might do a further experiment, trying more guidance scale values between 10 and 20, until we find the perfect number. For now, let's move on to trying a different hyperparameter.
The next hyperparameter we're going to look at is one that is unique to image-to-image applications of diffusion models: the strength hyperparameter. When we use a diffusion model to edit an image, as we're doing in inpainting, we have to add a lot of noise to it to delete the section that we want to replace. How much noise we add determines how much information gets removed. If we add a little noise, the image will look very similar to the way it looked originally. If we add a lot of noise, the section we want to replace will be almost entirely removed. Strength is what determines how much noise gets added.
Let's run an experiment with strength, similar to our guidance scale experiment, where we iterate across a number of possible values. As you can see here, the code is largely the same. We're creating a set of possible values for strength, iterating across them, and passing each one into our Stable Diffusion pipeline. For each of our outputs, we log the image and the metadata to Comet so that we can analyze it later. As before, we're not going to run this experiment here; instead, we'll view the results via a Comet experiment. As we can see, when strength is very low, the cat has hardly changed at all.
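For reference, here is a sketch of what that kind of sweep loop might look like, reusing the pipe, image, mask, device, seed, and prompt names from the earlier sketches; the range of guidance scale values is an assumption. The strength experiment follows the same pattern, passing a strength value instead of guidance_scale.

```python
import numpy as np

guidance_scales = np.arange(2.5, 22.5, 2.5)   # assumed range of values to try

exp = comet_ml.Experiment()
for gs in guidance_scales:
    # Re-seed each run so the only thing that changes is the guidance scale.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(
        prompt=prompt,
        image=image,
        mask_image=mask,
        num_inference_steps=100,
        guidance_scale=float(gs),
        generator=generator,
    ).images[0]
    exp.log_image(result, name=f"guidance-scale-{gs}")

exp.log_parameters({"prompt": prompt, "seed": seed, "num_inference_steps": 100})

# View the images dashboard for this experiment inline.
exp.display(tab="images")
exp.end()
```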
As strength increases, more and more of the cat is removed. Unfortunately, changing the strength doesn't seem to have led to a better image for us. That's okay; that's part of the experimentation process.
The final input we want to explore adding to our Stable Diffusion pipeline is the negative prompt. The negative prompt should be very intuitive: it's simply a prompt that tells the model what you want the image to not look like. So, because we used "a realistic green dragon" as our prompt, we might give it a negative prompt like "cartoon." This should reinforce the realism of our output. Again, you can see that our code is largely unchanged. We're using the best results from our previous experiments, setting the guidance scale to 10 and using 100 inference steps if we have access to a GPU, and we're also passing in this negative prompt. Again, we're not going to run the code here, but we've run this code elsewhere and logged the output to a Comet experiment, which we can view right now. And look at that: we've refined our dragon to get rid of some of the features we didn't love in the previous generations, while enhancing the good qualities.
With everything you've learned here and everything we've covered in other lessons, you should be able to build a pipeline where you take an image, generate a mask, use inpainting to edit it, and do it all in a streamlined fashion, sort of like a Stable Diffusion powered Photoshop. That's it for this lesson. In the next lesson, we're going to go one step further with diffusion models, and we're going to learn how to teach a diffusion model to generate an image of something it's never seen before.
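Putting the pieces together, here is a hedged sketch of that final negative-prompt run, again reusing the names from the earlier sketches; the reduced CPU step count and the logged image name are assumptions, not settings from the lesson.

```python
# Best settings from the experiments above, plus a negative prompt to push the
# output away from a cartoon look.
exp = comet_ml.Experiment()
generator = torch.Generator(device=device).manual_seed(seed)
result = pipe(
    prompt="a realistic green dragon",
    negative_prompt="cartoon",
    image=image,
    mask_image=mask,
    guidance_scale=10,
    num_inference_steps=100 if device == "cuda" else 10,   # fewer steps keeps a CPU run tractable
    generator=generator,
).images[0]
exp.log_image(result, name="final-dragon")
exp.end()
```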