Diffusion models are powerful tools for image generation, but sometimes no amount of prompt engineering or hyperparameter optimization will yield the results you're looking for. When prompt engineering isn't enough, you may opt for fine-tuning instead. Fine-tuning is extremely resource intensive, but thankfully there's a method of fine-tuning Stable Diffusion that uses far fewer resources, called DreamBooth. In this lesson, you'll learn how to fine-tune Stable Diffusion with just a few images in order to produce your own custom results. Let's jump in.

Previously, you've seen how manipulating your prompts and parameters can allow you to control the output of diffusion models. Now you're going to explore ways of teaching a model to generate images of a subject it's never seen before. For example, how might we teach a model to generate an image of Andrew as a Van Gogh painting? To achieve this, you're going to learn two new techniques for fine-tuning diffusion models. The first technique is called DreamBooth. DreamBooth is designed to teach diffusion models new subjects using very small amounts of example data. In this project, for instance, we'll only be using six images of Andrew. The second technique is called LoRA, which stands for low-rank adaptation. We'll get into more detail on LoRA later, but for now, let's start with DreamBooth.

Before we talk about the mathematics behind DreamBooth, first think about why it might be difficult to fine-tune a diffusion model with such a small dataset. What problems might this cause? The first and most obvious issue has to do with the robustness of your data. Six headshots of Andrew might be enough for the model to generate a decent headshot, but it's unlikely that the model will be able to generate images of Andrew in other contexts. The second issue has to do with language drift. When we fine-tune models, we're always walking a tightrope between improving the model's performance on our downstream task and damaging its performance on other tasks. In other words, imagine we aggressively tune the model with our six photos of Andrew, each paired with a prompt like "a photo of a man named Andrew". It's very likely that the model's embedded definitions of "photo" and "man" will begin to drift to match the distribution of our small, narrow dataset, and that will then damage the quality of the model if we use it later to generate anything besides a photo of Andrew. For example, it might conclude that all men look like Andrew, or that every photo needs to have Andrew in the frame.

DreamBooth solves both of these problems, in large part by leveraging the diffusion model's existing understanding of the world. The authors of the DreamBooth paper call this "semantic prior knowledge". You can think of this like using context clues when you're reading a new word. When the model sees a prompt like "a photo of a man named Andrew", it might not know what Andrew looks like, but it knows generally what a photo of a man looks like. It can then leverage this information in training, allowing us to get by with a much smaller example dataset. More concretely, this is what DreamBooth looks like in practice. First, you select a token which the model rarely sees, like the bracketed "[V]". Your training loop will overwrite the information in the model associated with this token. This is similar to many other techniques for altering diffusion models you might run into.
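To make the idea of a rare token concrete, here's a small, hypothetical check you could run with the text tokenizer used by Stable Diffusion 1.5. This snippet is not part of the lesson's trainer utility, and the model ID is an assumption for illustration:

```python
from transformers import CLIPTokenizer

# Hypothetical check: load the text tokenizer that ships with
# Stable Diffusion 1.5 (the model ID is an assumption for illustration).
tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)

# Inspect how a candidate rare token gets split into subword pieces.
# A token the model rarely sees carries little prior meaning to overwrite.
print(tokenizer.tokenize("a photo of a [V] man"))
```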
You then pair this rare token with another token that generally describes the class your subject belongs to. For example, for Andrew, we might select the token pair "[V] man". We then associate prompts which include this token pair with our images of Andrew. So, for example, if we had an image of Andrew playing basketball in our dataset, we might associate it with the prompt "a photo of a [V] man playing basketball". The model should be able to use its prior understanding of all the other words in that sentence to guide its generation. So, even if we're not giving the model a super robust dataset, because it knows generally what a photo of a man playing basketball should look like, it can hopefully pick up on the specific differences that differentiate the photo of Andrew playing basketball from a generic photo of a man playing basketball. After fine-tuning, the model should associate the token pair "[V] man" specifically with Andrew.

Now, this might solve the issue of having a small dataset, but it does not solve the language drift issue we discussed earlier. In fact, based on everything we've described so far, it might seem like we're almost certainly going to destroy the model's definition of "man" in the training process. DreamBooth addresses this problem in a really clever way. The authors of the paper created a custom loss metric, which they call prior preservation loss, which implements an interesting form of regularization. Prior preservation penalizes the model for drifting too far from its existing understanding of the world. Generally speaking, it works like this. You create a dataset of images paired with prompts. In this case, you might have six photos of Andrew with captions like "a photo of a [V] man". In DreamBooth parlance, we call these the instance images and instance prompts. In this project, you're going to use a single prompt for every one of your images, but you should experiment with writing custom prompts for each image. You can even use a model like BLIP to automatically generate captions for your images, which is really useful when you're working with a larger number of images.

You then select a prompt that is representative of all the concepts you're afraid your model is going to lose during fine-tuning; in the DreamBooth literature, this is called the class prompt. For example, if your instance prompt is "a photo of a [V] man", your class prompt would simply be "a photo of a man". Using the class prompt, you generate 100 to 200 images from the diffusion pipeline. With these images, you can construct a robust distribution representing the model's prior understanding of the prompt's concepts. During the training process, you then calculate two losses. First, you calculate the loss against the instance data: basically, how good is the model at reconstructing these images of Andrew when we give it our modified token pair? Second, you calculate the loss for the class data. In other words, how close does the model come to generating an image from the prior distribution when prompted with the class prompt? Because the class distribution is sampled directly from the model before you conduct any fine-tuning, it gives you a concrete basis for measuring how far the model has drifted. Basically, if, given the same prompt, the model generates a wildly different image before and after fine-tuning, we know that there's been some drift.
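To make this concrete, here's a minimal sketch of how the two losses can be combined, assuming the instance and class examples are stacked in the same batch. The function and variable names are illustrative, not the lesson's exact trainer code:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of prior preservation loss (names are illustrative).
# The batch stacks instance and class examples, so we split the U-Net's
# noise predictions in two before computing the losses.
def prior_preservation_loss(noise_pred, noise_target, prior_loss_weight=1.0):
    # First half: instance images (photos of Andrew);
    # second half: class images ("a photo of a man").
    pred_instance, pred_class = torch.chunk(noise_pred, 2, dim=0)
    target_instance, target_class = torch.chunk(noise_target, 2, dim=0)

    instance_loss = F.mse_loss(pred_instance, target_instance)
    class_loss = F.mse_loss(pred_class, target_class)

    # The class loss penalizes drift from the model's prior distribution.
    return instance_loss + prior_loss_weight * class_loss
```

With a prior loss weight of 1.0, drift away from the prior is penalized just as heavily as errors on the instance images.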
You can then combine these two terms, which we'll call the instance loss and the class loss, to give you a comprehensive loss metric. Now that we're familiar with DreamBooth from a theoretical perspective, let's try actually implementing it. Much of the code you'll be using throughout this section is adapted from Hugging Face's DreamBooth training scripts, which you can find linked at the bottom of this notebook. You're encouraged to explore these examples to learn about all the complex optimizations that are available. In this example, we've done everything we can to simplify the code and highlight the high-level concepts. You'll also be using Comet to log your training metrics throughout this exercise. Comet automatically integrates with many of the libraries you'll be using throughout this project, so it's really important that we import and initialize Comet before we import any other libraries.

Most of our focus throughout this project is going to be on the hyperparameters and how we can tune them to control the outputs of our diffusion model. We'll be storing our hyperparameters in a simple dictionary that we can update throughout the lesson. Most of these terms should be familiar to you already, but there are a couple of things we want to call out here. First, you should notice that we're importing a different model depending on whether or not you have access to a GPU. If you have a GPU, we're going to use Stable Diffusion XL. This is one of the larger Stable Diffusion models and produces really high-quality images. If you're running on a CPU, we're going to use Stable Diffusion 1.5, which is an earlier model that can still generate high-quality images; it's just not quite as powerful. We have our instance prompt and our class prompt. We also have a manual seed that we're setting to make sure that our results are reproducible. You can see here that we set the resolution to 1024 pixels if we have a GPU, and if not, we set it to 512. This is because different Stable Diffusion models do better at different sizes. Additionally, we're going to set our number of inference steps to 50, our guidance scale to 5.0, our default number of class images to 200, and our prior loss weight to 1.0. The prior loss weight is a numeric value that scales our class loss when we're calculating our prior preservation loss. If all we care about is making sure our model learns our new concept, we can set this to zero. If we're really concerned with making sure the model doesn't drift during the training process, we can increase this to a higher value.

Next, we want to initialize a Comet experiment that we can use throughout the project. Before you can start training your model, you're going to need to generate your class images. We're going to write the code for generating your class images, but we're not actually going to run it, because doing so on a CPU can take quite a bit of time. Instead, we've provided the entire dataset of class images as a Comet artifact that we can download immediately. As a refresher, a Comet artifact is an asset that's stored in Comet and version controlled. The easiest way to generate your class images is to use the DreamBooth trainer utility class we've provided. We're going to use this a lot throughout this project to abstract away some of the boilerplate you'd otherwise need to run your training pipeline.
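A hyperparameter dictionary matching the values described above might look something like this. The key names and the seed value are illustrative assumptions, not the notebook's exact code:

```python
import torch

# Illustrative hyperparameter dictionary mirroring the values discussed
# above (key names, model IDs, and seed are assumptions for illustration).
use_gpu = torch.cuda.is_available()

hyperparameters = {
    "pretrained_model": (
        "stabilityai/stable-diffusion-xl-base-1.0" if use_gpu
        else "runwayml/stable-diffusion-v1-5"
    ),
    "instance_prompt": "a photo of a [V] man",
    "class_prompt": "a photo of a man",
    "seed": 42,                          # manual seed for reproducibility
    "resolution": 1024 if use_gpu else 512,
    "num_inference_steps": 50,
    "guidance_scale": 5.0,
    "num_class_images": 200,
    "prior_loss_weight": 1.0,            # scales the class loss
}
```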
Now, if at any time you get curious about what's actually going on under the hood with these utility methods, you can always use the double question mark operator to look inside the source code. To download our class dataset from Comet, we're going to make use of the experiment we initialized earlier. Once that dataset has been downloaded, we can take a peek at it by using the display images method on our trainer. As you can see, Stable Diffusion 1.5 does a decent job of generating images of men, although it seems to have a fondness for black-and-white photos and mustaches. Next, we need to download our instance dataset, which consists of images of Andrew. We have that saved as a Comet artifact as well, and we can download it using very similar code. Once it's downloaded, we can take a peek inside using the same display images method we used earlier. There are two really important things to note when looking at this dataset. First, notice that the images are of different sizes and different quality. This isn't a dataset that we did a lot of preprocessing on, which illustrates that with DreamBooth, you can tune a model really effectively using data that's readily available to you. The second important thing to notice is that Andrew owns a lot of blue shirts.

Now that our datasets are downloaded, we can move on to initializing our models. If you recall, a diffusion pipeline consists of a text encoder model, a variational autoencoder, and a U-Net model that is used for the actual denoising process. We can initialize all of these, along with our tokenizer, by using the trainer's initialize models method. Adding random noise to our images at different timesteps is an essential part of the diffusion process, so we're going to need a noise scheduler to generate our noise. Hugging Face's diffusers library provides a lot of really nice abstractions for accessing these kinds of schedulers.

Now, we're almost ready to begin training our model. But first, we need to talk about LoRA, the technique we'll be using to actually fine-tune the model. One of the key challenges behind fine-tuning large models is that weight matrices are, at the risk of sounding obvious, really large. If your weight matrix is of size M x N, then you may need to compute a gradient matrix that's also of size M x N for every single update. With many optimizers, such as Adam with momentum, you actually compute multiple such matrices, leading to a very large memory footprint. However, the update matrix that you generate during the fine-tuning process also tends to be really low rank. That means that the rank of the matrix, or the number of linearly independent columns, tends to be much smaller than the actual number of columns in the matrix. Intuitively, this tells us that we can probably capture most of the important information for updating our weights in smaller matrices. This is essentially what LoRA does. Instead of computing a full M x N update matrix to update our weights at every single step of the training process, LoRA trains a new set of smaller weights, which we call adapters, which are then added to the original weights. Mathematically, this is what LoRA looks like. In a normal training update, the weights at the next timestep equal the weights at the current timestep plus an update matrix, where the weights and the update matrix are both of size M x N. In LoRA, the weights at timestep t are equal to the original weights plus the product of two matrices, A and B, taken at timestep t. Matrix A is an M x R matrix, matrix B is an R x N matrix, and R is less than the smaller of M and N.
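Here's a minimal PyTorch sketch of that idea, wrapping a LoRA adapter around a frozen linear layer. This is illustrative only; in this project, the diffusers tooling attaches the adapters to the U-Net's attention layers for you:

```python
import torch
import torch.nn as nn

# Minimal sketch of a LoRA adapter around a frozen linear layer
# (illustrative; not the library's actual adapter implementation).
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original M x N weights stay frozen

        m, n = base.out_features, base.in_features
        # Only these two small matrices are trained: A is M x R, B is R x N.
        # A starts at zero so the initial update A @ B contributes nothing.
        self.A = nn.Parameter(torch.zeros(m, rank))
        self.B = nn.Parameter(torch.randn(rank, n) * 0.01)
        self.scale = alpha / rank  # common LoRA scaling convention

    def forward(self, x):
        # W x + scale * (A B) x: the low-rank update is added to the
        # frozen layer's output.
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)
```

Training only A and B means R x (M + N) trainable parameters instead of M x N, which is where the memory savings come from.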
Essentially, we take two smaller matrices and train them instead of training the original weights. These smaller weights can then be added to the initial weights to achieve the effect of fine-tuning. Besides the reduced memory footprint, one of the nice parts of LoRA is that you can have many different sets of LoRA adapters that you can switch in and out easily. So, for instance, if you later fine-tuned a LoRA adapter on photos of a different person, you could easily swap it in for your Andrew adapters. In this example, we're only going to be tuning the U-Net model, so it's the only model we need to prepare for LoRA. In addition, we also need to initialize our optimizer and extract the parameters we're going to be optimizing. Finally, we need to initialize our training dataset, our training data loader, and our learning rate scheduler. We'll then load all our relevant models, along with our datasets, into Accelerate, a Hugging Face library that allows us to train more efficiently.

Now let's get into our actual training loop. First, we need to calculate our total batch size. We're going to use gradient accumulation in this pipeline, a technique where we only update the weights every few steps. Because of this, we need to set our total batch size to our number of gradient accumulation steps times our single-step batch size. Now for the training loop itself. First, we want to set up the progress bar for tracking our training. The diffusion process has a few steps. First, we want to convert our image into its latent representation. To do this, we pass the pixel representation of our image into our variational autoencoder and then sample from the latent distribution. Once we've converted our image into its latent representation, we need to sample the noise that we're going to add to the latents as we train. Finally, to complete the forward diffusion process, we add our noise at random timesteps. Now that the forward diffusion process is done, we need to perform reverse diffusion. To do this, we get the embeddings of our prompt and use the U-Net model to predict the noise in the image. Once we have our prediction, we can calculate our prior loss and our instance loss and combine them, weighted by the prior loss weight, to calculate the prior preservation loss. Once our losses have been calculated, we can perform backpropagation and step our optimizer as you would in any other training loop; a condensed sketch of one such training step appears below. Now, at the end of each epoch, we want to log our loss metrics to Comet. At the end of our training, we want to save our LoRA weights, log our parameters to Comet, and add a little tag so that we remember that this was a DreamBooth training project. Then, we can call end training on our accelerator, and we're done.

Again, we're not going to actually run this training code in this environment, because on CPUs it would take quite a long time. However, we do have an existing Comet experiment taken from a training run of this exact same model, which we can use to analyze the results. We can call display on this experiment and see the metrics that it logged to Comet. Now we can edit the layout to get a better look at our most important metrics. In particular, we want to see how the loss, the prior loss, and the instance loss all compare.
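Here is the condensed training-step sketch mentioned above, following the forward and reverse diffusion steps just described. The function signature and variable names are illustrative assumptions, not the trainer's exact code:

```python
import torch
import torch.nn.functional as F

# Condensed sketch of one DreamBooth training step using diffusers models
# (names like batch["pixel_values"] are illustrative assumptions).
def training_step(vae, unet, text_encoder, noise_scheduler, batch):
    # 1. Forward diffusion: encode pixels into the latent space and sample.
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

    # 2. Sample noise and random timesteps, then noise the latents.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # 3. Reverse diffusion: embed the prompt and predict the noise.
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

    # 4. Loss against the true noise; in the full pipeline this is split
    #    into instance and class halves and combined with the prior loss
    #    weight, as in the earlier prior_preservation_loss sketch.
    return F.mse_loss(noise_pred, noise)
```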
Throughout this course, we've emphasized the importance of getting hands-on with your image data and visually inspecting your model's output. This project really drives that point home, because if we look at these loss metrics, you'll notice something really interesting. First, you can see that the loss seems to overcorrect back and forth between the prior loss and the instance loss: the peaks in the prior loss tend to correspond to valleys in the instance loss, and vice versa. This kind of seesaw effect might make you think that the prior preservation loss didn't really work, as the model doesn't seem to converge. However, if we look at the outputs of the model, we'll see a very different story.

To illustrate this, we're going to use our model to generate a bunch of different images of Andrew and a bunch of images that don't include Andrew. To start, let's set up some prompts and some validation prompts. You can see we have several prompts that are targeted at generating an image of a man who looks like Andrew, and then we have several equivalent prompts that remove the "[V]" token. Once we have our prompts in place, we want to create a diffusion pipeline out of the model we just finished training. If you look, we initialize a pre-trained diffusion pipeline using the Stable Diffusion 1.5 model, then we load our LoRA weights onto that model. With our pipeline, we iterate over each prompt, generate an image, and log it to Comet. Again, we're not going to run this code here, because it would take some time on CPUs. But we've already run this on Comet, and we've stored the results in an experiment that we can visualize below. We can load our existing experiment like this, and then we can view the images that were logged by passing the tab equals images argument to the experiment's display method.

This is what the actual images tab in the Comet dashboard will look like. In it, we can see the images that were logged by the pipeline. If we click on one of them, we can see that this was one of our validation prompts, as there's no "[V]" token here. Looking at the image, it makes sense: it looks like a normal man. If we find its equivalent instance prompt, however, we see that when we add the "[V] man" token pair, the mural that is generated looks a lot like Andrew. Similarly, this image of a "[V] man" playing basketball looks a lot like Andrew, whereas this image of a man playing basketball doesn't look like Andrew at all. Even though the loss curves wouldn't tell us that our model had improved, by looking at the outputs we can see that it clearly is learning. The key for us is to visually inspect the outputs of our model, no matter what the loss curves are saying. Oftentimes, we'll be surprised by the results.

That brings us to the end of this lesson. You're encouraged to take things a step further and try experimenting with some other hyperparameters. You can also try to tune a larger Stable Diffusion model using a GPU environment like Google Colab. You can even collect the dataset yourself or try targeting a different token. Using the code provided here, you should be able to take on a diverse array of projects.
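If you want to adapt the inference step to your own projects, a minimal sketch of the pipeline setup described above might look like this. The model ID and weights path are assumptions, not the notebook's exact values:

```python
import torch
from diffusers import StableDiffusionPipeline

# Minimal sketch of the inference setup described in this lesson
# (model ID and weights path are illustrative assumptions).
device = "cuda" if torch.cuda.is_available() else "cpu"

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

# Load the LoRA adapters saved at the end of training onto the base model.
pipeline.load_lora_weights("path/to/lora-weights")

# Instance prompt ("[V] man") vs. a validation prompt without the token.
image = pipeline(
    "a photo of a [V] man playing basketball",
    num_inference_steps=50,
    guidance_scale=5.0,
).images[0]
image.save("v_man_basketball.png")
```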