In this lesson, you will explore multimodal data and learn how to use the BridgeTower model to embed image-caption pairs into a common multimodal semantic space. You will work with real-world image-text datasets and explore how to measure similarity between different examples. Finally, you will use UMAP to visualize these high-dimensional embeddings. So let's dive in.

So in this lesson, we are going to learn how to embed vision-language data into a common multimodal semantic space, which is the first thing we need in order to process videos. Imagine you have this image-text pair about a cat. This paired multimodal data can be processed by a multimodal embedding model to obtain multimodal embeddings. Specifically, in this lesson we will be using a model called BridgeTower to embed the image-text data into a 512-dimensional vector in a common multimodal semantic space. A property of this multimodal semantic space is that if we were to take another image-text pair about a cat, its embedding will be close to the first embedding. Also, if we take a different concept, for example an image-text pair about a car, its embedding will be far from the embeddings of the cat. In the notebook section, we will learn how to measure how close or far apart these embeddings are. Since it is not possible to visualize a 512-dimensional space, in the notebook we will also project these embeddings into two dimensions using UMAP and experimentally observe how the multimodal embeddings of cats are clustered together and well separated from the embeddings of cars.

We will be using the BridgeTower model throughout this lesson. This model was developed by our Multimodal Cognitive AI team at Intel Labs with collaborators at Microsoft Research. The architecture of the model should remind you of a bridge. The left tower is a text transformer that processes text tokens. The right tower is a vision transformer that processes image patches. To combine the two modalities, we introduce transformer blocks that connect the hidden representations of the text and vision transformers. The cross-attention blocks here fuse together representations of text and image and give us cross-modal joint embeddings at the output of the model. In our multimodal RAG system, we will be using the BridgeTower model for embedding video segments and their associated text.

So now let's go to our notebook exercise. Welcome to the notebook for lesson two. In this lesson, we are going to practice creating multimodal embeddings from image-text pair data using the BridgeTower model. Let's start with a small example of three image-text pairs defined by this dictionary. The three fields of the dictionary point to the URL of the image, the caption text associated with the image, and the path that we want to download the image to. We will loop through these entries and download the image and text data. Now let's take a look at the downloaded images. We see that the first image is about a motorcycle, along with its caption. The second image is also about a motorcycle with its caption, but the third image is about a cat with its caption. Now we want to process the image-text pairs through a multimodal embedding model and convert them into multimodal embeddings. For the BridgeTower model, we are going to use API calls from Prediction Guard, which has hosted this model on Intel Gaudi 2 AI accelerators in the Intel Developer Cloud.
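Here is a minimal sketch of what that data-preparation step might look like. The URLs, field names, and paths below are illustrative placeholders, not the notebook's exact values.

```python
# A minimal sketch of the image-text pair dictionary and download loop.
# URLs and field names are hypothetical placeholders.
import os
import requests

img_txt_pairs = [
    {
        "image_url": "https://example.com/motorcycle1.jpg",   # hypothetical URL
        "caption": "a red motorcycle parked on the street",
        "image_path": "./downloads/motorcycle1.jpg",
    },
    {
        "image_url": "https://example.com/cat1.jpg",           # hypothetical URL
        "caption": "a tabby cat sitting on a windowsill",
        "image_path": "./downloads/cat1.jpg",
    },
]

os.makedirs("./downloads", exist_ok=True)

for pair in img_txt_pairs:
    # Download each image to the local path given in the dictionary.
    response = requests.get(pair["image_url"], timeout=30)
    response.raise_for_status()
    with open(pair["image_path"], "wb") as f:
        f.write(response.content)
```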
In particular, the function bt_embeddings from Prediction Guard will invoke this model. So now we loop over the three pairs: we encode each image as a base64 string, call the BridgeTower model on it, and collect all of the resulting embeddings in a list. We notice that each embedding has a dimensionality of 512, which means each image-text pair has been converted to a vector of dimension 512 in this multimodal semantic space.

So how can we measure similarity or distance between these embeddings? A very popular method is to measure the cosine similarity between two vectors. Cosine similarity measures the angle between two vectors: this function, called cosine_similarity, takes two embedding vectors, computes the dot product between them, and normalizes it by the magnitudes of the two vectors. We will now compute the similarity between the first two examples and between the first and the third examples, and display the similarity scores. We see that the similarity between the first two examples is much higher than the similarity between the first and the third example. This is expected, because the first two examples are image-text pairs related to motorcycles, while the third example is an image-text pair related to a cat. So by measuring cosine similarity, we can figure out how close or far apart these multimodal embeddings are.

There are other methods of measuring distances between vectors in a high-dimensional space. Here, let's also compute the Euclidean distance between these vectors by computing the L2 norm; we use a function in the cv2 library for this. Here too, we find that the distance between the first two examples is smaller than the distance between the first and the third example. There are still other ways of computing these distances, but cosine similarity is very popular because it is computationally cheap, and it is often used in loss functions for such models.

When working with a large number of image-text pairs, it can be useful to visualize the distribution of the multimodal embeddings. Since it is not possible for us to visualize a 512-dimensional space, we will first project these embeddings into two dimensions. Such analysis and visualization is very useful for making sense of your data, and it can help uncover hidden insights. We will use the UMAP transformation to project the 512-dimensional vectors down to two dimensions for visualization. So let's prepare some data with 50 image-text pairs of cats and 50 image-text pairs of cars. Taking a look at this data, we see that the first few examples are images of cats with their captions, followed by images of cars with their captions. Just like before, we will compute embeddings using the BridgeTower model on all of these images. Specifically, we will loop over all the cat images, convert them into base64 strings, and supply the text and base64 image to the BridgeTower model, collecting all of the cat embeddings in one list. We will do the same thing for all the car embeddings. This will take about a minute on your computer.
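Here is a minimal sketch of the embedding and similarity computations described above. The bt_embeddings helper stands in for the course's Prediction Guard utility; its exact signature here is an assumption, while the cosine similarity and cv2 L2 norm follow the definitions given in the lesson.

```python
# Sketch of base64 encoding, cosine similarity, and Euclidean (L2) distance.
import base64

import cv2
import numpy as np


def encode_image_base64(image_path):
    # Read the image bytes and encode them as a base64 string.
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def cosine_similarity(vec1, vec2):
    # Dot product of the two vectors, normalized by their magnitudes.
    v1 = np.array(vec1, dtype=np.float32)
    v2 = np.array(vec2, dtype=np.float32)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


def euclidean_distance(vec1, vec2):
    # L2 norm of the difference between the two vectors, computed with cv2.
    v1 = np.array(vec1, dtype=np.float32)
    v2 = np.array(vec2, dtype=np.float32)
    return float(cv2.norm(v1, v2, cv2.NORM_L2))


# Hypothetical usage, assuming bt_embeddings(text, base64_image) returns a
# 512-dimensional list of floats:
# emb1 = bt_embeddings(caption1, encode_image_base64(path1))
# emb2 = bt_embeddings(caption2, encode_image_base64(path2))
# print(cosine_similarity(emb1, emb2), euclidean_distance(emb1, emb2))
```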
So now that we have all the embeddings collected, we are going to define a function that computes the UMAP projection. This function takes as one of its inputs the embeddings, which live in a 512-dimensional space, and uses the UMAP transformation to project the data into two dimensions. We convert the result into a pandas dataframe and also add a column for labels, so that we can assign each data point to its class. Let's stack up the embeddings of the cat and car datasets, create a label indicating cats or cars for each of the data points, and call our dimensionality reduction function. In the output we see that we have 100 examples: 50 for cats and 50 for cars. Each 512-dimensional embedding has now been reduced to two dimensions in an x-y space. So now we can go ahead and plot this data. We see that in this two-dimensional UMAP space, there is a cluster of orange points that belong to the image-text data of cars, and a cluster of blue points that belong to the image-text data of cats.

As we come to the end of this lesson, we discussed how to prepare image-text pair data and compute multimodal embeddings on it using the BridgeTower model. We learned how to compare multimodal embeddings using cosine similarity. We also learned how to do dimensionality reduction and visualize these multimodal embeddings in a two-dimensional space to understand aspects of your data. At this point, please experiment with your own images and captions. Once you have done that, I will see you for lesson three.
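If you want a starting point for that experimentation, here is a minimal sketch of the UMAP projection and plotting step described above. The function name, column names, and the cat_embeddings and car_embeddings lists are illustrative assumptions, not necessarily the notebook's exact ones.

```python
# Sketch of projecting 512-dimensional embeddings to 2D with UMAP and plotting.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import umap  # provided by the umap-learn package


def dimensionality_reduction(embeddings, labels):
    # Project the 512-dimensional embeddings down to 2 dimensions with UMAP.
    reducer = umap.UMAP(n_components=2, random_state=42)
    projected = reducer.fit_transform(np.array(embeddings, dtype=np.float32))
    df = pd.DataFrame(projected, columns=["x", "y"])
    # Add a label column so each point can be colored by its class.
    df["label"] = labels
    return df


# Hypothetical usage, assuming cat_embeddings and car_embeddings were collected
# as described in the lesson:
# embeddings = cat_embeddings + car_embeddings
# labels = ["cat"] * len(cat_embeddings) + ["car"] * len(car_embeddings)
# df = dimensionality_reduction(embeddings, labels)
#
# for name, group in df.groupby("label"):
#     plt.scatter(group["x"], group["y"], label=name)
# plt.legend()
# plt.show()
```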