In this lesson, you'll learn how a concept is understood across multiple modalities and then implement multimodal retrieval using Weaviate, an open source vector database. You'll build a text-to-any search as well as an any-to-any search. All right, let's have some fun.

Even though you can't hear this video, you can probably hear the lion roaring in your mind's ear. This is because humans are very good at inferring information across modalities. So good, in fact, that we can even do it in the absence of the other modality, here being the missing sound. Or in the case of this video of a train passing by, you can probably imagine the choo-choo sound, or even feel the wind or the ground shaking. This is because we are very good at understanding information from all of our senses: what did we see, hear, feel, or smell? And we can understand the very same data point using multiple senses, from multiple modalities. The same goes when you hear a sound like this one. You know straight away that this is a cash register. This is because your multimodal reasoning works in all directions.

The way a machine can gain a similar understanding of multimodal data is by creating a shared multimodal vector space, where similar data points, regardless of their modality, are placed close to each other. With a unified multimodal embedding space, you can perform text-to-any-modality search. For example, you can search for "lions roam the savannas" and in response get back multimodal data that represents the most similar content, whether that is text, images, audio, or even videos. You can even perform any-to-any search, where your query can be in any modality and your retrieved objects can be in all available modalities as well. For example, you could use an image of a lion or a video of a lion to find all the matching objects.

Now let's go over step by step how multimodal search works. First, if we take "a king of the jungle" and run it through a multimodal model, we get a vector embedding back. Then we could also take a video, pass it to the multimodal model, and we get another vector embedding back. And you can already see that these vector embeddings are pretty similar to each other. As we go, we could be loading millions of objects, if not billions, and eventually create a whole vector embedding space out of all our vectors. Then, for example, we could take a picture of a lion and pass it through the multimodal model, which gives us a new vector embedding. We can use that vector embedding to point into our vector space, which shows us where all the similar objects are, and in response we get another object, like this bunch of lions running.

Let's now see all of this in practice. In this lab, you'll add multimodal data to a vector database, then perform any-to-any search. All right, let's code. So here again, let's have the function to ignore all the unnecessary warnings. Now let's load the necessary API keys to run our large multimodal embedding model, like this. And now let's connect to Weaviate. For this we are going to use the embedded version of Weaviate, which allows us to run the database in memory. We need two things: the multi2vec module, which does the multimodal vectorization and also helps us with search, and then in here we need to pass in the necessary key for that model to work. So let's run the connection, and we are connected. Don't worry about the messages that you see here; those are just informational messages telling us that we are connecting to the database and about everything that is happening underneath.
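Just as a reference, the connection step looks roughly like this minimal sketch, assuming the Weaviate Python client v4 with the multi2vec-palm module enabled in embedded mode; the key variable, the header name, and the Weaviate version shown here are placeholders, so adjust them to your own setup.

```python
import weaviate

# Placeholder: the API key for the multimodal embedding model, loaded earlier.
EMBEDDING_API_KEY = "your-embedding-api-key"

client = weaviate.connect_to_embedded(
    version="1.24.4",                        # assumption: any recent Weaviate version
    environment_variables={
        # Enable the module that does the multimodal vectorization.
        "ENABLE_MODULES": "multi2vec-palm",
    },
    headers={
        # The key is passed through to the embedding model on every request.
        "X-PALM-Api-Key": EMBEDDING_API_KEY,
    },
)

print(client.is_ready())  # True once the embedded instance is up
```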
Now let's create the collection that we'll use for storing our vector embeddings and all our images, and we'll call it 'animals'. What we need to add next is a vectorizer, and this is a multimodal vectorizer, so it can work with text as well as all the other modalities. We are telling Weaviate that for vectorizing images we will use the image property, and for vectorizing videos we will use the video property. And then finally, we need to add additional information, like where the project is, and the most important one: which model we are using. For this lesson we'll be using multimodalembedding@001, and we want all our embeddings to have 1408 dimensions. Before we run it, I'm going to add this little conditional so that if you ever need to rerun it, we first check whether the animals collection exists, delete it, and then recreate the whole thing. Just be careful, because if you delete a collection, you lose all the data in that collection. Let's run it now, and we have an empty collection.

Next, let's add a helper function which, given a path, returns a base64 representation of the file. We need this function because that's how we pass any image or video file in for vectorization. So now that we have an empty collection and a function to convert files to base64, we can start inserting the images. We first grab our animals collection, and this animals object is going to be our way of communicating with this specific collection. Next, we want to find all the images inside the source image folder, open a batch process that limits importing to 100 objects per minute, and iterate through all the files inside that folder. Finally, for every file we print a message, grab the path to the file, and call batch.add_object, which is basically a way of telling Weaviate, "Hey, here's a new object; here's the name of the file, here's its path." The most important part is where we convert the image into its base64 representation, which is what gets turned into a vector embedding, and then there's a little label that we call media type. This is what we'll use later for displaying the results; it's how we know whether an object is an image or a video. If I run this, it will go and import all the data. Each object is imported as JSON, and together with it Weaviate generates a vector embedding that is attached to every single object, which later allows us to actually search through this collection. And finally, let's run this code, which checks for any failed objects. If everything was fine, we should see no errors. Which is great, we're good to go.

Now that we have the images in, it's time to start importing videos. Just a little warning: with the model that we are using, all videos should be at least 4 seconds long, otherwise the vector embeddings that we get back are not going to be great. The code is pretty much the same; this time we're going through the video folder, and instead of passing the file as the image property, we're passing it to the video property, and the label is also video.
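To make that concrete, here is a minimal sketch of the collection setup plus both import loops, assuming the v4 Python client and the multi2vec-palm vectorizer; the project id, region, and the ./source/image and ./source/video folder names are placeholders.

```python
import base64
import os

from weaviate.classes.config import Configure

def to_base64(path: str) -> str:
    """Read a local file and return its contents as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Recreate the collection from scratch (careful: deleting drops all its data).
if client.collections.exists("Animals"):
    client.collections.delete("Animals")

client.collections.create(
    name="Animals",
    vectorizer_config=Configure.Vectorizer.multi2vec_palm(
        image_fields=["image"],            # property holding base64 images
        video_fields=["video"],            # property holding base64 videos
        project_id="your-gcp-project",     # placeholder: your own project id
        location="us-central1",            # placeholder region
        model_id="multimodalembedding@001",
        video_interval_seconds=4,          # videos should be at least 4 seconds long
    ),
)

animals = client.collections.get("Animals")

# Import images, throttled to 100 requests per minute.
with animals.batch.rate_limit(requests_per_minute=100) as batch:
    for name in os.listdir("./source/image/"):
        print(f"Adding {name}")
        path = "./source/image/" + name
        batch.add_object({
            "name": name,
            "path": path,
            "image": to_base64(path),      # vectorized via the image property
            "mediaType": "image",
        })

# Import videos the same way, but into the video property.
with animals.batch.rate_limit(requests_per_minute=100) as batch:
    for name in os.listdir("./source/video/"):
        print(f"Adding {name}")
        path = "./source/video/" + name
        batch.add_object({
            "name": name,
            "path": path,
            "video": to_base64(path),      # vectorized via the video property
            "mediaType": "video",
        })

# Anything that failed to import shows up here.
if animals.batch.failed_objects:
    print(animals.batch.failed_objects)
```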
Vectorizing the videos should be a little bit slower than vectorizing the images, but it's still not too bad, so just give it a moment and you should be good to go. And one more time, let's check if there are any errors. No errors, so everything is good. And just to summarize, we can check how many objects we have and of what type, and we can see that we have nine images and six videos. That's exactly what I expected.

Now that we have all the data inside the database, we can start looking at search. But before we get to that, we need a few helper functions. These helper functions allow us to print our results in a nice manner, especially this one, display_media: given an object, depending on whether we labeled it as an image or a video, it will display it as an image or a video. So let's run this. The other set of helper functions are those that convert an image from a URL, or again a local file, to base64.

And now, this is the moment you've been waiting for, where the fun begins. If we grab our animals collection, we can run a query of type near text, which basically means that we can use text to search through our multimodal collection. So we are looking for "dog playing with a stick", we want to get back the name, path, and media type, and we just want the best results. Let's run it. That runs pretty fast, and we can then display all the results: we iterate over the objects and use our helper functions from before, like display_media. And you can see straight away that the first result is a video of a dog running with a stick, which is cool. But we also get a picture of a dog and another video of a dog giving a high five. How cool is that? I want to adopt this dog.

All right, so we've done text search, and you've probably done that before. Let's try something harder: how about we use this image as the input for a query? The query is actually very similar, except this time we are calling near image, and we are converting this file, the test cat, to its base64 format. Again we return the same properties and iterate over all the objects. Let's execute it, and in response we get this cat, and two more, which shows that the search works really well.

Now let's try to search with an image that is somewhere on the internet. We have this image of a meerkat, which, as you probably have guessed, is my spirit animal. We are going to run a very similar query, except this time we'll call url to base64; the rest of the query is pretty much the same. If we execute this, we get a picture of one meerkat that looks kind of angry, but he's all right, don't worry about him, and another one chilling. The fun thing is that we also matched a video of a meerkat. So with that, we performed a multimodal search where we used an image to find both other images and videos, and this is very powerful.

And now for what is probably the hardest task for a multimodal model, which is video search. Let's try to run the search with this video of two meerkats just chilling. This time we call the near media function, we again convert the video to its base64 representation, and we tell the collection that this time it's a video that we're searching with.
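Before we run it, here is how the three query styles we just used fit together in one minimal sketch, again assuming the v4 Python client; the test file paths are placeholders for whatever query image or video you have on hand.

```python
import base64

import requests
from weaviate.classes.query import NearMediaType

animals = client.collections.get("Animals")

def file_to_base64(path: str) -> str:
    """Base64-encode a local file for use as a query input."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def url_to_base64(url: str) -> str:
    """Download an image from a URL and base64-encode it."""
    return base64.b64encode(requests.get(url).content).decode("utf-8")

# 1. Text -> any modality
response = animals.query.near_text(
    query="dog playing with a stick",
    return_properties=["name", "path", "mediaType"],
    limit=3,
)

# 2. Image -> any modality (a local file here; a URL works via url_to_base64)
response = animals.query.near_image(
    near_image=file_to_base64("./test/test-cat.jpg"),        # placeholder path
    return_properties=["name", "path", "mediaType"],
    limit=3,
)

# 3. Video -> any modality
response = animals.query.near_media(
    media=file_to_base64("./test/test-meerkat-video.mp4"),   # placeholder path
    media_type=NearMediaType.VIDEO,
    return_properties=["name", "path", "mediaType"],
    limit=3,
)

# Each response can be iterated the same way, regardless of the query modality.
for obj in response.objects:
    print(obj.properties["name"], obj.properties["mediaType"])
```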
And if we execute that video query, in response we get two videos and a picture of a meerkat. Just like that, we're able to perform any-to-any search, from text to images to video, and get back all kinds of modalities in response.

In this part, what you want to see is what this vector space actually looks like when we load both video and image embeddings, and how they live in the same space: content that is similar should sit next to each other. Let's start by loading the necessary libraries. Probably the most important one here is UMAP, which allows us to reduce the dimensionality of the vectors, so we'll go from 1408 dimensions down to 2, which allows us to plot them as a two-dimensional image. Now we want to load the vector embeddings and the media type from Weaviate. We are using the iterator function, which goes through all the objects in our collection and gives us back the vector embeddings, but also lets us access all the properties, like the media type. You can see here that we are not calling the animals collection anymore; we prepared a different one for your benefit, because if we just tried to visualize a vector space of 15 objects, that wouldn't be very exciting. So we pre-loaded this database with 14,000 images and videos so that we can get some better results. Let's run this and quickly pull all the vector embeddings together with the properties that we need.

Next we set up a data frame with our embeddings, together with the labels, which will act as a series. This line is what does the conversion from 1408 dimensions down to two dimensions. And don't worry, this actually takes a little while, up to half a minute or so, but after that we should be good to go. Now that we have all the embeddings calculated and reduced to two dimensions, we can plot them. This is the function that performs the plotting, and if we run it, we get a nice vector space. Something that I haven't mentioned earlier: the dataset that we pre-loaded for this exercise actually came from ten different categories, so you can see how, like we said, similar vector embeddings are always stored very close to each other; you can see ten groups of vector embeddings that cluster together. We can also use the UMAP library to plot an interactive map to see all our vector embeddings. Here is the interactive plot, and in case you haven't spotted them, on the right-hand side there are special buttons that let you perform different functions. If I choose this one, I can, for example, highlight an area and zoom in, and you can see the different types of vectors that are stored together. Go have fun and explore whatever we have in our vector embedding space.
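For reference, the whole visualization step, from pulling vectors out of Weaviate to the 2-D scatter plot, looks roughly like this minimal sketch; it assumes the umap-learn, pandas, and matplotlib packages, and the collection name "Resources" is a placeholder for the pre-loaded collection used in the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd
import umap

# Placeholder name for the larger, pre-loaded collection of ~14,000 objects.
resources = client.collections.get("Resources")

embeddings, labels = [], []
for item in resources.iterator(include_vector=True):
    # With a single (unnamed) vector, the client exposes it under "default".
    embeddings.append(item.vector["default"])
    labels.append(item.properties["mediaType"])

emb_df = pd.DataFrame(embeddings)
labels = pd.Series(labels)

# Reduce the 1408-dimensional embeddings down to 2 dimensions for plotting.
mapper = umap.UMAP().fit(emb_df)
projection = mapper.embedding_

plt.figure(figsize=(8, 8))
for media_type in labels.unique():
    mask = (labels == media_type).to_numpy()
    plt.scatter(projection[mask, 0], projection[mask, 1],
                s=2, alpha=0.5, label=media_type)
plt.legend()
plt.title("Image and video embeddings in a shared vector space")
plt.show()
```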
And then there's one final step, something that you always need to remember when you're done with this Weaviate instance: we have to close it, so that you can open it from another notebook.

In this lesson, you learned how you can use a vector database with multimodal models, how you can vectorize data and store metadata together with the vector embeddings, and then run text, image, and video search across all the modalities. We also ran a nice test and plotted 14,000 different vector embeddings to show how similar vectors are grouped together, even if they come from different modalities. In the next lesson, you'll learn about large multimodal models, how they work, and how they get trained. So I will see you there.