In this lesson, you'll learn the concept of multimodal retrieval augmented generation, combining multimodal search with vision language models. Then you'll implement the full multimodal RAG process using Weaviate and a large multimodal model. All right, let's get coding.

Even with all their objective knowledge, the problem with large language models is that they have no information about data that wasn't presented to them during training. So if you prompt them for information they don't have, at best they'll say that they don't know, or, what is even more likely, they'll just hallucinate and make up an answer, which is probably even worse.

A potential solution to this is retrieval augmented generation, or RAG. Here, instead of just providing the language model with a question in the form of a prompt, you give it the question along with retrieved relevant information. Now the model can perform a retrieve-then-generate operation, where it reads the relevant information before it has to answer your question, and the output is customized to the information you provide it.

Typically, if you want to scale up your RAG solution, you put all your documents into a vector database like Weaviate. Then you retrieve the most relevant documents from the vector database using the prompt, and pass those documents into the context window of the large language model together with the prompt. This way, you help your language model generate a response based on the provided context.

But as you've already seen, a vector database is capable of retrieving a lot more than just text. So let's take advantage of a multimodal knowledge base with Weaviate to store and search through images, video, and text. In the diagram here, you can see retrieval of an image from our multimodal vector database. Passing that image along with a text instruction to a large multimodal model, you get a response that is grounded in a multimodal understanding of the world. This process is known as multimodal retrieval augmented generation, because you augment the generation with retrieval of multimodal data.

Let's now see all of this in practice. In this lab, you'll use images and text as input, get LLMs to reason over them, and complete the full RAG workflow. Let's get to it.

Like in the previous lessons, let's start with a little command that ignores all the unnecessary warnings, and we're good to go. Next we load the necessary API keys. We're going to use a combination of the two keys from the previous lessons: the embedding API key and the key we used for our vision model.

Now that you have the required API keys, it's time to connect to the Weaviate instance. This time we're going to use a special backup system, because we've created a dataset with around 13,000 images that are already pre-vectorized, so you can import them really fast without having to wait for vectorization. To restore those images, we run this little command, where we specify the backup we want to restore from. The most important thing is that the collection the dataset gets loaded into is called Resources. We can execute this, it takes around five to ten seconds, and we should be ready to go.

Now we can very quickly preview the number of objects we have in the collection. We get the collection and run an aggregate function that counts all the objects inside, grouping them by media type, and then print the count for each group.
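To make the setup concrete, here is a minimal sketch of what the connection, the backup restore, and the count preview might look like, assuming the Weaviate Python client v4 with an embedded instance. The header name, the environment variable names, the backup id, and the backend are all assumptions, since the lesson's own helper code isn't shown in the transcript.

```python
import os
import weaviate
from weaviate.classes.aggregate import GroupByAggregate
from weaviate.backup.backup import BackupStorage

# Connect to an embedded Weaviate instance. The header name for the
# embedding API key depends on the vectorizer module you configured;
# the one shown here is a placeholder.
client = weaviate.connect_to_embedded(
    headers={"X-Embedding-Api-Key": os.environ["EMBEDDING_API_KEY"]},
)

# Restore the pre-vectorized dataset into the "Resources" collection.
# The backup id and backend are assumptions about how the backup was created.
client.backup.restore(
    backup_id="resources",
    backend=BackupStorage.FILESYSTEM,
    wait_for_completion=True,
)

# Preview how many objects we have, grouped by media type.
resources = client.collections.get("Resources")
response = resources.aggregate.over_all(
    group_by=GroupByAggregate(prop="mediaType")
)
for group in response.groups:
    print(f"{group.grouped_by.value} count: {group.total_count}")
```

Restoring from a backup is much faster than re-importing and re-vectorizing thousands of images, which is why the lesson ships the data this way.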
Running this, we can see that we have over 13,000 images and around 200 videos. We won't use the videos in this lesson, but you can try querying them later if you want.

And now we're getting to the fun part: running the full multimodal RAG in two steps. The first step is to send a query and retrieve content from the database based on that query. We'll write it as a function called retrieve_image: given a query, we want to get an image back. First we grab our Resources collection. Then we call a near_text query: along with the query passed into the function, we also provide a filter, because in this case we only want to get images, which we'll pass into the vision model later. We're only interested in the path to the image, and we'll return just one object. Once we get the result back, we grab the first object, take its properties, and return just the path to the image that the near_text query found. So in summary, if we run this with a query, we should get back the path to an image that matched the query.

Now you can test the retrieve_image function. How about trying a query like "fishing with my buddies"? If you run this, you should get something like this: an image of a man holding a fish, where the dog was probably recognized as the buddy in the picture. Feel free to play with different types of queries. Just a little caveat: what you get back depends on what's already in the dataset, so if you search for something that isn't in there, or isn't exactly represented, you might get some surprising results. Don't get discouraged by this; that's part of the game.

With the retrieval part done, we can move on to the generative part. Here you're going to follow the same steps as in the previous lesson: set up the API key for the generative model, set up the helper function that converts the output to markdown, and set up the call_LLM function which, given an image path and a prompt, can generate a nice description of the image. Finally, to complete the loop, you call the call_LLM function with the image path from the retrieval step, step one, and ask it for a description. Executing this takes a few seconds, and you should get back a description of the image of a man holding a fish with a dog next to him. The description I got talks about a man with a green hat and a khaki vest holding a large fish in his hand, and it even mentions the dog standing next to the man. You'll probably get a different description than I did, but that's part of how LLMs generate responses, token by token.
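Based on the description above, the retrieve_image helper might look roughly like this. It assumes the connected client from the setup step and the weaviate-client v4 query API, and it assumes each object stores a mediaType and a path property, as the transcript suggests; treat it as a sketch rather than the exact notebook code.

```python
from weaviate.classes.query import Filter

def retrieve_image(query: str) -> str:
    """Return the path of the single best-matching image for a text query."""
    resources = client.collections.get("Resources")

    response = resources.query.near_text(
        query=query,
        filters=Filter.by_property("mediaType").equal("image"),  # images only
        return_properties=["path"],  # we only need the path to the image file
        limit=1,                     # just the top match
    )

    # Grab the first (and only) object and return its path property.
    result = response.objects[0].properties
    return result["path"]

# Example query from the lesson:
print(retrieve_image("fishing with my buddies"))
```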
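And here is one way the generative side and the final combination could look. The transcript doesn't name the vision model or show the call_LLM helper, so this sketch assumes Google's google-generativeai package and a Gemini vision model purely as an illustration; the model name, prompt text, and VISION_API_KEY variable are assumptions, and it reuses the retrieve_image helper from the retrieval step.

```python
import os
import google.generativeai as genai
from PIL import Image

# Configure the generative model with the vision-model API key loaded earlier.
genai.configure(api_key=os.environ["VISION_API_KEY"])

def call_LLM(image_path: str, prompt: str) -> str:
    """Send an image plus a text prompt to a multimodal model, return its answer."""
    model = genai.GenerativeModel("gemini-pro-vision")  # assumed model name
    response = model.generate_content([prompt, Image.open(image_path)])
    return response.text

def mm_rag(query: str) -> str:
    # Step 1 - retrieve: find the most relevant image for the query.
    source_image = retrieve_image(query)
    # Step 2 - generate: ask the vision model to describe what it retrieved.
    return call_LLM(source_image, "Please describe this image in detail.")

# Example query from the lesson:
print(mm_rag("paragliding through the mountains"))
```

The two steps are deliberately independent, so you can swap out the vector database query or the vision model without touching the other half.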
And now you can combine all of it together. Let's create an MM_RAG function where the first step is to call the retrieve_image function and store the output in a source_image variable, and the second step is to call call_LLM with the source image from the previous step and the prompt, which returns the description. Let's execute this.

Finally, you can call the full MM_RAG function. You can search for something like "paragliding through the mountains", and it should both grab an image and, at the end, provide a description of that image, just like this. And voila, that worked pretty nicely: we have a nice picture of someone paragliding, and a description of someone paragliding over lush green mountains. I'm sure you'll get a very similar result. And just like this, you were able to combine two different parts from lessons two and three, the retrieval part and the generative part, into a multimodal RAG function. And now we close the Weaviate instance. Cool.

So in this lesson, you learned how to combine retrieval together with generative models. Even though these were two completely different models, you were able to build something that combines them into one big piece of functionality, which gives you a lot of power. In the next lesson, you'll learn how to take this to industry applications and try it on many different real-life use cases. See you there.