This last lesson is an intro to Haystack's core abstractions. You'll learn what components are, how they work, and how they can be combined into pipelines, along with some of their use cases, as well as what document stores are and how they can be accessed by a pipeline. We'll start by creating a simple indexing pipeline, followed by a document search pipeline. Let's dive in.

AI applications are often made up of multiple steps that work together to achieve a goal. Take retrieval-augmented generation, for example. This is usually made up of two steps: the retrieval step, that is, looking at a database and extracting the most relevant documents from it, and then the generation step, which uses the context from these documents to generate a response. But you may sometimes have to switch things up. For example, you may have to add a ranking step between retrieval and generation. In a full AI application, there are smaller tasks that are combined into a larger use case. In Haystack, all of these small tasks are achieved by components. A component is then combined with other components to form a pipeline. These pipelines are the entities that achieve the application we want to build. A pipeline also has access to some databases; in Haystack we call these document stores. The pipeline has access to these document stores through some of its components. For example, you may store data in Weaviate, Qdrant, or MongoDB and have your pipeline access those documents through some components.

Pipelines are composed by connecting these components in a way that achieves an AI application. A component can expect any number of inputs, and it can also produce any number of outputs. For example, the SentenceTransformersDocumentEmbedder: this component expects a list of documents and returns the same list of documents enriched with embeddings, along with some metadata. It uses Sentence Transformers embedding models to create these embeddings. For example, let's start with the sentence "Is Taylor Swift queen of the world?" We may have one component, say an embedder, that produces an embedding for this query. We may have another component that expects an embedding, or a vector, and then produces documents; let's call this a retriever. If we then combine these two components, we've achieved a pretty accurate document search pipeline.

Haystack has many ready-made components: generators that access different model providers, embedders that do the same, retrievers that are able to retrieve from many databases, converters, and the list goes on. You may have rankers, routers, preprocessors, and so on. We build pipelines to create applications like question answering, document search, chat, question generation, output validation, and the list can go on and on. We build these pipelines with Haystack. Haystack pipelines can also branch out, meaning we can have pipelines with a decision-making component, and pipelines can get quite complex. Haystack pipelines can also loop, so a pipeline keeps looping until a certain condition is met, and so on. But the most important thing is, if Haystack doesn't have a component that you need to build your application, you can go ahead and build your own and put it into the pipeline where it makes sense.

All right, let's see all this in code and start using Haystack components, pipelines, and document stores. You'll first start by adding this line to suppress any warnings that we don't need.
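The transcript refers to "this line" without showing it; a minimal sketch of what a warning-suppressing line like this typically looks like in Python is shown below (the exact filter used in the lab notebook may differ):

```python
import warnings

# Silence warnings that aren't relevant to this lab.
warnings.filterwarnings("ignore")
```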
Next, we're going to use a helper function that will allow us to import all of the environment variables that we're going to need for this lab, such as OpenAI API keys, and so on.

Now let's start creating our first component and using it. The first component that you'll use is the OpenAIDocumentEmbedder. This component uses OpenAI embedding models to create embeddings for documents. First, you'll import the document embedder, and then you'll initialize it. In this case, you'll be using text-embedding-3-small as your embedding model. You can also inspect what kind of inputs and outputs this component expects. For example, we can see that this component expects a list of documents as input, and that it produces documents and meta as output. We already know that this is an embedding component, so the documents it produces will also have embeddings.

Now let's see how we would run this component. For this, we've created some dummy documents: we have a list of two documents, and each of these documents mentions things about Haystack and how you can build AI applications with Haystack. Because we know the embedder component expects documents as input, we can run it with these documents. When we run this component, you'll notice that it has produced a list of documents as output. The meta information also tells us what kind of model was used to create these embeddings, and the vectors are of size 1536.

Now that you know how to use a component on its own, let's see how we would use it in a pipeline. First, we're going to start by initializing a document store, and then we're going to build a pipeline that writes documents, alongside their embeddings, into that document store. For now, we're going to be using the InMemoryDocumentStore. This is the simplest document store you can use in Haystack, and it has no requirements to use it. But if you like, you can switch this to any document store, such as Qdrant, Weaviate, Pinecone, Chroma, and the list goes on.

Now we have an in-memory document store. Let's write our first .txt file into this document store. We're going to start by importing all of the components you're going to be using for the first indexing pipeline. You're going to be using the OpenAIDocumentEmbedder again, but you're also going to be using some preprocessors and converters as well. For this demo, we're going to be using all of the default values for these components, but you can also switch this up. We'll start with a converter, because we're going to have a .txt file about Da Vinci that we're going to be writing into our in-memory document store. Next, we're going to be using the DocumentSplitter. The DocumentSplitter is a component that chunks up your documents. By default, it splits by 200 words, and we're going to be using this default setup. However, if you like, you can change this: for example, you can set split_by and decide to split by passage instead, which would split your document into chunks of 200 passages (paragraphs) instead. But let's use the default values. Next, you're going to be using the embedder; in this case, the OpenAIDocumentEmbedder. The final component you're going to be using is a DocumentWriter. You already have a document store, so here you'll be telling the DocumentWriter that it should write into your in-memory document store. Now that you have all of your components, the next thing you have to do is create a pipeline and add these components to that pipeline.
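As a rough guide, the steps just described might look like the sketch below. This is a minimal sketch assuming Haystack 2.x import paths and that an OPENAI_API_KEY is already set in the environment; the dummy document contents are illustrative, not the exact ones used in the lab.

```python
from haystack import Document
from haystack.components.converters import TextFileToDocument
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Run a component on its own: embed two dummy documents.
embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
dummy_docs = [
    Document(content="Haystack is an open source AI framework."),
    Document(content="You can build full AI applications by combining Haystack components."),
]
result = embedder.run(documents=dummy_docs)
print(result["meta"])                          # which model produced the embeddings
print(len(result["documents"][0].embedding))   # 1536-dimensional vectors

# Components for the indexing pipeline, using default settings throughout.
document_store = InMemoryDocumentStore()
converter = TextFileToDocument()
splitter = DocumentSplitter()   # defaults to split_by="word", split_length=200
doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
writer = DocumentWriter(document_store=document_store)
```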
You do this by initializing your pipeline, and then adding each component. The important thing to note here is that for every component you'll be providing a name. You can name your component anything you want, but then you have to make sure you keep using that name.

Now that you've added the components to your pipeline, the pipeline has access to the components, but it doesn't actually know how these components interact with each other yet. For this, Haystack uses component connections. For example, you'll start by connecting the converter to the splitter. This is basically telling the pipeline that the output of the converter should be handed over to the splitter. Let's connect all of our other components as well. When you run this, you'll also get to see what kind of connections you've created in your pipeline. Once you've connected all of your components in the pipeline, you'll be able to observe what components your pipeline has access to and exactly what the connections between these components are. So we're basically seeing that the converter's documents output, specifically, is being fed into the splitter's documents input. You'll remember from before that the embedder was expecting documents as input, so it's being given documents by the splitter. But it's also producing documents, only this time they have embeddings. So the documents output of the embedder is now being given to the writer's documents input.

Another utility that Haystack provides is a way to visualize these pipelines as well. You simply have to call .show() on your pipeline, and you'll get a graph of exactly what your pipeline looks like, including all of the connections. In this case, we can see that our pipeline starts with a converter, and the input that it's expecting is sources.

Now that you have your pipeline and you know that the connections are accurate, you can try to run it. You already saw that the first component in the pipeline was the converter component, and that it expects a list of sources. For this lab, we have a .txt file about Da Vinci which we'll be using to index into the in-memory document store. So we call run on our indexing pipeline, and we tell the run method that the converter is expecting sources as input, providing the location of our .txt file. We run this pipeline, and you'll see that it's calculating the embeddings with the default embedding model we have with the OpenAIDocumentEmbedder. It also lets us know that it has written 47 documents into our in-memory document store. To check, you can also inspect your document store as it is now. For example, we can filter our document store with no filter and simply inspect the contents of the document at index five. And there we have it.

Now that you've created your in-memory document store with documents about Da Vinci as well as their embeddings, you can create your first document search pipeline. We'll first import all of the components we're going to be using, and then we start creating the components that we want to use. The important thing here is that because we used the OpenAIDocumentEmbedder with the default embedding model, we know we're going to have to use the same model to embed the incoming query from the user for this use case. So we're going to be using the OpenAITextEmbedder, and we're calling it query_embedder. Next, we need a retriever. We use the in-memory document store.
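Put together, the indexing pipeline described above might look something like this. It is a minimal sketch: the component names ("converter", "splitter", and so on) and the "davinci.txt" path are illustrative choices, not necessarily the ones used in the lab, and it reuses the components initialized in the previous sketch.

```python
from haystack import Pipeline

indexing_pipeline = Pipeline()

# Add each component under a name of your choosing...
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("embedder", doc_embedder)
indexing_pipeline.add_component("writer", writer)

# ...then tell the pipeline how their outputs and inputs connect.
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# indexing_pipeline.show()  # renders a graph of the pipeline in a notebook

# The converter is the entry point, so run() receives its "sources" input.
indexing_pipeline.run({"converter": {"sources": ["davinci.txt"]}})

# Inspect the store: filter with no filter and look at one of the documents.
print(document_store.filter_documents()[5].content)
```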
So for this case, we're going to be using the InMemoryEmbeddingRetriever, telling it to retrieve documents from our document store. Next, we initialize our pipeline; we call this pipeline document_search, and we simply add the components we have to that pipeline. Finally, as we did before, we're going to be connecting our components. I'll start with a connection that's actually going to be incorrect, or where it's not very clear exactly what the connection is. If you run this, you'll notice that you get an error. These types of errors are actually quite useful. This one tells us that there's a pipeline connection error, because the pipeline doesn't know exactly how the query embedder should be connected to the retriever: there are multiple ways that these two components could connect. It also gives us some recommendations of what these connections could look like, and lets us know what the outputs and inputs of both components are. To resolve this, we simply have to let the pipeline know that, specifically, the embedding output of the query embedder should be given to the query_embedding input of the retriever. And there we are: we now have a document search pipeline.

Again, you can use the show utility to make sure that you've created the pipeline that you expect to see. You can see that the first component, the query embedder, is expecting text; this is going to be the question, for example, that the user is asking. This component is then going to output an embedding to the retriever, which is then going to return the most relevant documents.

Let's run our document search pipeline. We'll start with the question "How old was Da Vinci when he died?" We then run our pipeline and assign the output to results. We can then write a for loop so that we can see exactly how many documents were retrieved, and also print out the contents of these documents. As you can see, there are quite a lot of documents here: because we ran our pipeline with default values, we got ten of the most relevant documents. Another thing you can do is run this pipeline again, but modify the inputs to various components at runtime. For example, instead of asking for ten of the most relevant documents, we can switch this to three. The only thing we need to do is modify the top_k input of the retriever. As you can see, we're now adding inputs for the retriever component and setting top_k to three. And now, instead of ten, we have three of the most relevant documents. A sketch of this whole document search pipeline, including the runtime top_k override, appears at the end of this lesson.

All right. In this lab you learned about components, how you can run components individually, and also how you can combine components into full pipelines. You've built a pipeline that indexes documents into your in-memory document store, as well as a document search pipeline. You can try playing around with how you split your documents: instead of chunks of 200 words, for example, you can split by different lengths. You can also try modifying the question you're asking your document search pipeline, as well as the top_k for the retriever. In the next lab, you're going to be using this knowledge to build your first retrieval-augmented generation pipeline, and you're also going to be customizing the behavior of these pipelines. All right, see you there.
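For reference, here is the promised sketch of the document search pipeline, including the explicit connection between the embedder's embedding output and the retriever's query_embedding input, and the runtime top_k override. It assumes the same document_store populated above; component names are illustrative rather than the lab's exact ones.

```python
from haystack import Pipeline
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

# The query must be embedded with the same model used at indexing time.
query_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

document_search = Pipeline()
document_search.add_component("query_embedder", query_embedder)
document_search.add_component("retriever", retriever)

# Be explicit: the embedder's "embedding" output feeds the retriever's
# "query_embedding" input; otherwise the connection is ambiguous.
document_search.connect("query_embedder.embedding", "retriever.query_embedding")

question = "How old was Da Vinci when he died?"
results = document_search.run({
    "query_embedder": {"text": question},
    "retriever": {"top_k": 3},  # override the default of 10 at runtime
})

# Print the retrieved documents and their positions.
for i, doc in enumerate(results["retriever"]["documents"]):
    print(i, doc.content)
```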