In this lesson, you'll go over some considerations for using your chain as a live web API, like streaming responses and personalized chat sessions. Away we go!

You'll pick up where you left off in the last lesson by loading and splitting our CS229 transcript into a vector store. For brevity, the previous steps have been factored out into a helper method: we perform the same process of loading, splitting, and initializing our vector store with those documents, then convert the vector store into a retriever for use with our expression language chain. As a reminder, this helper function uses OpenAI embeddings internally.

Next, let's load the pieces of our conversational chain. Again, the document retrieval and rephrase chains have been factored out into their own helper functions to cut down on boilerplate. So let's load our document chain, which wraps our retriever, and our rephrase question chain, which takes follow-up questions and rephrases them as standalone questions free of references. And if you recall, the final sub-component of your conversational retrieval chain was the answer synthesis chain, where we take all of the previous steps and all the information we've gathered and turn it into a final output. We're going to change something very small about this, so we'll construct it from scratch using the same prompt template: "You are an experienced researcher, expert in interpreting and answering questions based on provided sources." The prompt template has a placeholder for chat messages, a slot for the standalone question from the rephrase step, and our document context from the retriever.

Before you assemble all these pieces together, note that the native web Response objects used to return data in popular frameworks like Next.js accept a readable stream that emits bytes, rather than a readable stream that emits strings. Previously our chain output string chunks using the string output parser, but it would be convenient to stream bytes directly so we could pass our LangChain stream straight into the response from our server. Fortunately, LangChain provides an HttpResponseOutputParser that parses output from chat models into chunks of bytes matching a variety of content types.

To use it, let's construct our conversational retrieval chain as before, but skip that final string output parser step. We'll import our modules and create a runnable sequence exactly like the one in the previous lesson, minus that last step of parsing the chat output into a string. Then we'll create the HttpResponseOutputParser and our final retrieval chain that takes message history into account. With the HttpResponseOutputParser, we pipe the final output of our message-history-wrapped conversational retrieval chain through this new parser, which will stream back bytes for our server. The reason we pipe into the output parser at the very end, rather than in the middle of our conversational chain, is that our history manager requires the final output to be either a string or a chat message, not the bytes this parser converts our stream into.
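For reference, here's a minimal sketch of what this assembly might look like. The names rephraseQuestionChain and documentRetrievalChain stand in for the factored-out helpers mentioned above, and the exact prompt text, model name, and import paths are illustrative and may differ slightly depending on your LangChain.js version.

```ts
import { ChatOpenAI } from "@langchain/openai";
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import {
  RunnablePassthrough,
  RunnableSequence,
} from "@langchain/core/runnables";
import { HttpResponseOutputParser } from "langchain/output_parsers";

const ANSWER_SYSTEM_TEMPLATE = `You are an experienced researcher,
expert in interpreting and answering questions based on provided sources.
Using the provided context and chat history, answer the user's question.

<context>
{context}
</context>`;

const answerGenerationPrompt = ChatPromptTemplate.fromMessages([
  ["system", ANSWER_SYSTEM_TEMPLATE],
  // Placeholder for prior chat messages
  new MessagesPlaceholder("history"),
  // The standalone question produced by the rephrase step
  ["human", "{standalone_question}"],
]);

// Same runnable sequence as the previous lesson, but without the final
// StringOutputParser: we keep the raw chat model output so it can be
// converted to bytes later.
const conversationalRetrievalChain = RunnableSequence.from([
  RunnablePassthrough.assign({
    standalone_question: rephraseQuestionChain, // factored-out helper
  }),
  RunnablePassthrough.assign({
    context: documentRetrievalChain, // factored-out helper wrapping the retriever
  }),
  answerGenerationPrompt,
  new ChatOpenAI({ modelName: "gpt-3.5-turbo" }),
]);

// Parses chat model output into chunks of bytes suitable for a web Response.
const httpResponseOutputParser = new HttpResponseOutputParser();
```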
The other thing to consider is that, because we're running this chain in a web environment where many users could be hitting our endpoint and streaming responses at the same time, we don't want to reuse a single message history like we did in the demo in the previous lesson. Instead, we'll create a new message history per session ID so that messages from different users never get mixed up. To do this, we'll define and override the getMessageHistory function with one that takes previous sessions into account: if a message history already exists for the given session ID, we return it, and if not, we create a new one, signifying a new user whose conversation shouldn't include messages from anyone else. Then we'll recreate our final chain with this new method, the same wrapped runnable as before, just with the different getMessageHistory function (see the first sketch below). Wonderful.

Now let's set up a simple server with a handler that calls our chain and see if we can get a streaming response. We'll pick a port, 8087, and create a little handler that wraps and calls our chain in streaming fashion: it parses the question and session ID from the HTTP request body, creates a stream using the .stream() method from expression language on our chain, and passes that directly to a Response so the server streams the result back. In a true production deployment you'd likely want authentication and input validation via some middleware, but we'll skip that for simplicity. Now let's start the server. Since we're running in Deno, we'll use Deno's built-in HTTP server, though the general concept is shared by many JS frameworks: we listen on our port and run the handler whenever a request comes in (see the server sketch below). Awesome, we're now live on localhost:8087.

A few housekeeping items before we get too far in. Let's write a couple of helper functions that make handling streaming responses from our server a bit easier: one lets us read chunks in async iterator fashion, repeatedly reading from a passed-in reader until there are no more chunks, and a sleep function ensures our notebook doesn't shut down execution before the request finishes.

Now we're ready to call our endpoint. We'll use the web-native fetch function to make an HTTP request to the server we just spun up: await fetch against our port with the method set to POST. If you're used to dealing with web requests and responses, this will look familiar, but we need to specify the type of body we're passing: because our handler expects JSON, we set a content type header of application/json. The body itself is a question parameter plus a session ID parameter, which in production we wouldn't hard-code like this but would instead assign based on the user accessing our endpoint, and we stringify it to match web standards (see the client sketch below).
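A sketch of the per-session history setup might look like the following. The map and helper names are illustrative, and import paths may vary by LangChain.js version.

```ts
import { RunnableWithMessageHistory } from "@langchain/core/runnables";
import { ChatMessageHistory } from "langchain/stores/message/in_memory";

// One in-memory history per session ID, so different users' messages
// never get mixed together.
const messageHistories: Record<string, ChatMessageHistory> = {};

const getMessageHistoryForSession = (sessionId: string) => {
  // Reuse an existing history for a returning session...
  if (messageHistories[sessionId] !== undefined) {
    return messageHistories[sessionId];
  }
  // ...and start a fresh one for a new session.
  const newChatSessionHistory = new ChatMessageHistory();
  messageHistories[sessionId] = newChatSessionHistory;
  return newChatSessionHistory;
};

// The same wrapped runnable as before, just with the per-session
// getMessageHistory, piped into the byte-emitting output parser last.
const finalRetrievalChain = new RunnableWithMessageHistory({
  runnable: conversationalRetrievalChain,
  getMessageHistory: getMessageHistoryForSession,
  inputMessagesKey: "question",
  historyMessagesKey: "history",
}).pipe(httpResponseOutputParser);
```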
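The server handler itself, assuming the Deno runtime used in the lesson and the request body shape we send from the client below, might look roughly like this:

```ts
const port = 8087;

const handler = async (request: Request): Promise<Response> => {
  // Parse the question and session ID from the JSON request body.
  const body = await request.json();
  // .stream() returns a ReadableStream of bytes thanks to the
  // HttpResponseOutputParser at the end of the chain.
  const stream = await finalRetrievalChain.stream(
    { question: body.question },
    { configurable: { sessionId: body.session_id } },
  );
  // Pass the LangChain stream directly to the web Response.
  return new Response(stream, {
    status: 200,
    headers: { "Content-Type": "text/plain" },
  });
};

// Deno's built-in HTTP server; other JS frameworks have close equivalents.
Deno.serve({ port }, handler);
```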
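On the client side, the helpers and the fetch call might be sketched like this. The hard-coded session ID and the five-second sleep are just demo conveniences for the notebook environment.

```ts
// Read from a stream reader until there are no more chunks.
async function* readChunks(reader: ReadableStreamDefaultReader<Uint8Array>) {
  let readResult = await reader.read();
  while (!readResult.done) {
    yield readResult.value;
    readResult = await reader.read();
  }
}

// Keep the notebook cell alive until the request has fully finished.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const response = await fetch(`http://localhost:${port}`, {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    question: "What are the prerequisites for this course?",
    // Hard-coded for the demo; in production, derive this from the
    // authenticated user rather than trusting the request body.
    session_id: "1",
  }),
});

// Consume the byte stream chunk by chunk and log each one as it arrives.
const reader = response.body!.getReader();
const decoder = new TextDecoder();
for await (const chunk of readChunks(reader)) {
  console.log("CHUNK:", decoder.decode(chunk));
}

await sleep(5000);
```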
Now let's add some code to consume the streaming response. We get a reader from the readable stream returned by the server, then use the async iterator syntax to get each chunk and log it. Finally, the housekeeping bit: because we're in a Jupyter notebook, we sleep to make sure the request finishes before the notebook cuts off cell execution.

Now we're ready to call the endpoint. Let's do it. Because we use a similar question here, "What are the prerequisites for this course?", we get a similar answer: the prerequisites encompass a few fundamental areas of knowledge, including a solid grasp of basic probability and statistics, which you might recall were in the source material, along with some other reasonable-looking output.

This is great, but now let's test the memory by asking a follow-up question. We use the same fetch pattern as before, awaiting fetch with the port plugged in and the same method, headers, and body, just with a slightly different follow-up question. Instead of asking directly what the prerequisites are, we'll ask, "Can you list them in bullet point format?" We'll use the same code as before to read the readable stream via the async iterator and log the individual chunks, then do the housekeeping step of sleeping to make sure the request finishes before the notebook shuts things down. Let's run it. Sweet. Now we see a nice conversational follow-up where it restates the prerequisites for the course in bullet point format, each item prefixed by a dash: familiarity with basic probability and statistics, assumed knowledge of concepts like random variables, and so on. This is a key step in making client apps more responsive, because you can start showing these chunks in the front end before the entire response finishes.

Let's try again with a different session ID to make sure we don't get cross-contaminated messages. Again, it's a very similar fetch request to localhost on our port, but instead of our previous session ID in the body we pass a new one, and we ask, "What did I just ask you?" In this case, we should not expect any reference to our previous chat history under the first session ID. Let's see if that works: "Based on the provided context and chat history, it seems you did not ask a specific question as explicitly mentioned in one of the texts." Awesome. We didn't pull in anything from the previous conversation that is unique to the first user; instead, we get a wholly new chat history for this session.

That's the end of the lesson, but I'd encourage you to play around with what we've built by asking different questions, trying out different session IDs, and getting a sense and comfort level for how you can apply this lesson in your own projects.