In this lesson, you'll use what you've learned so far to start building a knowledge graph of SEC filings. Companies are required to file many financial reports with the SEC each year. An important one is Form 10-K, an annual report of the company's activities. These forms are public records and can be accessed through the SEC's EDGAR database. Let's take a look at one.

So here we are at sec.gov, in the EDGAR database of all these financial forms. We can filter down to just the annual reports. As you can see, there are tons of sections to look at: lots and lots of text, lots of interesting information in these forms. Industry trends, a technology overview. This is great. This is the kind of data that we'll pull into the knowledge graph.

These forms are available for download, and when you download them, they actually come as XML files. So before you can really start importing, some cleanup work is needed. From the XML, we extracted key fields like an identifier called a CIK, the central index key, which is how companies are identified within the SEC. And for the big text sections, we looked at Items 1, 1A, 7, and 7A. Those are the large bodies of text that we're going to chat with. If you take a look in the data directory of this notebook, you'll see some of the files that resulted from all this cleanup. After doing that work, we turned the forms into JSON so that it's easy to import them and start creating the knowledge graph.

Okay, we're almost ready to get back to the notebook, but before we do, let's think about what our plan of attack is going to be. We saw that each of the forms has different sections of text, which we're going to split up into chunks; we'll use LangChain for that. Once we have all of those chunks, each chunk will become a node in the graph. The node will have the original text plus some metadata as properties. Once that's in place, we'll create a vector index. Then we'll calculate a text embedding for each chunk's text to populate the index. Finally, with all that done, we'll be able to do similarity search. Let's head back over to the notebook and get started with the work.

To start, we'll load some useful Python packages, including some great stuff from LangChain. We'll also load some global variables from the environment and set some constants that we'll use later on during the knowledge graph creation.

In this lesson, you'll be working with a single 10-K document. In practice you may have hundreds or thousands of documents, and the steps you take here would need to be repeated for all of them. Let's start by setting the file name and loading just the one JSON file. With the first file loaded, we can take a look at it to make sure it looks like a proper dictionary. Okay, the type is dict in Python. That's perfect. Let's take a look at the keys that are available. I'm just going to copy over a for loop that goes through the dictionary, printing out each key and the type of its value. You can see these are the familiar fields from the Form 10-K, the different sections called Item 1, Item 1A, and so on.

Now let's grab Item 1 from the object to see what the text is like. Because I know there's a lot of text there, we're just going to look at a little bit of it. The text being this big is the whole reason for chunking.
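As a minimal sketch of those loading and inspection steps (the file path and variable names here are illustrative, not necessarily the notebook's exact ones):

```python
import json

# Illustrative path: one cleaned-up 10-K form in the notebook's data directory
first_file_name = "./data/form10k/0000950170-23-027948.json"

with open(first_file_name) as f:
    first_file_as_object = json.load(f)

print(type(first_file_as_object))  # <class 'dict'>

# Print each key and the type of its value
for k, v in first_file_as_object.items():
    print(k, type(v))

# Peek at just the beginning of Item 1, since the full text is very long
item1_text = first_file_as_object["item1"]
print(item1_text[0:1500])
```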
We're not going to take the entire text and store it in a single record. We're going to use a text splitter from LangChain to break it down. This text splitter is set up with a chunk size of 2,000 characters and an overlap of 200 characters. As before, we'll take a look at the type of the result. Okay, it's a list, and it should be a list of strings. Let's also see how long that list is. There are 254 chunks from the original text. Finally, let's look at one of the chunks to see what the text is like. That looks a lot like what we saw before, which is perfect.

Next, we'll wrap all of this up in a helper function that chunks an entire form. It's kind of a big function, so let's walk through it one step at a time. The first thing we do is set aside a list where we'll accumulate all of the chunks we create. Then we load the file and loop over the items we care about. For each item, we pull the item text out of the object and use the text splitter you just saw to chunk it up. Then, for each chunk, we build a data record with metadata: the text itself, pulled straight from the chunk; the current item we're working on; a chunk sequence ID that increments as we loop; the form ID we derived from the file name; and a chunk ID built from those pieces. All of that goes into one data record, which we append to the list. Let's take a look at the first record in the list. You can see the original text as expected, and all the metadata as well. Perfect.

You'll use a Cypher query to merge the chunks into the graph. Let's take a look at the query itself. This is a MERGE statement, so remember that a merge first does a match, and only creates a new node if nothing matches. The parameter is called chunkParam, and its keys match the property names set in the query. For passing the parameters, there's an extra argument called params, which is a dictionary of keys and values. The key chunkParam will be available inside the query as $chunkParam. We'll give it the very first chunk from the list of chunks, so that first chunk record becomes chunkParam inside the query. Fantastic. You can see that the result of running the query is a new node, and here are its contents: the metadata we set before, plus the text we've been seeing all along. This is perfect.

Before calling the helper function that creates the knowledge graph, we'll add a uniqueness constraint. Its job is to ensure that a particular property is unique for all nodes that share a common label. Let's go ahead and run that, then show all the indexes. You can see that the named index unique_chunk is there and that it's online.

Now we can run the helper that loops through the chunks, merging each one into the graph. Let's scroll down to see the end of the output. Okay, great: it created 23 nodes. Just to sanity check that we actually have 23 nodes, we'll run another query. Let's see. Fantastic.

Next comes the vector index. The index will be called form_10k_chunks and will store embeddings for nodes labeled Chunk in a property called textEmbedding. The embeddings will match the recommended configuration for OpenAI's default embeddings model. We can check that the index has been created by looking at all of the indexes. Great, we can see that form_10k_chunks is available and online, which means it's ready to go. It's a vector index, and it's on the chunks for text embeddings, just as we asked.

You can now use a single query to match all chunks and calculate and store their text embeddings. This may take a minute to run, depending on network traffic. This is exactly like what we did in the previous lesson: we ask Neo4j to do the encoding, which calls back out to OpenAI, and the resulting embedding gets stored in the node's textEmbedding property. Sketches of each of these steps, from splitting through the embedding pass, follow below.
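Putting the splitting and record-building steps together, a sketch might look like this. The function name, the per-section chunk cap, and the exact metadata field names are assumptions; in particular, the cap of 20 chunks per section is a guess that would be consistent with the 23 nodes created later:

```python
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # about 2,000 characters per chunk
    chunk_overlap=200,  # 200 characters of overlap between chunks
    length_function=len,
)

def split_form10k_data_from_file(file_name):
    """Chunk the four big text items of one 10-K file into metadata records."""
    chunks_with_metadata = []
    with open(file_name) as f:
        file_as_object = json.load(f)
    # Derive a form ID from the file name, e.g. "0000950170-23-027948"
    form_id = file_name[file_name.rindex("/") + 1:file_name.rindex(".")]
    for item in ["item1", "item1a", "item7", "item7a"]:
        item_text_chunks = text_splitter.split_text(file_as_object[item])
        # Assumption: keep only the first 20 chunks per section to keep the demo small
        for chunk_seq_id, chunk in enumerate(item_text_chunks[:20]):
            chunks_with_metadata.append({
                "text": chunk,               # the original chunk text
                "f10kItem": item,            # which 10-K section it came from
                "chunkSeqId": chunk_seq_id,  # position within that section
                "formId": form_id,
                "chunkId": f"{form_id}-{item}-chunk{chunk_seq_id:04d}",
            })
    return chunks_with_metadata

first_file_chunks = split_form10k_data_from_file(first_file_name)
print(first_file_chunks[0])  # first record: text plus metadata
```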
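The merge step might look like the sketch below. Here `kg` is assumed to be a LangChain `Neo4jGraph` connection using the globals loaded from the environment earlier, and the property names mirror the record fields above:

```python
from langchain_community.graphs import Neo4jGraph

# Assumes NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD came from the environment
kg = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)

# MERGE matches on chunkId first; ON CREATE SET only fires for a new node
merge_chunk_node_query = """
MERGE (mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
  ON CREATE SET
    mergedChunk.text = $chunkParam.text,
    mergedChunk.f10kItem = $chunkParam.f10kItem,
    mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
    mergedChunk.formId = $chunkParam.formId
RETURN mergedChunk
"""

# The params dict makes the first chunk record available as $chunkParam
kg.query(merge_chunk_node_query, params={"chunkParam": first_file_chunks[0]})
```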
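The uniqueness constraint, the merge loop, and the vector index might then look like this. The index name and the 1536-dimension cosine configuration reflect OpenAI's default embeddings model, but treat the exact names and settings as assumptions:

```python
# Ensure chunkId is unique so repeated merges can't create duplicate nodes
kg.query("""
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
""")

# Merge every chunk record into the graph (23 nodes in this lesson)
node_count = 0
for chunk in first_file_chunks:
    kg.query(merge_chunk_node_query, params={"chunkParam": chunk})
    node_count += 1
print(f"Created {node_count} nodes")

# Sanity check the count with a separate query
kg.query("MATCH (n:Chunk) RETURN count(n) AS nodeCount")

# Create the vector index over the chunks' textEmbedding property
kg.query("""
CREATE VECTOR INDEX `form_10k_chunks` IF NOT EXISTS
FOR (c:Chunk) ON (c.textEmbedding)
OPTIONS { indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}
""")

# Confirm the constraint and index are online
kg.query("SHOW INDEXES")
```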
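And the single embedding-population query might look like the following. `genai.vector.encode` and `db.create.setNodeVectorProperty` require Neo4j's GenAI integration, and the API-key plumbing here is an assumption:

```python
# Assumes OPENAI_API_KEY came from the environment at the top of the lesson
kg.query("""
MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
WITH chunk, genai.vector.encode(
      chunk.text, "OpenAI", {token: $openAiApiKey}) AS vector
CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
""", params={"openAiApiKey": OPENAI_API_KEY})
```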
And this is now what the graph looks like: we've got chunk nodes with text embeddings stored as a list, and we don't yet have any relationships.

You may recall that the form we've turned into a knowledge graph is from a company called NetApp. We picked just one form, and it happened to be this company, NetApp. You can try out our new vector search helper function to ask about NetApp and see what we get. The helper returns a list of results, and notice that we've only performed vector search. If we want to create a chatbot that provides actual answers to a question, we can build a RAG system using LangChain. Let's take a look at how you'll do that.

The easiest way to start using Neo4j with LangChain is the Neo4jVector interface. This makes Neo4j look like a vector store; under the hood, it uses the Cypher language to perform vector similarity searches. The configuration specifies a few important things, using the global variables that we set at the top of this lesson. We'll then convert the vector store into a retriever with the easy as_retriever() call. The LangChain framework comes with lots of different ways of building chat applications; if you want to find out more about the kinds of things LangChain has available, I totally encourage you to go out to their website and check it out. It's pretty great. For this chain, we'll use a question-and-answer chain, and it's going to use the retriever that we defined above, which in turn uses the Neo4j vector store. I also have a nice helper function here called pretty_chain, which accepts a question, calls the chain with that question, pulls out just the answer field, and prints it nicely formatted for the screen. Sketches of the search helper and the chain setup follow at the end of this section.

Okay, with all that work done, we can finally get to the fun stuff and ask some questions. Since we know we have NetApp here, let's go ahead and ask: what is NetApp's primary business? We'll use pretty_chain to pass that question in and immediately show the response. Okay. We can see that NetApp's primary business is enterprise storage and data management, cloud storage, and cloud operations. Notice that we have an actual answer to the question rather than just some raw text that might contain that answer. This is exactly what you want the LLM for.

Let's try another question to see what we get: can it tell where NetApp has its headquarters? Headquartered in San Jose, California. That is correct. Next, let's ask it to answer in a single sentence and see what we get. Okay, I guess that is technically a single sentence. It is a bit rambly, but good job following our instructions.

Now we're going to ask about a different company that kind of sounds similar to NetApp. There's Apple, the computer company, right? Hmm. The description of Apple seems a lot like the description of NetApp. This is classic hallucination: the retrieved context says nothing about Apple, so the model improvises. Let's try to fix this with a little bit more prompt engineering, adding the instruction: if you are unsure about the answer, say you don't know. Okay, a much better and much more honest answer from the LLM.
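For reference, a vector search helper along the lines described above might look like this sketch (the function name and parameters are illustrative):

```python
def neo4j_vector_search(question):
    """Embed the question, then query the vector index for similar chunks."""
    vector_search_query = """
      WITH genai.vector.encode(
            $question, "OpenAI", {token: $openAiApiKey}) AS question_embedding
      CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding)
      YIELD node, score
      RETURN score, node.text AS text
    """
    return kg.query(vector_search_query, params={
        "question": question,
        "openAiApiKey": OPENAI_API_KEY,
        "index_name": "form_10k_chunks",
        "top_k": 10,
    })

# Raw similarity search: returns scored chunks, not an answer
neo4j_vector_search("In a single sentence, tell me about NetApp.")
```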
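The LangChain RAG setup might look like the following sketch. The class names are real LangChain APIs, but the index and property names and the pretty_chain helper are assumptions based on the description above:

```python
import textwrap

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Neo4jVector
from langchain.chains import RetrievalQAWithSourcesChain

# Make Neo4j look like a vector store, reusing the index created earlier
neo4j_vector_store = Neo4jVector.from_existing_index(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name="form_10k_chunks",
    node_label="Chunk",
    text_node_property="text",
    embedding_node_property="textEmbedding",
)

# Convert the vector store into a retriever
retriever = neo4j_vector_store.as_retriever()

# A question-and-answer chain that stuffs retrieved chunks into the prompt
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
)

def pretty_chain(question: str) -> None:
    """Call the chain with a question and print just the answer, wrapped."""
    response = chain({"question": question}, return_only_outputs=True)
    print(textwrap.fill(response["answer"], 60))

pretty_chain("What is NetApp's primary business?")
```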
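One way to apply the prompt-engineering fix is simply to fold the instruction into the question itself; whether the lesson does it exactly this way is an assumption:

```python
# Give the model an explicit way out when the context doesn't cover the topic
pretty_chain("""
    Tell me about a company called Apple.
    If you are unsure about the answer, say you don't know.
""")
```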
Perfect: prompt engineering for the win. In this lesson, you've been using Neo4j as a vector store, but not really as a knowledge graph. Let's move on to the next lesson, where you'll add relationships to the nodes to bring more graphy power to the chat application.