Okay, now that you have your completed graph, you can get to the fun part: chatting with the SEC documents. Let's dive in and ask some questions.

You've created a knowledge graph. How did you do that? You started with what I would call the minimum viable graph, a very small graph that actually holds the data, and then you followed a pattern of extracting, enhancing, and expanding. By extracting, I mean finding interesting bits of information in the data you already have and pulling them out into separate nodes. By enhancing, I mean supercharging that data somehow, whether with a vector embedding or something else. And by expanding, I just mean connecting the new data to the graph you already have.

Think back to the beginning of the course, when you imported some Form 10-K data. You had the raw data and split it into text, the chunking step. That's exactly the pattern I'm talking about: you extracted some information from the original source data and enhanced it with an embedding. Throughout the rest of the lessons, you continued that same pattern, whether it was starting with the text nodes and creating chunks from them, creating forms from the chunks, or creating companies and then managers from new data, the Form 13 CSV. At each step you enhanced the data with something like an index, and then you expanded it, connecting each chunk to the next chunk, each chunk to the form it is part of, and each form to the company that filed it.

You can keep going. Each of these filings mentions other companies it is involved with, whether suppliers or partners. You could find those and actually link the companies together, or add external data sources, as you did with the Form 13 CSV. And finally, a step that does a lot for the user experience: add users into the graph so they can give feedback about the answers they're seeing, good or bad. You can keep track of their interactions and learn from them to improve both the quality of the graph and the quality of the experience. The more users engage, the better the experience becomes.

For Lesson 7, the data you have to work with has already been expanded a little: both the companies and the managers have address strings. That lets you do interesting new queries, because as soon as you have a separate node for these addresses, you can add something like a geospatial index, which enables distance-based queries: what's near me, or is this company near another company?
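As a point of reference, creating that kind of geospatial index looks roughly like the statement below. This is a hedged sketch: the Address label, the location point property, and the index name are assumptions based on the schema described in this lesson, and kg stands for the Neo4jGraph instance you'll create in a moment.

    # Hedged sketch: create a point index over Address.location so that
    # distance-based lookups can use it. The label, property, and index
    # names are assumptions; adjust them to match your actual graph.
    kg.query("""
      CREATE POINT INDEX address_location_index IF NOT EXISTS
      FOR (a:Address) ON (a.location)
    """)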
The resulting graph looks like this. It's what you've seen before, managers that own stock in companies, companies that filed forms, and forms that have been chunked up, except that now both the managers and the companies are also connected to addresses. With those addresses in the graph, you can ask interesting questions like which companies are near each other just by following pointers: from a company located at an address, you can hop back out to another company attached to that same address, and you can do the same thing from companies across to managers, and so forth. It's actually kind of fun to think about. So this is the schema of the graph you'll be working with in this notebook, and you'll be exploring the knowledge graph just a little bit more: first with some Cypher queries to explore the graph directly, then using LangChain to create a question-and-answer chat, and finally using the LLM to combine both of those techniques in a fun new way.

As usual, start by importing some libraries and defining some global variables used throughout the queries, then create an instance of Neo4jGraph. Now you're ready to do some exploration with Cypher.

Start with something simple: find any random manager. Match a pattern from a manager who is located at some address, return the manager and the address, and limit that to one, so you get whatever comes up first. You can see we've got KDN Capital Management here, in New York City. Perfect: we have both the manager at an address and what that address is. Notice the extra piece we've added: in addition to the city and state on the address node, there is also a property called location that stores what's called a point. That's just the latitude and longitude stored as a data value, and it's what enables geospatial search and nearness search.

That was a random manager. What if you want to find somebody specific? Somewhere in this data set there's a manager called Royal Bank, so you can do a full-text search for that name and return the manager. Great, there's the manager name, which is actually Royal Bank of Canada. Notice we also get a score. That's the score from the full-text search, which lives in a different value range than the scores you get from vector search, but the idea is the same: higher scores are a better match.

You can of course combine those two queries: find the bank with full text, then follow the located-at relationship from that manager to an address, and return both. No shock, the Royal Bank of Canada is in Canada.

From here you can start exploring what kinds of information the graph knows about. For example, which state has the most investment firms? Start with a pattern match of a manager located at an address, then return the name of the state along with a count of how many times that state appears, an aggregation aliased as the number of managers, and limit the results to 10. That gives you the top 10 states by number of managers. No surprise there: lots of management firms in New York, lots in California as well, and then a handful spread across the rest of the states.

You can ask the same kind of question about the public companies instead. It's the same idea: a pattern match from a company to an address, return the state from the address, and aggregate on it to get the number of companies. So it seems that, at least in this sample data set, most of the investment firms and most of the companies end up being in California. That's where the density is.
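Pulled together, the exploration queries described so far look roughly like the sketch below. It is hedged: the Manager and Address labels, the LOCATED_AT relationship, the managerName and state properties, and the fullTextManagerNames index name are assumptions that may differ slightly from your notebook.

    # Hedged sketch of the exploration queries described above; the label,
    # relationship, property, and index names are assumptions.

    # Any one manager and the address it is located at
    kg.query("""
      MATCH (mgr:Manager)-[:LOCATED_AT]->(addr:Address)
      RETURN mgr.managerName, addr
      LIMIT 1
    """)

    # Full-text search for a specific manager by partial name
    kg.query("""
      CALL db.index.fulltext.queryNodes("fullTextManagerNames", "royal bank")
        YIELD node, score
      RETURN node.managerName, score
      LIMIT 1
    """)

    # Which states have the most investment firms?
    kg.query("""
      MATCH (mgr:Manager)-[:LOCATED_AT]->(addr:Address)
      RETURN addr.state AS state, count(*) AS numManagers
      ORDER BY numManagers DESC
      LIMIT 10
    """)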
So let's drill down into California a little bit more. You might ask: what are the cities in California with the most investment firms? Continue with the same approach, the pattern match from a manager located at an address, but constrain the address to the state of California, return the city along with a count of how many times each city appears, and limit that to 10. There's a bit of a Northern California versus Southern California rivalry going on here: very heavy in the north, very heavy in the south, and a mix everywhere else.

Continuing the same line of questioning, that was the managers; now ask the same thing about the companies. Where in California are most of these companies located, in terms of city? This is a very different list. There are only 10 companies in this sample data set, and they're in Santa Clara, San Jose, Sunnyvale, and Cupertino. Those names didn't show up in the manager list, and looking back, it seems the public companies and the managers are in entirely different cities. That's interesting.

You'll notice there are a lot of investment firms in San Francisco, so of course you can drill down into San Francisco itself and ask: within San Francisco, what are the top management firms? It's the same pattern as before, a manager located at an address, with the condition that the city must be San Francisco. In addition to the manager name, take whatever that firm has invested in, sum the value property of those ownership relationships, call it the total investment value, and return the results in descending order, limited to 10. That's the top 10 management firms in San Francisco. Some of these names you might recognize; some are pretty obscure, at least to me, but the numbers are impressive.

Looking back at the companies, most of them in this sample set are in Santa Clara. Let's figure out who they are. This is pretty straightforward: match from a company to the address it's located at, where the address city is Santa Clara, and return the company names. Palo Alto Networks, Seagate Technology, and Atlassian Corporation.

So far you've been exploring the graph using explicit relationships. You can also find things based on their location coordinates; remember, we added a geospatial index. This gets pretty interesting. It's a lot like doing vector search, but within a two-dimensional space and using a geographic distance rather than cosine similarity. You can ask what companies are near Santa Clara, which is similar to asking which companies are in Santa Clara, except we don't want them in Santa Clara, just nearby. How do you approach that? First, match an address, call it sc, where sc.city is Santa Clara. That gets us Santa Clara. Then match companies that are located at some company address. And here's the interesting part, this line: the WHERE clause uses point.distance, a distance function built into Cypher, applied to two locations, sc.location, which is where Santa Clara is, and the company address location. We require that distance to be less than 10,000, and that's 10,000 meters; all the distances here are measured in meters. For whatever satisfies that, return the company name and the full address text stored on the company node itself.
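A hedged sketch of that nearness query, with the same assumed names as before, plus assumed companyName and companyAddress properties on the company node:

    # Hedged sketch: companies within 10 km of Santa Clara, using the
    # point.distance function (distances in meters). The label and
    # property names are assumptions.
    kg.query("""
      MATCH (sc:Address)
        WHERE sc.city = "Santa Clara"
      MATCH (com:Company)-[:LOCATED_AT]->(comAddr:Address)
        WHERE point.distance(sc.location, comAddr.location) < 10000
      RETURN com.companyName, com.companyAddress
    """)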
You can see that Palo Alto Networks, which is in Santa Clara, is of course near Santa Clara, so that's good. You also get Sunnyvale, Santa Clara appears again, and Cupertino shows up for Apple. So now you can do things two different ways: either you know exactly what you want, companies in Santa Clara, or you look for things near Santa Clara using a distance function. You could play around with this a little and go a bit further out to see what else you get. As you'd expect, the further out you go, the more management firms you find.

As you've done before, you can take these individual queries and combine them to do even more interesting things. For example, rather than investment firms near some location, you could ask for investment firms near some company you know about. You might recall seeing Palo Alto Networks, and here I've even misspelled it; that's fine. We'll run a query that takes even the misspelling, finds the company, and then finds which management firms are near that company in terms of distance. Start with the full-text query: search the full-text company names for the misspelled Palo Alto Networks, and get the node that comes out of the query plus its score. Here we just rename the node to com, because we know it represents a company. Then add a pattern match from that company to the address it's located at, and similarly from a manager to the address it's located at, and notice that these two patterns have no connection to each other. The WHERE clause ties them together: point.distance between the company address location and the manager address location, again within 10 kilometers. Return the manager, and convert the distance into kilometers. So even with the misspelling, we found Palo Alto Networks, and we found a nearby management company called Mine and Arrow Wealth Creation and Management, LLC. A sketch of this combined query follows below.
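Here is that sketch, again hedged: the full-text index name, the property names, and the example misspelling are all assumptions standing in for what is actually in the notebook.

    # Hedged sketch: full-text lookup that tolerates a misspelled company
    # name, combined with a distance filter to nearby managers. The index
    # name, property names, and the sample misspelling are assumptions.
    kg.query("""
      CALL db.index.fulltext.queryNodes("fullTextCompanyNames", "Palo Aalto Networks")
        YIELD node, score
      WITH node AS com
      MATCH (com)-[:LOCATED_AT]->(comAddr:Address),
            (mgr:Manager)-[:LOCATED_AT]->(mgrAddr:Address)
        WHERE point.distance(comAddr.location, mgrAddr.location) < 10000
      RETURN mgr.managerName,
             toInteger(point.distance(comAddr.location, mgrAddr.location) / 1000)
               AS distanceKm
    """)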
You've been reading a lot of Cypher in this lesson, and it can be a lot to take in at once. A good thing to try is pausing the video and making some small changes to the queries in the notebook: different distances, company names, and cities, to see what you get. There are lots of resources online for learning Cypher; you can, of course, visit neo4j.com to read the docs and explore the learning material there. But this is the age of generative AI, and it turns out that OpenAI's GPT-3.5 model is pretty good at writing Cypher. Let's try that next.

To ask an LLM to write Cypher, you can use a technique called few-shot learning. With few-shot learning, you provide examples in your prompt that show the LLM how to complete a particular task, and then you ask it to perform that task. Let's look at the prompt you'll use to teach the LLM about the knowledge graph and about writing Cypher. It starts simply. The task is to generate Cypher statements to query a graph database. Then come the instructions: use only the provided relationship types and properties in the schema, and do not use any other relationship types or properties that are not provided. Basically, we're asking the LLM to please follow our instructions and not go off the rails. Next, we provide the actual schema: the prompt says "Schema:" followed by a placeholder in curly braces, and the knowledge graph's schema gets passed into the prompt there when everything is packaged up.

As is standard practice when creating a prompt, include plenty of guidance for the LLM: do not include explanations or apologies, just write the Cypher; do not respond to questions that aren't about writing this Cypher statement; do not produce anything other than what was asked for. Then you provide some examples, and here we'll provide just one. The prompt says: here is an example of generated Cypher for a particular question, with a hash symbol followed by the question itself in natural language. You've seen this question before when we were writing the Cypher ourselves: what investment firms are in San Francisco? Then comes the Cypher that should be written for it: a manager located at an address, with a WHERE clause, here using the string literal San Francisco rather than a query parameter, returning just the manager names. That's enough to get the LLM going. Finally, you close off the prompt with the place where the question itself ends up. In effect: perform the task, and here is the question I'm asking about. So if somebody passes in a question like "What investment firms are in Santa Clara?" or any other city, the LLM should have learned from this one example what the generated Cypher should look like, and it should swap San Francisco for Santa Clara depending on the question asked.

To build a workflow around this, we'll use LangChain's Neo4j integration, which is pretty convenient. To start, take the Cypher generation template and turn it into a Cypher generation prompt using the PromptTemplate class; the details don't matter for our purposes today. Next, create a new kind of chain. Before, we had question-and-answer chains; this one is different, a GraphCypherQAChain. It's built from an LLM, for which we'll use ChatOpenAI; the graph, which is the Neo4jGraph knowledge graph object we've been using to query Neo4j directly; verbose set to true, so it tells us what's going on as it works; and the cypher_prompt, which is the Cypher generation prompt we created above. Then add a small utility, similar to what we've used before, that uses textwrap to keep the output looking nice and tidy.
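Condensed into code, the wiring just described might look roughly like this. It's a hedged sketch: the import paths vary between LangChain versions, the template is abridged, and kg is the Neo4jGraph instance created earlier in the notebook.

    # Hedged sketch of the prompt-plus-chain setup described above.
    # Import paths differ across LangChain versions; adjust as needed.
    from langchain.prompts import PromptTemplate
    from langchain.chains import GraphCypherQAChain
    from langchain.chat_models import ChatOpenAI

    # Abridged template: the full guidance text lives in the notebook.
    CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
    Instructions:
    Use only the provided relationship types and properties in the schema.
    Do not use any other relationship types or properties that are not provided.
    Schema:
    {schema}
    Note: Do not include any explanations or apologies in your responses.
    Examples: Here are a few examples of generated Cypher statements for particular questions:

    # What investment firms are in San Francisco?
    MATCH (mgr:Manager)-[:LOCATED_AT]->(address:Address)
        WHERE address.city = 'San Francisco'
    RETURN mgr.managerName

    The question is:
    {question}"""

    CYPHER_GENERATION_PROMPT = PromptTemplate(
        input_variables=["schema", "question"],
        template=CYPHER_GENERATION_TEMPLATE,
    )

    # Newer LangChain releases may also require allow_dangerous_requests=True.
    cypherChain = GraphCypherQAChain.from_llm(
        ChatOpenAI(temperature=0),
        graph=kg,                              # the Neo4jGraph instance
        verbose=True,
        cypher_prompt=CYPHER_GENERATION_PROMPT,
    )

    cypherChain.run("What investment firms are in San Francisco?")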
Naturally, the first thing to try is asking exactly what we told the LLM about: what investment firms are in San Francisco? We showed it how to solve that, so let's see if it can. This is so cool. You can see the Cypher that was generated, and if you scroll back up in the notebook, it's very much like what we asked for, and like what we sent directly to Neo4j ourselves. Now try the same question with a different city. Fantastic: the WHERE clause was changed to look for the string literal Menlo Park, and that got the right answer, the management firms that are there, along with a natural-language statement about the result at the end.

Next, try something it hasn't been taught how to do yet. We taught it about investment firms; what about companies? Let's find companies that are in Santa Clara. We know there should be answers for this. We didn't even teach it specifically how to solve this, but with that one example and the schema of the graph, the LLM was able to generate a Cypher query with a pattern match from a company located at an address, where the address city is Santa Clara, just like we asked for. And we got the result: in Santa Clara there's Palo Alto Networks, Seagate, and Atlassian.

That was a variation from investment firms to companies. Let's try a different kind of variation. Recall that instead of finding things that are in a particular city, you can find things near a city by doing a distance calculation. Is that something the LLM can figure out how to write? Let's see. All we've taught it so far is how to do a pattern match, a WHERE clause, and a RETURN, so it needs to be taught a little more to answer a question like this. You can do that by changing the prompt and giving it a few more examples of the different things it can do. It didn't know a distance query was possible, so let's add that to the prompt. Looking back through the notebook, we already have an example of asking this question: there's a cell asking what investment firms are near Santa Clara. Paste that in and strip out the parts we don't need. The question "What investment firms are near Santa Clara?" is conveniently already there, so we just remove the part that makes the call to Neo4j through the kg (knowledge graph) interface and tidy up the whitespace so it all fits on screen. That becomes the new example added to the few we're giving the LLM.

Once you've changed the Cypher generation template, you also have to update everything that was built from it: re-run the cell that creates the Cypher generation prompt, then the Cypher chain that uses it; the pretty-chain helper will be just fine. With those updates, ask the question again and see what the LLM produces. Scroll up to see the complete result: it used the point.distance calculation in the WHERE clause, just as we taught it. Take a closer look to check that it's correct. We wanted investment firms near Santa Clara. It matches an address whose city is Santa Clara; great, that's the place we want to be near. Then there's the pattern match for a manager located at some address, and the distance calculation is between the Santa Clara address location and the manager's address location, required to be less than 10,000. It got the right query, and it looks like it got the right results. Great job, LLM. I think this is really cool: GPT-3.5 has seen enough Cypher that, even with just two examples, it does a pretty good job of generating it.
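Put together, that update-and-rerun cycle amounts to something like the following, continuing the earlier sketch. The example text is modeled on the near-Santa-Clara query shown earlier, and inserting it programmatically with replace() is just a stand-in for editing the template string directly in the notebook.

    # Hedged sketch: add a second few-shot example (the distance query)
    # to the template, then rebuild the prompt and chain built from it.
    NEAR_EXAMPLE = """
    # What investment firms are near Santa Clara?
    MATCH (address:Address)
        WHERE address.city = "Santa Clara"
    MATCH (mgr:Manager)-[:LOCATED_AT]->(managerAddress:Address)
        WHERE point.distance(address.location, managerAddress.location) < 10000
    RETURN mgr.managerName
    """

    # Slot the example in just before the final question section.
    CYPHER_GENERATION_TEMPLATE = CYPHER_GENERATION_TEMPLATE.replace(
        "The question is:", NEAR_EXAMPLE + "\nThe question is:"
    )

    CYPHER_GENERATION_PROMPT = PromptTemplate(
        input_variables=["schema", "question"],
        template=CYPHER_GENERATION_TEMPLATE,
    )
    cypherChain = GraphCypherQAChain.from_llm(
        ChatOpenAI(temperature=0),
        graph=kg,
        verbose=True,
        cypher_prompt=CYPHER_GENERATION_PROMPT,
    )

    cypherChain.run("What investment firms are near Santa Clara?")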
This is actually kind of fun, so let's provide one more example, this time one that connects from the companies down to the SEC filings we started with. We know that the first chunks come from Item 1, and Item 1 talks about what the business actually does. So let's teach the LLM to answer the question "What does Palo Alto Networks do?" You could then ask the same question about any of the other companies.

The Cypher example we'll provide to the LLM uses a full-text search to find the company by name, spelled exactly correctly this time: Palo Alto Networks. Then, from that company (the node renamed to com), match the form it filed, and continue from that form through a SECTION relationship to a chunk. That SECTION relationship is important: it points to the first chunk of a section, the head of the linked list. We require that the section's f10kItem is item1, which gets us to the first chunk of the chunks that make up Item 1, and we return the text of that chunk. That text is what the LLM will use to actually answer the question. (A sketch of this example appears at the end of the lesson for reference.) Save that, recreate the chain, and then ask: what does Palo Alto Networks do? Take a look at the generated Cypher: it does the full-text search and the pattern match just as we wanted, and it has the right item1 condition on the section. The result that comes back is the full text of the chunk, and from that text the LLM produces a final answer: a short paragraph about Palo Alto Networks, the global cybersecurity provider.

You're at the end of this lesson. Before moving on, pause the video and experiment with the prompt, with the Cypher queries given as examples in the prompt, and with asking different questions. See whether the LLM can generate Cypher that is appropriate for those questions. When it can't, just show it a few more examples: look through the notebook, find a relevant example, add it to the generation template, update the chains that depend on it, and run the question again to see what you get. When you're done, join me in the final video to wrap up.
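For reference, here is a hedged sketch of that Item 1 few-shot example. The FILED and SECTION relationship names, the f10kItem and text properties, and the full-text index name are assumptions based on how the graph is described in this course.

    # Hedged sketch of the third few-shot example: from a company, through
    # the form it filed and the SECTION relationship for Item 1, to the
    # text of the section's first chunk. Names are assumptions.
    ITEM1_EXAMPLE = """
    # What does Palo Alto Networks do?
    CALL db.index.fulltext.queryNodes("fullTextCompanyNames", "Palo Alto Networks")
      YIELD node, score
    WITH node AS com
    MATCH (com)-[:FILED]->(f:Form),
          (f)-[s:SECTION]->(c:Chunk)
      WHERE s.f10kItem = "item1"
    RETURN c.text
    """

    # Add it to the template as before, rebuild the prompt and chain,
    # and then ask:
    # cypherChain.run("What does Palo Alto Networks do?")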