So I've done a ton of work in this area.
Few learnings I've collected:
1. Lexical search with BM25 alone gives you very relevant results if you can do some work during ingestion time with an LLM.
2. Embeddings work well only when the size of the query is roughly on the same order of what you're actually storing in the embedding store.
3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.
So combining all 3 learnings, we landed on a knowledge decomposition and extraction step very similar to yours. But we stick a metaprompter to essentially auto-generate the domain / entity types.
LLMs are naively bad at identifying the correct level of granularity for the decomposed knowledge. One trick we found is to ask the LLM to output a mermaid.js mindmap to hierarchically break down the input into a tree. At the end of that output, ask the LLM to state which level is the appropriate root for a knowledge node.
Then the node is used to generate questions that could be answered from the knowledge contained in this node. We then index the text of these questions and also embed them.
You can directly match the user's query from these questions using purely BM25 and get good outputs. But a hybrid approach works even better, though not by that much.
Not using LLMs are query time also means we can hierarchically walk down the root into deeper and deeper nodes, using the embedding similiarity as a cost function for the traversal.
Thanks for sharing this! It sounds very interesting. We experimented with a similar tree setup some time ago and it was giving good results. We eventually decided to move towards graphs as a general case of trees. I think the notion of using embeddings similarity for "walking" the graph is key, and we're actively integrating it in FastGraphRAG too by weighting the edges by the query. It's very nice to see so many solutions landing on similar designs!
> Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.
This is honestly wear I think LLM really shines. This also gives you a very good idea if your documentation is deficient or not.
Very interesting. Thank you getting into the details. Do you chunk the text that goes into the BM25 index? For the hypothetical answer, do you also prompt for "chunk size" responses?
Thanks for sharing! These are all very helpful insights! We'll keep this in mind :)
PageRank is one of several interesting centrality metrics that could be applied to a graph to influence RAG on structural data, another one is Triangle Centrality which counts triangles around nodes to figure out their centrality based on the concept that triangles close relationships into a strong bond, where open bonds dilute centrality by drawing weight away from the center:
https://arxiv.org/abs/2105.00110
The paper shows high efficiency compared to other centralities like PageRank, however in some research using the GraphBLAS I and my coauthors found that TC was slower on a variety of sparse graphs than our sparse formulation of PR for graphs up to 1.8 billion edges, but that TC appears to scale better as graphs get larger and is likely more efficient in the trillion edge realm.
https://fossies.org/linux/SuiteSparse/GraphBLAS/Doc/The_Grap...
This is super interesting! Thanks for sharing. Here we are talking of graphs in the milions nodes/edges, so efficiency is not that big of a deal, since anyway things are gonna be parsed by a LLM to craft an asnwer which will always be the bottleneck. Indeed PageRank is the first step, but we would be happy to test more accurate alternatives. Importantly, we are using personalized pagerank here, meaning we give specific intial weights to a set (potentially quite large) of nodes, would TC support that (as well as giving weight to edges, since we are also looking into that)?
Since when does “good old PageRank” demand an OpenAI API key?
“You may not: use Output to develop models that compete with OpenAI” => they’re gonna learn from you and you can’t learn from them.
Glad we’re all so cool with longterm economic downfall of natural humans. Our grandkids might not be so glad about it!
LLMs are only used to construct the graph, to navigate it we use an algorithmic approach. As of now, what we do is very similar to HippoRAG (https://github.com/OSU-NLP-Group/HippoRAG), their paper can give a good overview on how things are working under the hood!
This is very cool, I signed up and uploaded a few docs (PDFs) to the dashboard
Our Use case: We have been looking at farming out this work (analyzing complaince documents (manufacturing paperwork) for our AI Startup however we need to understand the potential scale this can operate under and the cost model for it to be useful to us
We will have about 300K PDF documents per client and expect about a 10% change in that document set, month to month -any GraphRag system has to handle documents at scale - we can use S3 as an igestion mechanism but have to understand the cost and processing time needed for the system to be ready to use duiring:
1. inital loading 2. regular updates -how do we delete data from system for example
cool framework btw..
Thanks! It sounds like we should be able to help. I'd love to chat more in detail, feel free to send me a note at antonio [at] circlemind.co.
Super interesting, thanks for sharing. How large a corpus of domain specific text do you need to obtain a useful knowledge graph?
Aider has been doing PageRank on the call graph of code repos since forever. All non trivial code has lots of graph structure to support PageRank. So it works really well to find the most relevant context in the project related to the currently active task.
We have tried from small novels to full documentations of some milion tokens and both seem to create interesting graphs, it would be great to hear some feedback as more people start using it :)
This is cool! How is the graph stored and queried? I’m familiar with graph databases, but I don’t see that as a dependency.
Have you tried the sciphi triplex model for extraction? I’ve tried to do some extraction before, but got inconsistent results if I extracted the chunks multiple times consecutively.
The graph is currently stored using python-igraph. The codebase is designed such that it is easy to integrate any graphdb by writing a light wrapper around it (we will provide support to stuff like neo4j in the near future). We haven't tried triplex since we saw that gpt4o-mini is fast and precise enough for now (and we use it not only for extraction of entities and relationships, but also to get descriptions and resolve conflicts), but for sure with fine tuning results should improve. The graph is queried by finding an initial set of nodes that are relevant to a given query and then running personalized pageranking from those nodes to find other relevant passages. Currently, we select the inital nodes with semantic search both on the whole query and entities extracted from it, but we are planning for other exciting additions to this method :)
It looks awfully similar to nano graphrag, but I fail to see any credits to it.
How does domain and example queries help construct the knowledge graph, or is that just context for executing queries.
These are knobs that you can tune to make the graph construction more/less opinionated. Generally speaking, the more we make it opinionated the better it fits the task.
At a high-level:
(1) Domain: allows you to "talk to the graph constructor". If you care particularly about one aspect of your data, this is the place to say it. For reference, take a look at some of the example prompts on our website (https://circlemind.co/)
(2) Example Queries: if you know what class of questions users will ask, it'd be useful to give the system this information so that it will "keep these questions in mind" when designing the graph. If you don't know which kinds of questions, you can just put a couple of high-level questions that you think apply to your data.
(3) Entity Types: this has a very high impact on the final quality of the graph. Think of these as the types of entities that you want to extract from your data, e.g. person, place, event, etc
All of the above help construct the knowledge graph so that it is specifically designed for your use-case.
Cool! But I'm confused on your pricing. The github page says first 100 requests are free but the landing page says to self host if you want to use for free. I signed up and used the dashboard but I don't see a billing section or option to upgrade the account.
Thanks for trying it out! There are two options to use FastGraphRAG for free:
(1) Self-hosting our open-source package (2) Using the free tier of the managed service, which includes 100 requests
If you wish to upgrade your plan, you can reach out to us at support [at] circlemind.co
So what is the answer to "Who is Scrooge?" and is it different / better than another approach?
( Like whole thing in contenxt window for instance? )
Is this approach just for cost savings or does it help get better answers and how so?
Could you share a specific example?
Generally speaking RAG comes in the game when it is impractical to use large context windows for three reasons: (1) accuracy drops as you stuff the context windows, (2) currently, context windows do not scale past 1M tokens, and (3) even with caching, moving millions of tokens is wasteful and not viable both in terms of costs and latency.
So we should really compare this to other RAG approaches. If we compare it to vector databases RAG, knowledge graphs have the advantage that they model the connections between datapoints. This is super important when asking questions that requires to reason across multiple pieces of information, i.e. multi-hop reasoning.
Also, the graph construction is essentially an exercise in cleaning data to extract the knowledge. Let me give you a practical example. Let's pretend we're indexing customer tickets for creating an AI assistant. If we were to store the data on the tickets as it is, we would overwhelm the vector database with all the noise coming from the conversational nature of this data. With knowledge graphs, we extract only the relevant entities and relationships and store the distilled knowledge in our graph. At query time, we find the answer over a structured data model that contains only clean information
Makes sense, but so can you compare it to to RAG then and show how an answer is superior and what the context contains that makes it superior?
Or how it is close to large context quality of answer with lower cost on some specific examples.
It's helpful when a readme contains a demonstration or as I said above, a specific example.
What solutions are folks using to solve queries like "How many of these 1000 podcast transcripts have a positive view of Hillary Clinton"? Seems like you would need a way to map reduce and count? And some kind of agent assigner/router on top of it?
We do a lot of things with podcast and other audio media at https://listenalert.com
But in general we found the best course of action is simply label everything. Because our customers will want those answers and rag won’t really work at the scale of “all podcasts the last 6 months. What is the trend of sentiment Hillary Clinton and what about the top topics and entities mentioned nearby”. So we take a more “brute force” approach :-)
At the moment this repo is designed to handle more RAG-oriented use cases, i.e. that require to recall the "top pieces of information" relevant to a given question/context. In your specific example, right now, FastGraphRAG would select the nodes that represent podcasts that are connected to Hilary Clinton, feed them to an LLM which would then select the ones that are positively associated with her. As a next step, we plan to weight the connections between nodes given the query. This way, PageRank will explore only edges which carry the concept "positively associated with", and only the right podcasts would be selected and returned, without having to ask an LLM to classify them. Note that this is basically a fuzzy join and so it will produce only a "best-effort" answer rather than an exact one.
I don't have a dev answer, but in case its relevant, I've seen commercial services that I imagine are doing something similar on the back end-- ground news is one of them. I wish they had monthly subs for their top tier plan rather than only annual, but it seems like a cool product. I haven't actually used it though.
What feature(s) of the top tier plan do you wish you had? I have no idea how their subs work but have seen a few ads for the product so have a vague idea that it rates news for bias but don’t see how that would involve many different tiers of subs.
It’s been a while since I looked, but unless they changed it, you needed the top tier plan to get a report analyzing the biases of your reading choices and recommending things to balance it out.
Ah, that’s an interesting feature. Would they make specific article or outlet recommendations or broad categories of suggestions?
Anticipate what kind of questions user might ask, pre-compute the answers and store them as natural sentences in a vector database.
llm can write graph queries
Just out of interest: why is every python file prefixed with an underscore? I’ve never seen it before. Is it to avoid collisions with package imports? e.g. “types”
It is to mark the package as private (in the sense that for normal usage you shouldn't need it). We are still writing the documentation on how to customize every little bit of the graph construction and querying pipeline, once that is ready we will expose the right tools (and files) for all of that :) For now just go with `from fast_graphrag import GraphRAG` and you should be good to go :)
It’s the standard practice of noting internal/implementation details. Users should stick with the public api exported in __init__.
Interesting, thanks! I’ve seen it for importable stuff but never for modules.
I guess I’m getting old
What is the difference to HippoRAG, which seems to be the same approach but came our earlier?
HippoRAG is an amazing work and it was a source of inspiration for us as noted in the references. There are a couple of differences:
(1) FastGraphRAG allows the user to make the graph construction opinionated and specialized on a given domain and for a given use-case; this allows to clear out all the noise in the data and yields better results; (2) Unlike HippoRAG, FastGraphRAG initializes PageRank with a mixture of semantic retrieval and entity extractions; (3) HippoRAG is the outcome of an academic paper, and we saw the need for a more robust and production-ready implementation. Our repo is fully typed, includes tests, handles retries with Instructor, uses structured outputs, and so on.
Moving forward, we see our implementation diverge from HippoRAG more radically as we start to introduce new mechanisms such as weighted edges and negative PageRank to model repulsors.
Could this be used for context retrieval and generative understanding of codebases?
Yes, feel free to try it out! You can specialize the graph for the codebase use-case by configuring the graph prompt, entity types, and example questions accordingly.
Neat, we are doing something similar with cognee, but are letting users define graph ingestion, generation, and retrieval themselves instead of making assumptions: https://github.com/topoteretes/cognee
does FastRAG integrate with other graph databases like neo4J ?
We are building connectors for that, so it will soon :) At the moment we are using python-igraph (which does everything locally) as we wanted to offer something as ready to use as possible.
I'd like to partner to see if a connector to a graph db can be mutually beneficial and provide some value to users. How do I reach out ? NOTE: Im not from Neo4j
That would be awesome, we have a discord you can join and we can talk there (link is in the github repo, message Antonio) or you can message antonio [at] circlemind.com
Can this method return references to source documents?
Yes. We already support this feature in our managed service (https://docs.circlemind.co/essentials/query#include-referenc...) and we'll include it in the next open-source release too!
Please tell me I’m not the only one that sees the irony in AI relying on classic search.
Obviously LLMs are good at some semantic understanding of the prompt context and are useful, but the irony is hilarious
I think the main bit here is that the knowledge graph is entirely constructed by LLMs. It's not just using a pre-existing knowledge graph. It's creating a knowledge graph on the fly based on your data.
Navigating the graph, on the other hand, is the perfect task for PageRank.
Exactly! Also PageRank is used to navigate the graph and find "missing links" between the concepts selected from the query using semantic search via LLMs (so to be able to find information to answer questions that require multi-hop or complex reasoning in one go).
Makes perfect sense.
The semantic understanding capabilities fit well for creating knowledge graphs.
I don't get what the irony is here.
Not who you're replying to, but from my vantage point, marketing folks seem to be pushing LLM products as replacements for traditional search products. I think what the post is proposing makes perfect sense from a technical perspective, though. The utility of LLMs will come down to good old-fashioned product design, leveraging existing concepts, and novel technical innovation rather than just dumping quintillions of dollars into increasingly large models and hardware until it does everything for us.
Exactly this.
I work in the LLM-augmented search space, so I might be a little too tuned in on this subject.
Wonder why this all - here on HN - is not part of the readme .md which says absolutely nothing about how and why this all would work.
The whole approach to representing the work, including the writing here, screams marketing, and the paid offering is the only thing made absolutely clear about it.
p.s. I absolutely understand why a knowledge graph is essential and THE right approach for RAG, and particularly when vector DBS on their own are subpar. But so do know many others and from the way the repo is presented it absolutely gives no clue why yours is _something_ in respect to other/common-sense graph-RAG-somethings.
You see, there are hundreds of smart people out there who can easily come to conclusion data needs to be presented as knowledge in graph-ontological way and then feed the context with only the relevant subgraph. Like, you could’ve said so much rather than asking .0084 cents or whatever for APIs as the headline of a presumably open repo.
HN is “for” startups. This is a startup. What’s the problem?
I completely agree that the README could do a better job explaining the implementation details and our reasoning behind key design choices. For instance, we should elaborate on why we believe using PageRank offers a more effective exploration strategy compared to other GraphRAG approaches.
FastGraphRAG is entirely free to use, even for commercial applications, and we’re happy to make it accessible to everyone. The managed service is how we sustain our business.
Can this be used with LLMs other than the OpenAI API?
Yes, it works out-of-the-box with any OpenAI compatible API, including Ollama.
You can check out our example at https://github.com/circlemind-ai/fast-graphrag/blob/main/exa...
it would be nice to see an example that uses ollama - given that ollamas embeddings endpoint is a bit... different, I can't quite figure this out
Hey! Our todo list is a bit swamped with things right now, but we'll try to have a look at that as soon as possible. On the Ollama github I found contrasting information: https://github.com/ollama/ollama/issues/2416 and https://github.com/ollama/ollama/pull/2925 They also suggest to look at this: https://github.com/severian42/GraphRAG-Local-UI/blob/main/em...
Hope this can help!