Building long context RAG with RAPTOR from scratch
Summary
TLDR: The video discusses the use of long-context LLMs for projects like code assistants, highlighting their benefits and limitations. It introduces a method called Raptor for retrieval with long-context models, which recursively clusters and summarizes documents to create a hierarchical document tree for robust information retrieval. The approach is demonstrated using Anthropic's new model, Claude 3, and emphasizes its potential for handling large document sets while avoiding context limits.
Takeaways
- Introduction of long-context LLMs and the Raptor retrieval method
- Use of long-context LLMs in projects like code assistants for specific programming languages
- Importance of considering cost and latency with long-context LLMs
- Evaluation of 20 questions across different runs, showing variance in latency and cost
- Comparison between long-context LLMs and RAG systems for retrieval
- Discussion of the p50 and p99 latency metrics for understanding performance
- Consideration of using a local LLM like Mistral 7B v2 with a 32,000-token context window
- Idea of building a document tree for efficient retrieval and summarization
- Raptor's approach of recursively clustering and summarizing documents
- Use of KNN on embedded documents for retrieval without chunking
- Open-sourcing of Raptor's code and its potential use with LLMs like Claude 3
Q & A
What is the main topic discussed by Lance from LangChain?
-The main topic discussed by Lance is retrieval and long context LLMs, specifically focusing on a new method called Raptor.
What are some of the long context LLMs mentioned in the transcript?
-The long context LLMs mentioned include Gemini and Claude 3. Lance also describes using a long context LLM to answer coding questions about the LangChain Expression Language docs.
What are the considerations Lance wanted to point out regarding the use of long context LLMs?
-Lance pointed out considerations such as the latency and cost associated with using long context LLMs, as well as the limitations when dealing with documents larger than the context window.
What is the Raptor method and how does it aim to address the limitations of long context LLMs?
-The Raptor method is a retrieval strategy that builds a document tree, creating higher-level summaries of clusters to consolidate information from different documents. This approach allows for the integration of information across various documents, addressing the challenge of retrieving comprehensive answers using traditional KNN methods.
How does the Raptor method handle the retrieval of information from multiple documents?
-The Raptor method handles retrieval by building a document tree structure where each document or set of documents is summarized and indexed. This allows for the retrieval of information not only from individual documents but also from higher-level summaries, capturing information from multiple sources when needed.
What is the significance of using a Gaussian Mixture Model (GMM) in the Raptor method?
-The GMM is used to model the distribution of clusters and documents across the clusters. It allows for the automatic determination of the optimal number of clusters and assigns probabilities to each document for cluster membership, enabling a document to belong to multiple clusters if appropriate.
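The thresholding step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `assign_clusters`, the probability values, and the 0.3 threshold are all hypothetical and are not taken from the video or the paper's code. It shows how a cutoff applied to per-cluster probabilities lets one document land in more than one cluster.

```python
def assign_clusters(membership_probs, threshold=0.1):
    """Return, for each document, every cluster whose probability exceeds the threshold."""
    assignments = []
    for probs in membership_probs:
        # Keep every cluster index that clears the cutoff, not just the best one.
        clusters = [idx for idx, p in enumerate(probs) if p > threshold]
        assignments.append(clusters)
    return assignments


# Hypothetical per-document probabilities over three clusters (each row sums to 1).
probs = [
    [0.85, 0.10, 0.05],
    [0.45, 0.50, 0.05],
    [0.02, 0.08, 0.90],
]
labels = assign_clusters(probs, threshold=0.3)
print(labels)
```

With these made-up numbers the second document clears the cutoff for two clusters, which is the soft-membership behaviour the answer describes. In practice the probabilities would come from the fitted mixture model rather than being typed in by hand.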
What is the role of UMAP in the Raptor clustering process?
-UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique. In the Raptor method it is applied to the document embeddings before clustering, which Lance says helps improve the clustering.
How does the Raptor method deal with the challenge of setting the K parameter in KNN retrieval?
-The Raptor method mitigates the challenge of setting the K parameter in KNN retrieval by building a document tree that includes higher-level summaries. This structure allows for the integration of information from multiple documents, ensuring that the retrieved information is comprehensive even if the K parameter is not perfectly set.
What is the advantage of using the Raptor method for long context LLMs?
-The Raptor method is advantageous for long context LLMs as it provides a lightweight and easy-to-use retrieval strategy that can handle documents larger than the model's context window. It also offers robustness in retrieving answers that may require information integration from multiple documents.
What model did Lance decide to use for summarizing individual documents in the Raptor method?
-Lance decided to use Anthropic's new model, Claude 3, for summarizing individual documents in the Raptor method due to its strong performance and suitability for handling large documents.
How does the Raptor method ensure that the retrieved information is comprehensive and relevant?
-The Raptor method ensures comprehensive and relevant information retrieval by creating a document tree with higher-level summaries and indexing both the raw documents and these summaries. This allows for the possibility of retrieving from a broad pool of information, including detailed raw documents and consolidated summaries, to answer a wide range of questions effectively.
Outlines
Introduction to Retrieval and Long Context LLMs
Lance from LangChain introduces the topic of retrieval and long context language models, specifically focusing on a new method called Raptor. He discusses the recent developments in long context LLMs like Gemini and Claude, and shares his experience using them for projects such as a code assistant. Lance highlights the benefits of using long context models for answering coding questions, emphasizing the efficiency of context-based answers without the need for retrieval. However, he also points out the need to consider the limitations and costs associated with using very long context models, leading to the exploration of retrieval strategies that can complement long context models.
Evaluation and Considerations of Long Context LLMs
Lance presents an evaluation of long context LLMs by examining 20 questions and their generation times, noting the variance in latency and cost. He discusses the trade-offs between using long context models and RAG systems, highlighting the challenges of setting the K parameter for retrieval strategies. The conversation then shifts to the possibility of using a local LLM, such as Mistral 7B v2, and the considerations that arise when dealing with document sizes larger than the model's context window. Lance shares his thoughts on the need for a lightweight retrieval strategy that can handle larger documents while still utilizing long context.
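To make the p50 and p99 figures concrete, here is a minimal sketch of a nearest-rank percentile over a 20-generation run. The latency values are invented for illustration and are not the numbers from the video's dashboard; the `percentile` helper is likewise an assumption, not how LangSmith computes its metrics.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-generation latencies in seconds for a 20-question run.
latencies = [31, 33, 34, 35, 35, 36, 36, 37, 38, 38,
             39, 40, 41, 42, 43, 44, 46, 52, 90, 420]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50}s p99={p99}s")
```

With only 20 samples, a nearest-rank p99 is simply the slowest generation, so a single long call can dominate that figure while leaving p50 almost unchanged.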
Raptor: A Retrieval Strategy for Long Context LLMs
Lance introduces Raptor, a recently published paper on a retrieval strategy for long context LLMs, whose code has been open-sourced and packaged as a LlamaPack by the folks at LlamaIndex. Raptor's approach involves clustering and summarizing documents at various levels, creating a document tree that allows for the integration of information from multiple sources. This method addresses the challenge of setting the K parameter in retrieval by building a hierarchy of document summaries, which can be used to retrieve information at different levels of abstraction. Lance emphasizes the potential of this approach to complement long context models and provide more robust retrieval across a range of questions.
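The recursive cluster-then-summarize loop can be sketched as follows. This is a simplified illustration, not the paper's or the notebook's code: `build_tree`, `pair_up`, and `join_summary` are hypothetical names, and the clustering and summarization steps are stand-in functions so the example runs without an embedding model or an LLM. The stopping rule (a maximum number of levels, or a single remaining cluster) follows what the video describes.

```python
from typing import Callable, Dict, List


def build_tree(
    leaves: List[str],
    cluster_fn: Callable[[List[str]], List[List[int]]],
    summarize_fn: Callable[[List[str]], str],
    max_levels: int = 3,
) -> Dict[int, List[str]]:
    """Cluster and summarize repeatedly; return the summaries produced at each level."""
    levels: Dict[int, List[str]] = {}
    texts = leaves
    for level in range(1, max_levels + 1):
        clusters = cluster_fn(texts)  # each cluster is a list of indices into texts
        summaries = [summarize_fn([texts[i] for i in members]) for members in clusters]
        levels[level] = summaries
        if len(clusters) <= 1:  # stop once everything collapses into one cluster
            break
        texts = summaries  # the next level clusters the summaries themselves
    return levels


# Stand-in clustering: group neighbouring texts in pairs (a real run would embed and cluster).
def pair_up(texts: List[str]) -> List[List[int]]:
    return [list(range(i, min(i + 2, len(texts)))) for i in range(0, len(texts), 2)]


# Stand-in summarizer: a real run would call an LLM with a summarization prompt.
def join_summary(texts: List[str]) -> str:
    return "summary(" + " + ".join(texts) + ")"


tree = build_tree(["a", "b", "c", "d"], pair_up, join_summary)
for level, summaries in tree.items():
    print(level, summaries)
```

Passing the clustering and summarization steps in as callables keeps the loop itself independent of any particular embedding model, clustering library, or LLM.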
Implementing Raptor with Anthropic's New Model
Lance outlines the process of implementing Raptor using Anthropic's new model, Claude 3, to summarize individual documents without worrying about their size. He describes the process of clustering and summarizing web pages related to the LangChain Expression Language, using a histogram to illustrate the token count distribution of each page. Lance explains the clustering process, which involves a Gaussian mixture model (GMM), UMAP for dimensionality reduction, and local and global clustering analysis. The process allows documents to belong to multiple clusters, enhancing the retrieval strategy. He shares the code used for clustering and summarizing, and discusses the benefits of building a document tree for retrieval purposes.
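Because a document can belong to several clusters, the per-cluster text has to be gathered with duplicates allowed before it is summarized. The video does this by expanding a data frame; the sketch below uses a plain dictionary instead as a stand-in, and the function name `texts_by_cluster` and the sample pages are hypothetical.

```python
from collections import defaultdict


def texts_by_cluster(texts, assignments):
    """Group texts under every cluster they belong to (a text may appear more than once)."""
    grouped = defaultdict(list)
    for text, clusters in zip(texts, assignments):
        for cluster in clusters:
            grouped[cluster].append(text)
    return dict(grouped)


# Hypothetical pages and cluster assignments; "page B" belongs to two clusters.
pages = ["page A", "page B", "page C"]
assignments = [[0], [0, 1], [1]]

grouped = texts_by_cluster(pages, assignments)

# Each cluster's texts would then be joined and passed to a summarization prompt.
prompt_inputs = {cluster: "\n\n".join(members) for cluster, members in grouped.items()}
print(prompt_inputs)
```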
Retrieval with Raptor and Long Context Indexing
Lance concludes by discussing the application of Raptor for retrieval, emphasizing its compatibility with long context models. He explains how the Raptor approach allows for indexing of both raw documents and higher-level summaries, enabling the retrieval of information at different levels of detail. Lance demonstrates the setup of a retrieval chain using an index that contains raw web pages and summaries, and shares the results of a retrieval attempt that combines both raw and summary pages. He acknowledges that while the full context for this specific case could be handled by a single model, the Raptor approach is beneficial for cases where document sizes exceed the model's context limit, making it a versatile solution for various scenarios.
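Retrieval over a single pool of raw pages and summaries can be illustrated with a tiny nearest-neighbour search. This is a toy sketch under stated assumptions: the three-dimensional vectors, the entry texts, and the helper names `cosine` and `top_k` are all made up, whereas the video uses real embeddings stored in a vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the texts of the k entries most similar to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return [item["text"] for item in scored[:k]]


# Hypothetical pooled index: raw leaf pages and a higher-level summary side by side.
index = [
    {"text": "raw page: streaming", "vector": [1.0, 0.0, 0.0]},
    {"text": "raw page: fallbacks", "vector": [0.0, 1.0, 0.0]},
    {"text": "summary: streaming and fallbacks", "vector": [0.7, 0.7, 0.0]},
]

# A query that touches both topics sits closest to the summary entry.
results = top_k([0.6, 0.8, 0.0], index, k=2)
print(results)
```

In this toy case the summary ranks first because it overlaps with both topics, which mirrors the point that a summary can cover a question spanning several pages even when K is small.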
Keywords
Retrieval
Long-Context LLMs
Raptor
Latency
Cost
Mistral 7B v2
Document Level Indexing
Document Tree
GMM (Gaussian Mixture Model)
UMAP (Uniform Manifold Approximation and Projection)
Thresholding
Highlights
Discussion of Retrieval and Long Context LLMs
Introduction of a new method called Raptor
Mention of the shift from RAG to long context LLMs like Gemini and Claude
Use of long context LLMs in projects such as a code assistant
Challenges with latency and cost in long context LLMs
Comparison between long context LLMs and RAG systems
Considerations for using long context LLMs
Evaluation of 20 questions using long context LLMs
Discussion on the possibility of using a local LLM like Mistral 7B v2
Proposal of minimalist splitting strategies for long context LLMs
Idea of indexing at the document level for retrieval
Concept of building a document tree for retrieval
Overview of the Raptor paper and its open-sourcing
Explanation of the clustering process in Raptor
Use of Gaussian Mixture Model (GMM) in Raptor's clustering
Integration of UMAP for dimensionality reduction in clustering
Description of the iterative summarization process in Raptor
Creation of a long-context index for retrieval using Raptor
Demonstration of retrieval using both raw documents and higher-level summaries
Potential application of Raptor for documents exceeding the context limit of LLMs
Transcripts
hi this is Lance from LangChain I'm
going to be talking about retrieval and
long-context LLMs and a new method
called
Raptor so over the last few weeks
there's been a lot of talk about is RAG
dead with the advent of new long-context
LLMs like Gemini with a million tokens Claude
3 now with up to a million tokens it's
an interesting
question um I've recently been using
long-context LLMs for certain projects
like for example this code
assistant that I put out last week
basically used a long-context LLM to
answer coding questions about our docs
on LangChain Expression Language I'll kind of
zoom in here so you can see it um so
these are around 60,000 tokens of
context we take the question we take the
docs we produce an answer and this is
really nice no retrieval required just
context stuff all these docs and perform
answer generation directly
so I'm a big fan of using long-context
LLMs in this
way but there are some considerations I
wanted to like to to kind of point out
here so I ran evaluations and for those
evaluations I look at 20 questions um so
basically it's 20
generations now look here so this is the
LangSmith dashboard that I used for
those evals and you can see something
kind of interesting the p50 latency
tells you the 50th percentile latency
for each of those generations um so
again remember there's
20 so it's around 35 to you know 46
seconds depending on the trial this
is on the same data set same 20
questions there's some variance run to
run so that's kind of expected and again
the p99 it's up to like okay 420 seconds
that's really long in that
case but maybe more interestingly if you
look at the cost again there's 20
questions so the cost is ranging from
maybe like a dollar to you know
a dollar 30 per
generation so you know cost and latency are
things to think about when you're
talking about using really long-context
LLMs as opposed to like a RAG system
where you're performing
retrieval of much smaller more directed
chunks to your
question now the other thing that came
up is a lot of people asked hey can you
swap this out and use a local
LLM and my go-to local LLM is Mistral 7B
V2 which actually has a 32,000 token
context window but that's still a little
bit big relative to my docs which are
around 60,000 tokens so you know I
couldn't just context stuff them as I
did here so these three considerations
kind of led me to think
about I really like working with long
context models and it's absolutely going
to be the continuing thing but are there
retrieval strategies that are like
lightweight easy to use with long
context models um that kind of like
preserve the ability to utilize a lot of
context uh but can address some of these
limitations um in particular this last
piece was important because this is
something I want to do kind of in the
near term and I need kind of like a nice
lightweight retrieval strategy that
still uses long context but can operate
in cases where my documents are maybe
just a little bit bigger than my context
window in this case like around
2x so I kind of put this out on Twitter
and said hey has anyone come across like
good like maybe minimalist splitting
strategies for long-context LLMs you know
like I wanted to do RAG with Mistral 7B with
a 32,000 token context window but my
docs are 60,000 tokens I can't just
context stuff them but I also don't want
some like very fine scale chunking type
thing like I get it we don't want to
mess with all that we want something
simple that just can like kind of work
across larger
documents so one point that was raised
which is a really good one is well just
um just index at the document level so
you can take full documents and just
embed them directly it's a fair point
and then you do something like KNN on
those embedded documents so again no
chunking of any documents no splitting
of documents you have your set of
documents embedded one and just retrieve
at the document level that's a pretty
good idea that's pretty
reasonable another idea that came up
though is this idea of building a
document tree and part of the reason for
that is when you talk about something
like KNN or like you know K nearest
neighbor retrieval on a set of embedded
documents it is true that sometimes an
answer requires maybe two or three
different documents kind of integrated
in order to answer it now if you context
stuff everything that's not a problem
because it's all there if you're doing
retrieval well you're setting your K
parameter to be some value it's kind of
brittle does it need to be like four or
five or six to capture all the context
needed for certain particular questions
so it's kind of hard to set that so this
idea of building a document tree is kind
of an interesting way to potentially
address this challenge with like basic
KNN so a paper Raptor came out recently
on this exact
idea um and their code recently open
sourced which led the folks at LlamaIndex
to come out with a LlamaPack for
it which is great um and the idea is
pretty interesting so I wanted to kind
of lay it out here and talk about how it
might benefit this exact case of kind of
long context
retrieval so the intuition is pretty
simple First We Take a set of documents
now note that these documents can be any
sized so in their case they're just
chunks so they're like 100 tokens but it
doesn't matter so we start with a set of
raw documents now what we do is we embed
them and then we cluster them so this
clustering process groups together like
documents and then we do one important
thing we summarize information in that
cluster into what we call kind of like a
more abstract or higher level summary of
that
content and we do that recursively until
we end up with one cluster that's it so
what's happening is you're starting with
the set of what they call leaves or like
raw documents you do a
grouping uh via clustering you do a
summarization steps you're kind of
compressing and then you do it again and
the idea is that these kind of midlevel
or eventually like root level or highest
level summaries can consolidate
information from different places in
your documents now what they do is they
basically just embed those summaries
along with the raw leaves and they
perform retrieval and we'll talk about
that a little bit later but what they
show is actually just doing retrieval on
all of these together like as a whole
pool performs best um and that's kind of
a nice result it's pretty easy then to
basically just index that and and use
it I will make a note that their paper
talked about you know these leaves being
chunks which I didn't love because look
I want to work with long context models
and like I don't want to deal with
chunking at all and I've replied you
know I replied to Jerry's tweet on this
and Jerry made a fair point that you
know this can scale to anything so for
example those leaves can be full
documents they don't have to be chunks
that's completely reasonable Point um so
again you can kind of think about this
as if idea one was let's just take each
document and embed it idea two is well
let's embed each document like we did
and we can also build kind of like a a
document abstraction Tree on top and
embed those so we have these like higher
level summaries in our embeddings which
we can retrieve from if we need an
answer to consolidate information from like a
small set of documents right so it's a
little bit more robust maybe to this
problem which is that if I'm just doing
KNN on raw documents and I need
information from like two or three
documents I'm not guaranteed to always
get that because of this K parameter
that I set I'm only retrieving k docs
whereas here I'm building these docs
that contain information from multiple
leaves or multiple you know sub-docs so to
speak um and it can actually just
capture that information uh in
kind of a nice way um such that it
can basically integrate information
across different individual leaves or
individual documents so that's the key
Point um and so we can you can kind of
see when you think about like working
long context models of course context
stuffing is a great option if you can do
it but there are some other interesting
ideas one is actually just embedding
full documents and another is this idea
of again documents and an abstraction
tree so let's go ahead and just build
Raptor because it's pretty interesting
and to do this I'm actually
going to look at Claude 3 which just came
out today it's a new set of models from
Anthropic really strong performance and
should be really good for this use case
because what I want to do is I want to
perform summaries of individual
documents and I don't really want to
worry about the size of those
documents um so I'm going to use the
same set of documents that I previously
did with the code generation example
that video came out last week and I have
an empty notebook here um we just do
a few pip installs I'm setting a few
environment variables for LangSmith and now
I'm just going to say grab my docs so
that's right here
and this is going to grab around 33 web
pages of documentation from
LangChain related to LangChain Expression
Language okay and what I'm going to plot
here is a histogram of the token counts
of every page so a bunch are kind of
small that's fine easy to work with so
less than 2,000 tokens a few are pretty
big so up to like 12,000
tokens so that kind of gives you a sense
of the distribution of pages that we
want to work with and we're going to
apply this approach to those pages um
now I'm going to use Anthropic's new
model to do that um and I'll use OpenAI
embeddings so that's fine I set those
and now what I'm going to do so this
code was released uh by the authors of
the paper and I'm going to explain how
this works in a little bit but for right
now I'm just going to copy this over and
this is all going to be accessible to
you in the notebook that we're going to
make public uh so this is all the
clustering code and we're going to talk
about what it's doing later I added
comments and Doc strings to this um so
it's it's a little bit more
understandable
here's some code that I wrote um that
basically is going to do like
orchestrate the process of the cluster
summarize um and then like iteratively
perform that until you end up with a
single
cluster um so there we go I'm going to
copy this code over and I'm going to
kick this process off and then I'm going
to walk through it while this is running
so that's running now now first I want
to kind of explain how this clustering
process works it's it's kind of
interesting um so the idea actually
incorporates three important actually
Four important
points so it's using this GMM this
Gaussian mixture model to model the
distribution of the different clusters
so what's kind of cool about this
approach is that you don't actually tell
it group the data into some number of
clusters like you do in some other approaches
where you kind of set the parameter you want
like n clusters here um it actually can
kind of infer or determine the
optimal number of clusters and it uses
this like BIC again you can dig into
this in more detail if you want but the
intuition is that uh this approach will
kind of guess or attempt to to determine
the number of clusters automatically for
you um and it's also modeling the
distribution of your individual
documents across the
clusters um it uses this UMAP or
basically it's a dimensionality
reduction approach to improve the
clustering process so if you want to
like really read into this then you
should actually just go and do that um
the intuition is that this actually
helps improve
clustering um it also does clustering
what they call like local and Global so
it tries to analyze the data at two
different scales um like kind of look at
like patterns kind of within smaller
groups and then like within the full
data set to try to improve how you're
going to group these documents uh
together and it applies thresholding to
assign the basically the group
assignment for every document or the
cluster assignment for every document so
this is really the
idea here's all my documents let's look
at this one what's happening is it's
using this GMM to basically assign a
probability that this document belongs
to each one of our clusters so like
here's cluster one here's cluster two
here's cluster three each cluster will
get a
probability and this thresholding then
is applied to those
probabilities such that a document can
actually belong to more than one cluster
so that's actually really nice cuz in a
lot of other approaches it's kind of
mutually exclusive so document can only
live in one or another cluster but with
this approach it can actually belong to
multiple clusters so that's like a nice
benefit of this
approach um I think that's kind of all I
want to say initially about this
clustering strategy uh but you should
absolutely have a look at the paper
which I uh will also ensure that we
link um so right now let's actually go
look at the code so we can see that it's
performing this this cluster
generation and let's actually look at
kind of what it's doing so it really
follows what we just talked
about we have a set of texts and in this
case my texts are just that those you
know those 33 web pages uh that I'm
passing in so we can actually look at
that so what I passed in is these leaf texts
leaf texts I define as my docs again
let's actually go back and look at our
diagram so we can like follow
along uh here we go so these leaves are
my web pages that's it so here's my leaf
text and you can see let's look at the
length there uh okay there's 31 of them
so that's
fine um and what's happening is those
first
get embedded as
expected and then so here's the
embeddings and then they get clustered
and this perform clustering is taken
directly from basically the results uh
or the code provided by the authors of
the paper so it's doing that process I
just talked about um of clustering
basically cluster assignment um and we
get our cluster labels out we put those
in a data frame um and so so then we
have our clusters you can see that here
and because each document can belong to
more than one cluster we actually expand
out the data frame um so that the
cluster column um may contain
duplicates uh for a single document so
one document can live in multiple
clusters and we just flatten it out to
show that then all we do is we get the
whole list of clusters here um then
define a summarization prompt pretty
simple and all we do is we have our data
frame just fish out give me all the
texts within each cluster and that's all
we're doing here for each cluster get
all the text plumb it into our
summarization prompt generate the
summary here's our summary data frame
that's really it so again iterate
through our clusters get the text for
every cluster summarize it write that
out to a data frame and that's all we do
here's our cluster data frame here's our
summary data frame from that function um
and this is just orchestrating that
process of like iteration so we just
keep doing this until I provide like a
level or n levels parameter you can say
do this n times or um you know uh or
um the number of clusters is equal to
one so so basically this is saying
continue until either we've done n
levels or like n number of of of
attempts um in our tree or the number of
clusters is one keep doing that and
that's it so you can see we've actually
run that process we have our results
Here and Now what we can do is pretty
simply um we can just put those
into uh an index like we can use Chroma
as a vector store um so here's just some
really simple code to do that we're just
iterating through our results we're
getting all our summaries out so first
maybe I should make this a little bit more
clear we take all those raw
documents and we
add those to our text list we then get
all of our summaries from our tree we
add those and we just index all of them
so let's do that so these are
going to all be added to Chroma and
finally we can set up a retrieval
chain that is
using this index which contains both our
leaves so all those raw web pages and
these higher level summary pages that's
all that's happening here we pull in a
rag prompt um here's our retriever
here's our question so let's give this a
shot so this is running and I want to
just bring you back to the diagram so
again you can kind of see what's going
on
here
um right here so again we took our web
pages uh again 31 of them we cluster
them we summarize them we do that
iteratively um then what we do is we
take those summaries that we generated
along with the raw web pages and we
index all of them that's it and we can
use that index for retrieval so this is
like a nice what we might call a long
context index because it contains just
raw web pages which vary from 2,000 to
12,000 tokens and it contains in our
case these higher level summaries in
case we need an integration of
information across those pages um which
may or may not be captured just using
KNN retrieval so that's the big idea
okay this ran we got our answer we can
check LangSmith and we can see here's
our
retriever um and let's see here's the
raw
documents so it looks like it retrieved
some kind of higher level summaries as
well as some raw leaves so this is like a
raw web page and then some of these are
more like summary Pages which looks like
we produced so what's kind of cool about
this is you can retrieve from a
combination of like your raw Pages as
well as these higher level summaries
which gives you some robustness and
semantic coverage for different types
of questions that require like different
resolutions of of abstraction or
detailed answer like a really detailed
code question you might retrieve
directly from your raw pages but like a
higher level question that integrates
information from a bunch of pages you
might retrieve from these midlevel or
even top level summaries so it's a cool
approach it integrates kind of nicely
with long context models and I know one
thing that will come up here is well
look your full context was only 60,000
tokens you could just stuff all of that
into one of these models you didn't need
to do any of this that is absolutely
true for this case but what I think the
high level point is that's not true for
every case for example this exact set of
documents I want to use with Mistral and Mistral
is only 32,000 tokens so this is a
really nice approach for that case where
I can kind of guarantee that I can index
across all these pages but I won't
exceed the context limit or am unlikely
to exceed the context limit of my LLM
because none of these individual pages
exceed 32,000 tokens so you know you can
see and again this scales arbitrarily
large so it is true that this set of
documents is only 62,000 tokens
but of course there's much larger
corpuses which could extend beyond even
the 200,000 of Claude 3 in which case this
type of approach of kind of indexing
across documents um and building these
like kind of mid-level high level
summaries can be applicable so it's a
cool method it's a neat paper um I
definitely encourage you to experiment
with it um and all this code will be
available um for you to to work with and
um I think that's about it thanks very
much