Building long context RAG with RAPTOR from scratch

LangChain
5 Mar 2024, 21:30

Summary

TLDR: The video discusses using long-context LLMs for projects like code assistants, highlighting their benefits and their limitations in latency, cost, and context size. It introduces Raptor, a recently published retrieval method that pairs well with long-context models: documents are clustered and summarized recursively to build a hierarchical document tree for robust retrieval. The approach is demonstrated using Anthropic's new model, Claude 3, and is aimed at document sets that exceed a model's context limit.

Takeaways

  • 📚 Introduction of long-context LLMs and the Raptor retrieval method
  • 🤖 Use of long-context LLMs in projects like code assistants for specific programming languages
  • 💡 Importance of considering the cost and latency in long-context LLMs
  • 🔄 Evaluation of 20 questions in different runs showing variance in performance and cost
  • 🌐 Comparison between long-context LLMs and RAG systems for retrieval
  • 📈 Discussing the p50 and p99 latency metrics for understanding performance
  • 🔢 Consideration of using a local LLM like Mistral 7B V2 with a 32,000 token context window
  • 🌳 Idea of building a document tree for efficient retrieval and summarization
  • 📊 Raptor's approach of recursively clustering and summarizing documents
  • 🔍 Utilizing KNN on embedded documents for retrieval without chunking
  • 🔗 Open-sourcing of Raptor's code and its potential integration with LLMs like Claude 3

Q & A

  • What is the main topic discussed by Lance from LangChain?

    -The main topic discussed by Lance is retrieval and long context LLMs, specifically focusing on a new method called Raptor.

  • What are some of the long context LLMs mentioned in the transcript?

    -The long context LLMs mentioned include Gemini and Claude 3. Lance also describes using a long context LLM to answer coding questions about the LangChain Expression Language docs.

  • What are the considerations Lance wanted to point out regarding the use of long context LLMs?

    -Lance pointed out considerations such as the latency and cost associated with using long context LLMs, as well as the limitations when dealing with documents larger than the context window.

  • What is the Raptor method and how does it aim to address the limitations of long context LLMs?

    -The Raptor method is a retrieval strategy that builds a document tree, creating higher-level summaries of clusters to consolidate information from different documents. This approach allows for the integration of information across various documents, addressing the challenge of retrieving comprehensive answers using traditional KNN methods.

  • How does the Raptor method handle the retrieval of information from multiple documents?

    -The Raptor method handles retrieval by building a document tree structure where each document or set of documents is summarized and indexed. This allows for the retrieval of information not only from individual documents but also from higher-level summaries, capturing information from multiple sources when needed.

  • What is the significance of using a Gaussian Mixture Model (GMM) in the Raptor method?

    -The GMM is used to model the distribution of clusters and documents across the clusters. It allows for the automatic determination of the optimal number of clusters and assigns probabilities to each document for cluster membership, enabling a document to belong to multiple clusters if appropriate.

  • What is the role of UMAP in the Raptor clustering process?

    -UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique. In the Raptor method the document embeddings are reduced to a lower-dimensional space before clustering, which the video says helps improve the clustering.

  • How does the Raptor method deal with the challenge of setting the K parameter in KNN retrieval?

    -The Raptor method mitigates the challenge of setting the K parameter in KNN retrieval by building a document tree that includes higher-level summaries. This structure allows for the integration of information from multiple documents, ensuring that the retrieved information is comprehensive even if the K parameter is not perfectly set.

  • What is the advantage of using the Raptor method for long context LLMs?

    -The Raptor method is advantageous for long context LLMs as it provides a lightweight and easy-to-use retrieval strategy that can handle documents larger than the model's context window. It also offers robustness in retrieving answers that may require information integration from multiple documents.

  • What model did Lance decide to use for summarizing individual documents in the Raptor method?

    -Lance decided to use Anthropic's new model, Claude 3, for summarizing individual documents in the Raptor method due to its strong performance and suitability for handling large documents.

  • How does the Raptor method ensure that the retrieved information is comprehensive and relevant?

    -The Raptor method ensures comprehensive and relevant information retrieval by creating a document tree with higher-level summaries and indexing both the raw documents and these summaries. This allows for the possibility of retrieving from a broad pool of information, including detailed raw documents and consolidated summaries, to answer a wide range of questions effectively.

Outlines

00:00

🤖 Introduction to Retrieval and Long Context LLMs

Lance from LangChain introduces the topic of retrieval and long context language models, specifically focusing on a new method called Raptor. He discusses the recent developments in long context LLMs like Gemini and Claude 3, and shares his experience using them for projects such as a code assistant. Lance highlights the benefit of long context models for answering coding questions: all the docs are placed in the context and the answer is generated directly, with no retrieval step. However, he also points out the need to consider the limitations and costs associated with using very long context models, leading to the exploration of retrieval strategies that can complement long context models.

05:01

📈 Evaluation and Considerations of Long Context LLMs

Lance presents an evaluation of long context LLMs by examining 20 questions and their generation times, noting the variance in latency and cost. He discusses the trade-offs between using long context models and RAG systems, highlighting the challenges of setting the K parameter for retrieval strategies. The conversation then shifts to the possibility of using a local LLM, such as Mistral 7B V2, and the considerations that arise when dealing with document sizes larger than the model's context window. Lance shares his thoughts on the need for a lightweight retrieval strategy that can handle larger documents while still utilizing long context.

10:02

🌐 Raptor: A Retrieval Strategy for Long Context LLMs

Lance introduces Raptor, a retrieval strategy from a recently published paper whose code has been open-sourced and which the LlamaIndex team has packaged as a LlamaPack. Raptor's approach involves clustering and summarizing documents at various levels, creating a document tree that allows for the integration of information from multiple sources. This method addresses the challenge of setting the K parameter in retrieval by building a hierarchy of document summaries, which can be used to retrieve information at different levels of abstraction. Lance emphasizes the potential of this approach to complement long context models and provide more robust retrieval across a range of questions.

15:04

📊 Implementing Raptor with Anthropic's New Model

Lance outlines the process of implementing Raptor using Anthropic's new model, Claude 3, to summarize individual documents without worrying about their size. He describes clustering and summarizing web pages related to the LangChain Expression Language, using a histogram to illustrate the token count distribution of the pages. Lance explains the clustering process, which involves a Gaussian mixture model (GMM), UMAP for dimensionality reduction, and local and global clustering analysis. The process allows documents to belong to multiple clusters, enhancing the retrieval strategy. He shares the code used for clustering and summarizing, and discusses the benefits of building a document tree for retrieval purposes.

20:07

🔍 Retrieval with Raptor and Long Context Indexing

Lance concludes by discussing the application of Raptor for retrieval, emphasizing its compatibility with long context models. He explains how the Raptor approach allows for indexing of both raw documents and higher-level summaries, enabling the retrieval of information at different levels of detail. Lance demonstrates the setup of a retrieval chain using an index that contains raw web pages and summaries, and shares the results of a retrieval attempt that combines both raw and summary pages. He acknowledges that while the full context for this specific case could be handled by a single model, the Raptor approach is beneficial for cases where document sizes exceed the model's context limit, making it a versatile solution for various scenarios.

Keywords

💡Retrieval

In the context of the video, retrieval refers to the process of sourcing relevant information from a database or set of documents in response to a query or question. It is a critical component in information management and is used in conjunction with long-context LLMs (Large Language Models) to provide detailed and accurate answers. An example from the script is the discussion around using RAG (Retrieval-Augmented Generation) systems versus long-context LLMs for answering coding questions, where retrieval is a key factor in determining the efficiency and accuracy of the response.

💡Long-Context LLMs

Long-Context LLMs are language models capable of processing and generating text with a large amount of context, typically more than standard models. These models are beneficial for tasks requiring an understanding of extensive background information or complex topics. In the video, the speaker discusses using long-context LLMs for a code assistant project, where the model generates answers to coding questions based on extensive documentation.

💡Raptor

Raptor is a newly introduced method for information retrieval that focuses on creating a document tree for efficient summarization and information consolidation. This approach allows for the building of higher-level summaries from clusters of documents, which can then be used to perform retrieval more effectively, especially when dealing with large documents that exceed the context limit of a model. It is presented as a lightweight strategy that complements long-context models.

💡Latency

Latency in the context of the video refers to the delay or waiting time experienced during the process of information retrieval or generation of responses by LLMs. The speaker discusses the p50 and p99 latency metrics, which indicate the 50th and 99th percentile latency times for a set of generations, highlighting the performance considerations when using long-context LLMs.
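
As a rough illustration of what p50 and p99 mean for a run of 20 generations, the sketch below computes both with the nearest-rank method. The latency values are made up for the example (they are not the numbers from the video), and the dashboard in the video may compute percentiles differently.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in seconds for 20 generations (illustrative only).
latencies = [31, 33, 34, 35, 35, 36, 37, 38, 39, 40,
             41, 42, 43, 44, 46, 48, 52, 60, 95, 420]

p50 = percentile_nearest_rank(latencies, 50)  # the typical generation
p99 = percentile_nearest_rank(latencies, 99)  # the slowest tail
print(f"p50={p50}s p99={p99}s")
```

With only 20 samples, the p99 is simply the slowest single generation, which is why one slow call can dominate that metric.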

💡Cost

In the context of the video, cost refers to the financial expenditure associated with running long-context LLMs for information retrieval and response generation. In the speaker's evaluation runs the cost ranged from about $1.00 to about $1.30 per generation, which illustrates the economic considerations when deploying such models.

💡Mistral 7B V2

Mistral 7B V2 is the speaker's go-to local LLM, which has a 32,000 token context window. It is used as an example of a model whose context window is too small for the speaker's documents, which total around 60,000 tokens, thus necessitating the exploration of alternative retrieval strategies like Raptor.
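
The arithmetic behind "my documents are about 2x my context window" can be sketched as below. The two token counts come from the video; the function names and the `reserved_for_answer` parameter are hypothetical helpers for illustration.

```python
def context_ratio(total_doc_tokens: int, context_window: int) -> float:
    """How many context windows the documents would occupy."""
    return total_doc_tokens / context_window

def fits_in_context(total_doc_tokens: int, context_window: int,
                    reserved_for_answer: int = 0) -> bool:
    """True if all documents (plus room for the answer) fit in one prompt."""
    return total_doc_tokens + reserved_for_answer <= context_window

DOC_TOKENS = 60_000      # approximate size of the docs, per the video
CONTEXT_WINDOW = 32_000  # context window of the local model, per the video

ratio = context_ratio(DOC_TOKENS, CONTEXT_WINDOW)
can_stuff = fits_in_context(DOC_TOKENS, CONTEXT_WINDOW)
print(f"ratio={ratio:.3f} can_context_stuff={can_stuff}")
```

Because the check fails, context stuffing is not an option for this model, and some retrieval step is needed.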

💡Document Level Indexing

Document Level Indexing is a retrieval strategy where entire documents are indexed and embedded directly, allowing for the retrieval of full documents in response to queries. This approach avoids the need for document chunking and can be used in conjunction with methods like KNN (K-Nearest Neighbors) for effective retrieval.
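
A minimal sketch of document-level KNN retrieval follows, assuming each full document has already been embedded into a vector. The three-dimensional vectors and document names are toy values invented for the example; a real setup would use an embedding model and a vector store.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_retrieve(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy embeddings: one vector per whole document, no chunking.
doc_vecs = {
    "lcel_intro": [0.9, 0.1, 0.0],
    "lcel_streaming": [0.7, 0.7, 0.1],
    "lcel_fallbacks": [0.1, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

top_docs = knn_retrieve(query_vec, doc_vecs, k=2)
print(top_docs)
```

The choice of `k` is the brittle part the video points out: if an answer needs three documents and `k` is 2, the third is never seen.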

💡Document Tree

A Document Tree is a hierarchical structure used in the Raptor method to organize and summarize documents. It involves creating clusters of documents, summarizing the information within those clusters, and then recursively summarizing the clusters to create higher-level abstractions. This structure allows for the integration of information from multiple documents and facilitates more robust retrieval.
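
The recursive cluster-then-summarize loop can be sketched as below. This is an illustration of the control flow only, not the paper's or the notebook's code: `toy_cluster` is a hard pairwise grouping standing in for the GMM-based clustering, and `toy_summarize` stands in for an LLM call. Both are hypothetical placeholders.

```python
from typing import Callable, Dict, List

def build_summary_tree(
    leaf_texts: List[str],
    cluster_fn: Callable[[List[str]], List[List[str]]],
    summarize_fn: Callable[[List[str]], str],
    n_levels: int = 3,
) -> Dict[int, List[str]]:
    """Cluster and summarize repeatedly until n_levels is reached or one cluster remains."""
    tree: Dict[int, List[str]] = {}
    current = list(leaf_texts)
    for level in range(1, n_levels + 1):
        clusters = cluster_fn(current)
        summaries = [summarize_fn(group) for group in clusters]
        tree[level] = summaries
        if len(clusters) <= 1:  # reached a single root summary
            break
        current = summaries     # the summaries become the next level's inputs
    return tree

def toy_cluster(texts: List[str]) -> List[List[str]]:
    """Placeholder: group neighbours in pairs (the real method clusters embeddings)."""
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]

def toy_summarize(group: List[str]) -> str:
    """Placeholder for an LLM summarization call."""
    return "summary(" + " + ".join(group) + ")"

leaves = ["page A", "page B", "page C", "page D"]
tree = build_summary_tree(leaves, toy_cluster, toy_summarize)
for level, summaries in tree.items():
    print(level, summaries)
```

The leaves plus every summary in `tree` would then be embedded and placed in one index, so retrieval can return either a raw page or a higher-level summary.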

💡GMM (Gaussian Mixture Model)

GMM, or Gaussian Mixture Model, is a statistical model used in the Raptor method for clustering documents. It estimates the optimal number of clusters by modeling the distribution of documents and their probabilities of belonging to different clusters. This approach allows for documents to belong to multiple clusters, enhancing the clustering process and the subsequent retrieval.
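
The video mentions that the number of clusters is chosen using the BIC (Bayesian Information Criterion). For reference, the standard definition is given below; the symbols are the usual textbook ones rather than notation from the video.

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
% k       : number of free parameters in the fitted mixture model
% n       : number of data points (here, embedded documents)
% \hat{L} : maximized value of the model's likelihood
```

A GMM is fitted for each candidate number of components, and the candidate with the lowest BIC is kept. The $k \ln n$ term penalizes models with more parameters, so adding clusters only pays off when it improves the likelihood enough.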

💡UMAP (Uniform Manifold Approximation and Projection)

UMAP is a dimensionality reduction technique used in the Raptor method to improve the clustering process. It reduces the dimensionality of the document embeddings while aiming to preserve their structure, and the clustering is then performed in that lower-dimensional space. According to the video, this improves the clustering and, in turn, the summarization steps that build on it.

💡Thresholding

Thresholding is a technique used in the Raptor method to assign cluster membership to documents based on their probabilities of belonging to different clusters. By applying a threshold, documents can be assigned to multiple clusters, which increases the flexibility and effectiveness of the clustering process.
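
A small sketch of thresholded soft assignment is shown below, assuming the mixture model has already produced one membership probability per document per cluster. The probability values and the 0.3 threshold are invented for the example. The second helper mirrors the step described in the video where the table is expanded so a document appears once per cluster it belongs to.

```python
def assign_clusters(probabilities, threshold):
    """For each document, keep every cluster whose probability exceeds the threshold."""
    return [[cluster for cluster, p in enumerate(row) if p > threshold]
            for row in probabilities]

def explode_assignments(doc_ids, assignments):
    """Flatten to one (doc_id, cluster) row per membership."""
    return [(doc_id, cluster)
            for doc_id, clusters in zip(doc_ids, assignments)
            for cluster in clusters]

# Hypothetical membership probabilities: rows are documents, columns are clusters.
probs = [
    [0.85, 0.10, 0.05],
    [0.45, 0.50, 0.05],  # plausibly belongs to clusters 0 and 1
    [0.05, 0.15, 0.80],
]

labels = assign_clusters(probs, threshold=0.3)
rows = explode_assignments(["doc_a", "doc_b", "doc_c"], labels)
print(labels)
print(rows)
```

Here `doc_b` lands in two clusters, so its text would be included in both cluster summaries.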

Highlights

Discussion of Retrieval and Long Context LLMs

Introduction of a new method called Raptor

Mention of the shift from RAG to long context LLMs like Gemini and Claude

Use of long context LLMs in projects such as a code assistant

Challenges with latency and cost in long context LLMs

Comparison between long context LLMs and RAG systems

Considerations for using long context LLMs

Evaluation of 20 questions using long context LLMs

Discussion on the possibility of using a local LLM like Mistral 7B V2

Proposal of minimalist splitting strategies for long context LLMs

Idea of indexing at the document level for retrieval

Concept of building a document tree for retrieval

Overview of the Raptor paper and its open-sourcing

Explanation of the clustering process in Raptor

Use of Gaussian Mixture Model (GMM) in Raptor's clustering

Integration of UMAP for dimensionality reduction in clustering

Description of the iterative summarization process in Raptor

Creation of a long-context index for retrieval using Raptor

Demonstration of retrieval using both raw documents and higher-level summaries

Potential application of Raptor for documents exceeding the context limit of LLMs

Transcripts

00:01

hi this is Lance from LangChain I'm

00:04

going to be talking about retrieval and

00:06

long context LLMs and a new method

00:08

called

00:09

Raptor so over the last few weeks

00:12

there's been a lot of talk about is RAG

00:14

dead with the advent of new long context

00:16

LLMs like Gemini a million tokens Claude

00:20

3 now with up to a million tokens it's

00:22

an interesting

00:24

question um I've recently been using

00:26

long context LLMs for certain projects

00:29

like for example this code

00:31

assistant that I put out last week

00:34

basically used a long context LLM to

00:36

answer coding questions about our docs

00:39

on LangChain Expression Language I'll kind of

00:42

zoom in here so you can see it um so

00:44

these are around 60,000 tokens of

00:46

context we take the question we take the

00:50

docs we produce an answer and this is

00:52

really nice no retrieval required just

00:56

context stuff all these docs and perform

00:58

answer generation directly

01:00

so I'm a big fan of using long context

01:02

LLMs in this

01:04

way but there are some considerations I

01:06

wanted to kind of point out

01:09

here so I ran evaluations and for those

01:12

evaluations I look at 20 questions um so

01:16

basically it's 20

01:17

generations now look here so this is the

01:20

LangSmith dashboard that I used for

01:23

those evals and you can see something

01:25

kind of interesting the p50 latency

01:28

tells you the 50th percentile latency

01:31

for each of those generations um so

01:33

again remember there's

01:35

20 so it's around 35 to you know 46

01:39

seconds depending on the the trial this

01:41

is on the same data set same 20

01:44

questions there's some variance run to

01:46

run so that's kind of expected and again

01:47

the P99 it's up to like okay 420 seconds

01:51

that's really long in that

01:52

case but maybe more interestingly if you

01:55

look at the cost again there's 20

01:57

questions so the cost is ranging from

02:00

maybe like a dollar to you know a dollar

02:02

a dollar thirty per

02:04

generation so you know cost and latency are

02:07

things to think about when you're

02:08

talking about using really long context

02:10

LLMs as opposed to like a RAG system

02:13

where you're performing

02:15

retrieval of much smaller more directed

02:17

chunks to your

02:19

question now the other thing that came

02:22

up is a lot of people asked hey can you

02:25

swap this out and use a local

02:28

LLM and my go-to local LLM is Mistral 7B

02:32

V2 which actually has a 32,000 token

02:36

context window but that's still a little

02:38

bit small relative to my docs which are

02:41

around 60,000 tokens so you know I

02:44

couldn't just context stuff them as I

02:47

did here so these three considerations

02:51

kind of led me to think

02:53

about I really like working with long

02:55

context models and it's absolutely going

02:57

to be the continuing thing but are there

03:00

retrieval strategies that are like

03:02

lightweight easy to use with long

03:04

context models um that kind of like

03:08

preserve the ability to utilize a lot of

03:11

context uh but can address some of these

03:14

limitations um in particular this last

03:17

piece was important because this is

03:19

something I want to do kind of in the

03:20

near term and I need kind of like a nice

03:22

lightweight retrieval strategy that

03:25

still uses long context but can operate

03:28

in cases where my documents are maybe

03:31

just a little bit bigger than my context

03:33

window in this case like around

03:35

2x so I kind of put this out on Twitter

03:37

and said hey has anyone come across like

03:40

good like maybe minimalist splitting

03:43

strategies for long context LLMs you know

03:45

like I wanted to do RAG with Mistral 7B with

03:48

a 32,000 token context window but my

03:50

docs are 60,000 tokens I can't just

03:53

context stuff them but I also don't want

03:57

some like very fine scale chunking type

03:59

thing like I get it we don't want to

04:01

mess with all that we want something

04:02

simple that just can like kind of work

04:04

across larger

04:06

documents so one point that was raised

04:08

which is a really good one is well just

04:13

um just index at the document level so

04:15

you can take full documents and just

04:17

embed them directly it's a fair point

04:20

and then you do something like KNN on

04:22

those embedded documents so again no

04:24

chunking of any documents no splitting

04:26

of documents you have your set of

04:28

documents embed each one and just retrieve

04:31

at the document level that's a pretty

04:33

good idea that's pretty

04:35

reasonable another idea that came up

04:38

though is this idea of building a

04:40

document tree and part of the reason for

04:43

that is when you talk about something

04:45

like KNN or like you know K nearest

04:48

neighbor retrieval on a set of embedded

04:50

documents it is true that sometimes an

04:52

answer requires maybe two or three

04:54

different documents kind of integrated

04:56

in order to answer it now if you context

04:59

stuff everything that's not a problem

05:01

because it's all there if you're doing

05:03

retrieval well you're setting your K

05:05

parameter to be some value it's kind of

05:08

brittle do you need to be like four or

05:10

five or six to capture all the context

05:12

needed for certain particular questions

05:14

so it's kind of hard to set that so this

05:17

idea of building a document tree is kind

05:19

of an interesting way to potentially

05:21

address this challenge with like basic

05:24

KNN so a paper Raptor came out recently

05:29

on this exact

05:30

idea um and their code recently open

05:33

sourced which led the folks at LlamaIndex

05:35

to come out with a LlamaPack for

05:36

it which is great um and the idea is

05:40

pretty interesting so I wanted to kind

05:42

of lay it out here and talk about how it

05:44

might benefit this exact case of kind of

05:46

long context

05:48

retrieval so the intuition is pretty

05:51

simple first we take a set of documents

05:55

now note that these documents can be any

05:58

sized so in their case they're just

06:00

chunks so they're like 100 tokens but it

06:04

doesn't matter so we start with a set of

06:06

raw documents now what we do is we embed

06:10

them and then we cluster them so this

06:13

clustering process groups together like

06:15

documents and then we do one important

06:18

thing we summarize information in that

06:20

cluster into what we call kind of like a

06:23

more abstract or higher level summary of

06:25

that

06:26

content and we do that recursively until

06:29

we end up with one cluster that's it so

06:32

what's happening is you're starting with

06:34

the set of what they call leaves or like

06:36

raw documents you do a

06:38

grouping uh via clustering you do a

06:41

summarization step you're kind of

06:42

compressing and then you do it again and

06:45

the idea is that these kind of midlevel

06:49

or eventually like root level or highest

06:51

level summaries can consolidate

06:53

information from different places in

06:54

your documents now what they do is they

06:58

basically just embed those summaries

07:00

along with the raw leaves and they

07:01

perform retrieval and we'll talk about

07:03

that a little bit later but what they

07:05

show is actually just doing retrieval on

07:08

all of these together like as a whole

07:10

pool performs best um and that's kind of

07:13

a nice result it's pretty easy then to

07:16

basically just index that and and use

07:19

it I will make a note that their paper

07:22

talked about you know these leaves being

07:24

chunks which I didn't love because look

07:26

I want to work with long context models

07:29

and like I don't want to deal with

07:30

chunking at all and I've replied you

07:33

know I replied to Jerry's tweet on this

07:35

and Jerry made a fair point that you

07:36

know this can scale to anything so for

07:40

example those leaves can be full

07:43

documents they don't have to be chunks

07:45

that's a completely reasonable point um so

07:48

again you can kind of think about this

07:49

as if idea one was let's just take each

07:52

document and embed it idea two is well

07:56

let's embed each document like we did

07:58

and we can also build kind of like a a

08:00

document abstraction tree on top and

08:02

embed those so we have these like higher

08:04

level summaries in our embeddings which

08:07

we can retrieve from if we need an

08:08

answer to consolidate information from like a

08:10

small set of documents right so it's a

08:13

little bit more robust maybe to this

08:15

problem which is that if I'm just doing

08:18

KNN on raw documents and I need

08:20

information from like two or three

08:21

documents I'm not guaranteed to always

08:24

get that because of this K parameter

08:26

that I set I'm only retrieving k docs

08:30

whereas here I'm building these docs

08:33

that contain information from multiple

08:36

leaves or multiple you know sub-docs so to

08:39

speak um and it can actually just

08:41

capture that information uh in in a in

08:44

kind of a a nice way um such that it can

08:49

it can basically integrate information

08:51

across different individual leaves or

08:53

individual documents so that's the key

08:56

point um and so you can kind of

08:59

see when you think about like working

09:01

with long context models of course context

09:03

stuffing is a great option if you can do

09:04

it but there are some other interesting

09:07

ideas one is actually just embedding

09:09

full documents and another is this idea

09:10

of again documents and an abstraction

09:13

tree so let's go ahead and just build

09:16

Raptor because it's pretty interesting

09:18

and to do this I'm actually going to

09:20

look at Claude 3 which just came

09:22

out today it's a new set of models from

09:24

Anthropic really strong performance and

09:26

should be really good for this use case

09:29

because what I want to do is I want to

09:31

perform summaries of individual

09:33

documents and I don't really want to

09:34

worry about the size of those

09:36

documents um so I'm going to use the

09:39

same set of documents that I previously

09:42

did with the code generation example

09:45

that video came out last week and I have

09:46

an empty notebook here um we just do

09:50

a few pip installs I'm setting a few

09:52

environment variables for LangSmith and now

09:54

I'm just going to say grab my docs so

09:57

that's right here

09:59

and this is going to grab around 33 web

10:02

pages of documentation for

10:04

LangChain related to LangChain Expression

10:06

Language okay and what I'm going to plot

10:09

here is a histogram of the token counts

10:11

of every page so a bunch are kind of

10:13

small that's fine easy to work with so

10:15

less than 2,000 tokens a few are pretty

10:17

big so up to like 12,000

10:20

tokens so that kind of gives you a sense

10:22

of the distribution of pages that we

10:23

want to work with and we're going to

10:25

apply this approach to those pages um

10:28

now I'm going to use Anthropic's new

10:30

model to do that um and I'll use OpenAI

10:32

embeddings so that's fine I set those

10:35

and now what I'm going to do so this

10:38

code was released uh by the authors of

10:40

the paper and I'm going to explain how

10:42

this works in a little bit but for right

10:45

now I'm just going to copy this over and

10:46

this is all going to be accessible to

10:47

you in the notebook that we're going to

10:48

make public uh so this is all the

10:51

clustering code and we're going to talk

10:52

about what it's doing later I added

10:54

comments and docstrings to this um so

10:57

it's it's a little bit more

10:58

understandable

10:59

here's some code that I wrote um that

11:01

basically is going to do like

11:03

orchestrate the process of the cluster

11:06

summarize um and then like iteratively

11:08

perform that until you end up with a

11:10

single

11:11

cluster um so there we go I'm going to

11:14

copy this code over and I'm going to

11:16

kick this process off and then I'm going

11:18

to walk through it while this is running

11:20

so that's running now now first I want

11:23

to kind of explain how this clustering

11:24

process works it's it's kind of

11:26

interesting um so the idea actually

11:31

incorporates three important actually

11:33

four important

11:35

points so it's using this GMM this

11:37

Gaussian mixture model to model the

11:40

distribution of the different clusters

11:43

so what's kind of cool about this

11:45

approach is that you don't actually tell

11:47

it group the data into some number of

11:50

clusters like you do in some other approaches

11:52

you kind of set the parameter you want

11:54

like n clusters here um it actually can

11:57

kind of infer or determine the

11:59

optimal number of clusters and it uses

12:01

this like BIC again you can dig into

12:03

this in more detail if you want but the

12:05

intuition is that uh this approach will

12:08

kind of guess or attempt to to determine

12:10

the number of clusters automatically for

12:13

you um and it's also modeling the

12:16

distribution of your individual

12:18

documents across the

12:20

clusters um it uses this UMAP or

12:24

basically it's a dimensionality

12:25

reduction approach to improve the

12:27

clustering process so if you want to

12:29

like really read into this then you

12:31

should actually just go and do that um

12:33

the intuition is that this actually

12:35

helps improve

12:36

clustering um it also does clustering

12:39

what they call like local and global so

12:41

it tries to analyze the data at two

12:43

different scales um like kind of look at

12:46

like patterns kind of within smaller

12:48

groups and then like within the full

12:50

data set to try to improve how you're

12:52

going to group these documents uh

12:54

together and it applies thresholding to

12:59

assign the basically the group

13:01

assignment for every document or the

13:03

cluster assignment for every document so

13:05

this is really the

13:07

idea here's all my documents let's look

13:10

at this one what's happening is it's

13:12

using this GMM to basically assign a

13:15

probability that this document belongs

13:17

to each one of our clusters so like

13:20

here's cluster one here's cluster two

13:22

here's cluster three each cluster will

13:24

get a

13:25

probability and this thresholding then

13:27

is applied to those

13:29

probabilities such that a document can

13:32

actually belong to more than one cluster

13:35

so that's actually really nice cuz in a

13:37

lot of other approaches it's kind of

13:39

mutually exclusive so document can only

13:41

live in one or another cluster but with

13:43

this approach it can actually belong to

13:44

multiple clusters so that's like a nice

13:46

benefit of this

13:48

approach um I think that's kind of all I

13:51

want to say initially about this

13:53

clustering strategy uh but you should

13:56

absolutely have a look at the paper

13:58

which I uh will also ensure that we

14:01

link um so right now let's actually go

14:05

look at the code so we can see that it's

14:07

performing this cluster

14:09

generation and let's actually look at

14:11

kind of what it's doing so it really

14:14

follows what we just talked

14:16

about we have a set of texts and in this

14:20

case my texts are just those you

14:22

know those 33 web pages uh that I'm

14:26

passing in so we can actually look at

14:27

that so what I passed in these leaf texts

14:31

leaf texts I define as my docs again

14:33

let's actually go back and look at our

14:35

diagram so we can like follow

14:37

along uh here we go so these leaves are

14:41

my web pages that's it so here's my leaf

14:44

text and you can see let's look at the

14:46

length there uh okay there's 31 of them

14:50

so that's

14:51

fine um and what's happening is those

14:57

first

14:59

get embedded as

15:01

expected and then so here's the

15:04

embeddings and then they get clustered

15:06

and this perform clustering is taken

15:08

directly from basically the results uh

15:10

or the code provided by the authors of

15:12

the paper so it's doing that process I

15:15

just talked about um of clustering

15:19

basically cluster assignment um and we

15:22

get our cluster labels out we put those

15:24

in a data frame, and so then we

15:28

have our clusters you can see that here

15:31

and because each document can belong to

15:34

more than one cluster we actually expand

15:36

out the data frame um so that the

15:40

cluster column may contain

15:45

duplicates uh for a single document so

15:48

one document can live in multiple

15:50

clusters and we just flatten it out to

15:51

show that then all we do is we get the

15:55

whole list of clusters here, then

15:58

Define a summarization prompt pretty

16:00

simple and all we do is we have our data

16:02

frame: just fish out, give me all the

16:05

texts within each cluster and that's all

16:07

we're doing here for each cluster get

16:10

all the text, plumb it into our

16:12

summarization prompt generate the

16:14

summary here's our summary data frame

16:17

that's really it so again iterate

16:19

through our clusters get the text for

16:22

every cluster summarize it write that

16:25

out to a data frame and that's all we do
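The single level just described (threshold the cluster probabilities, expand so one document can sit in several clusters, then summarize each cluster's texts) can be sketched roughly as below. This is a minimal illustration, not the notebook's code: the probability matrix stands in for what a soft clustering model would return, the 0.3 threshold is an assumed value, plain lists of dicts stand in for the data frames, and `toy_summarize` is a hypothetical placeholder for the LLM summarization call.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def assign_clusters(probs: List[List[float]], threshold: float) -> List[List[int]]:
    """For each document, keep every cluster whose probability exceeds the threshold."""
    return [[c for c, p in enumerate(row) if p > threshold] for row in probs]


def expand_rows(texts: List[str], labels: List[List[int]]) -> List[Dict]:
    """One row per (text, cluster) membership, so a text may appear in several rows."""
    return [{"text": t, "cluster": c} for t, cs in zip(texts, labels) for c in cs]


def summarize_clusters(rows: List[Dict], summarize: Callable[[List[str]], str]) -> List[Dict]:
    """Collect the texts of each cluster and summarize them."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row["text"])
    return [{"cluster": c, "summary": summarize(grouped[c])} for c in sorted(grouped)]


def toy_summarize(texts: List[str]) -> str:
    # Hypothetical stand-in for an LLM call with a summarization prompt.
    return "summary of: " + " + ".join(texts)


texts = ["page about retrievers", "page about prompts", "page about chains"]
# Assumed per-document cluster probabilities (rows sum to 1).
probs = [
    [0.85, 0.10, 0.05],
    [0.45, 0.50, 0.05],  # close to two clusters, so it lands in both
    [0.02, 0.08, 0.90],
]

labels = assign_clusters(probs, threshold=0.3)
rows = expand_rows(texts, labels)
summaries = summarize_clusters(rows, toy_summarize)

for s in summaries:
    print(s["cluster"], "->", s["summary"])
```

Note that the second page contributes to the summaries of both cluster 0 and cluster 1, which is the non-exclusive membership mentioned earlier.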

16:28

here's our cluster data frame here's our

16:30

summary data frame from that function um

16:33

and this is just orchestrating that

16:35

process of like iteration so we just

16:37

keep doing this until I provide like a

16:41

level or n levels parameter you can say

16:42

do this n times, or, you know,

16:48

the number of clusters is equal to

16:51

one. So basically this is saying

16:54

continue until either we've done n

16:56

levels, or n number of

16:59

attempts in our tree, or the number of

17:02

clusters is one keep doing that and

17:04

that's it so you can see we've actually

17:05

run that process we have our results

17:08

here, and now what we can do is pretty

17:12

simply um we can just put those

17:15

into an index, like we can use Chroma

17:18

as a vector store um so here's just some

17:21

really simple code to do that: we're just

17:22

iterating through our results we're

17:24

getting all our summaries out so first

17:27

maybe I should make this a little bit more

17:28

clear we take all those raw

17:31

documents and we

17:34

add those to our text lists we then get

17:36

all of our summaries from our tree we

17:38

add those and we just index all of them

17:40

so let's do that. So these are

17:43

all going to be added to Chroma, and

17:46

finally we can set up a retrieval

17:48

chain that is

17:50

using this index which contains both our

17:53

leaves so all those raw web pages and

17:56

these higher level summary pages that's

17:58

all that's happening here we pull in a

18:00

RAG prompt, here's our retriever,

18:04

here's our question so let's give this a

18:05

shot so this is running and I want to

18:08

just bring you back to the diagram so

18:10

again you can kind of see what's going

18:11

on

18:12

here

18:14

um right here so again we took our web

18:19

pages uh again 31 of them we cluster

18:23

them we summarize them we do that

18:25

iteratively um then what we do is we

18:28

take those summaries that we generated

18:31

along with the raw web pages and we

18:33

index all of them that's it and we can

18:36

use that index for retrieval so this is

18:38

like a nice what we might call a long

18:40

context index because it contains just

18:43

raw web pages which vary from 2,000 to

18:45

12,000 tokens and it contains in our

18:48

case these higher level summaries in

18:51

case we need an integration of

18:52

information across those pages um which

18:55

may or may not be captured just using

18:58

kNN retrieval. So that's the big idea.
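To make the overall flow concrete (cluster and summarize iteratively until a level limit or a single cluster is reached, then put the raw pages and every summary into one index), here is a rough, self-contained sketch. Everything in it is an assumption for illustration: `pair_clusters` is a toy grouping rule rather than the paper's clustering, `toy_summarize` stands in for the LLM, and `top_k` uses word overlap where a real setup would use embedding similarity in a vector store such as Chroma.

```python
from typing import Callable, Dict, List


def build_tree(
    leaf_texts: List[str],
    cluster_fn: Callable[[List[str]], List[List[str]]],
    summarize_fn: Callable[[List[str]], str],
    n_levels: int = 3,
) -> Dict[int, List[str]]:
    """Cluster and summarize repeatedly; stop after n_levels or at a single cluster."""
    results: Dict[int, List[str]] = {}
    current = list(leaf_texts)
    for level in range(1, n_levels + 1):
        clusters = cluster_fn(current)
        summaries = [summarize_fn(group) for group in clusters]
        results[level] = summaries
        if len(clusters) == 1:
            break
        current = summaries  # the next level clusters the summaries
    return results


def flatten(leaf_texts: List[str], results: Dict[int, List[str]]) -> List[str]:
    """Raw leaves plus every summary from every level, ready to be indexed together."""
    all_texts = list(leaf_texts)
    for level in sorted(results):
        all_texts.extend(results[level])
    return all_texts


def pair_clusters(texts: List[str]) -> List[List[str]]:
    # Toy rule: group neighbours in pairs (a placeholder for real clustering).
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]


def toy_summarize(texts: List[str]) -> str:
    # Hypothetical stand-in for an LLM summarization call.
    return "summary of: " + " + ".join(texts)


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap, used here in place of embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def top_k(query: str, texts: List[str], k: int = 2) -> List[str]:
    """Return the k texts most similar to the query."""
    return sorted(texts, key=lambda t: overlap(query, t), reverse=True)[:k]


leaves = ["retriever docs", "prompt docs", "chain docs", "agent docs"]
tree = build_tree(leaves, pair_clusters, toy_summarize, n_levels=3)
all_texts = flatten(leaves, tree)
hits = top_k("retriever docs", all_texts, k=2)
print(len(all_texts), hits[0])
```

Because leaves and summaries share one index, a query can match a raw page, a summary, or both, which is the behaviour checked in the trace next.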

19:01

okay this ran we got our answer we can

19:03

check LangSmith and we can see here's

19:07

our

19:07

retriever um and let's see here's the

19:11

raw

19:12

documents so it looks like it retrieved

19:15

some kind of higher level summaries as

19:18

well as some raw leaves, so this is like a

19:21

raw web page and then some of these are

19:24

more like summary pages, which looks like

19:26

we produced so what's kind of cool about

19:29

this is you can retrieve from a

19:31

combination of like your raw Pages as

19:35

well as these higher level summaries

19:37

which gives you some robustness and

19:38

semantic coverage for different types

19:40

of questions that require like different

19:42

resolutions of abstraction or

19:44

detail. For a really detailed

19:47

code question you might retrieve

19:50

directly from your raw pages, but for a

19:52

higher level question that integrates

19:54

information from a bunch of pages you

19:56

might retrieve from these mid-level or

19:58

even top level summaries so it's a cool

20:00

approach it integrates kind of nicely

20:03

with long context models and I know one

20:06

thing that will come up here is well

20:08

look your full context was only 60,000

20:11

tokens you could just stuff all of that

20:13

into one of these models you didn't need

20:15

to do any of this that is absolutely

20:18

true for this case but what I think the

20:20

high level point is that's not true for

20:22

every case for example this exact set of

20:26

documents I want to use with Mistral. Mistral

20:30

is only 32,000 tokens, so this is a

20:34

really nice approach for that case where

20:36

I can kind of guarantee that I can index

20:39

across all these pages but I won't

20:41

exceed the context limit, or am unlikely

20:43

to exceed the context limit of my LLM,

20:46

because none of these individual Pages

20:48

exceed 32,000 tokens so you know you can

20:51

see, and again this scales arbitrarily

20:54

large so it is true that this set of

20:55

documents is only 62,000 tokens

20:58

but of course there's much larger

20:59

corpuses which could extend beyond even

21:01

the 200,000 of Claude 3, in which case this

21:05

type of approach of kind of indexing

21:07

across documents um and building these

21:10

like kind of mid-level high level

21:11

summaries can be applicable so it's a

21:14

cool method it's a neat paper um I

21:16

definitely encourage you to experiment

21:18

with it um and all this code will be

21:21

available for you to work with, and

21:25

I think that's about it. Thanks very

21:28

much