Building long context RAG with RAPTOR from scratch

LangChain
5 Mar 2024 · 21:30

Summary

TL;DR: Lance discusses the challenges long-context language models (LLMs) face when retrieving over and processing large document sets. He introduces a new method, Raptor, which uses clustering and summarization to build a document tree for effective retrieval and consolidation of information. The approach makes it possible to work with document collections larger than a single model's context window without exceeding the model's context limit. Lance demonstrates Raptor's effectiveness on long documents through experiments and shares code and tools so viewers can try the approach themselves.

Takeaways

  • 🔍 Lance introduces a new method called Raptor for retrieval over long documents with long-context language models (LLMs).
  • 📈 He discusses the strengths and cost considerations of long-context LLMs such as Gemini and Claude 3.
  • 📝 Lance uses a long-context LLM to build a code assistant that generates answers directly, with no retrieval step.
  • ⏱️ He covers the cost and latency of long-context generation and how it compares to a RAG system.
  • 🤔 Lance explores the possibility of using a local LLM, such as Mistral 7B v0.2 with its 32,000-token context window.
  • 🌳 He proposes building a document tree to address the challenges of retrieval over long documents.
  • 📚 The Raptor method builds a document hierarchy through clustering and recursive summarization.
  • 🔢 Raptor uses a Gaussian Mixture Model (GMM) to determine the optimal number of clusters.
  • 🔄 Lance shows how to use Anthropic's newly released model for document summarization and retrieval.
  • 🔍 He stresses that Raptor applies to large document sets that exceed an LLM's context limit.
  • 📊 Lance demonstrates how Raptor retrieval draws on both raw documents and summary documents.
  • 💡 He encourages viewers to try Raptor and provides the code for further exploration.

Q & A

  • Which long-context LLMs does Lance mention in the video?

    -Lance mentions Gemini and Anthropic's Claude 3.

  • What does Lance use long-context LLMs for?

    -Lance uses a long-context LLM to build a code assistant that answers coding questions about the LangChain Expression Language (LCEL) documentation.

  • What factors does Lance consider when evaluating long-context LLMs?

    -He looks at p50 latency, p99 latency, and cost.

  • What is the Raptor method Lance describes?

    -Raptor is a new retrieval strategy that embeds and clusters documents, then recursively summarizes them to build a document tree that can be used for retrieval with long-context LLMs.

  • How does Raptor help address the limitations of long-context LLMs?

    -By building a document tree of summaries, Raptor supports retrieval without splitting documents and can consolidate information across multiple documents, addressing problems that traditional retrieval methods such as KNN run into on long documents.

  • What implementation details of Raptor does Lance mention?

    -Lance mentions using a Gaussian Mixture Model (GMM) to determine the number of clusters, UMAP for dimensionality reduction, and probability thresholding during clustering so that a document can belong to more than one cluster (a minimal code sketch of this clustering step follows this Q&A list).

  • How does Lance handle documents that exceed a long-context LLM's context window?

    -Lance uses the Raptor method: by embedding full documents and building a document tree, retrieval stays effective even when the total document size exceeds the LLM's context window.

  • What is the KNN that Lance mentions in the video?

    -KNN (k-nearest neighbors) is a distance-based retrieval method that returns the k documents most similar to the query.

  • What is notable about Anthropic's Claude 3 model mentioned in the video?

    -Claude 3 is a newly released Anthropic model with strong performance, well suited to summarizing long documents.

  • What are the steps Lance describes for building the document tree?

    -Embed the raw documents, group them via clustering, summarize each cluster, and repeat these steps recursively until a single cluster remains.

  • What advantage of document-tree retrieval does Lance mention?

    -It can consolidate information from multiple documents, providing a more robust way to handle questions that require integrating information across documents while adapting to the retrieval needs of different question types.
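
To make the clustering details above concrete, here is a minimal sketch of GMM-based soft clustering in the spirit of the Raptor approach. It is not the paper's exact code: it assumes the umap-learn and scikit-learn packages are installed, picks the cluster count by minimizing BIC, and uses an arbitrary 0.1 probability threshold so a document can land in several clusters.

```python
import numpy as np
import umap  # assumes the umap-learn package is installed
from sklearn.mixture import GaussianMixture

def cluster_embeddings(embeddings: np.ndarray, max_clusters: int = 10, threshold: float = 0.1):
    """Soft-cluster document embeddings; returns per-document cluster lists and the cluster count."""
    # Dimensionality reduction before fitting the GMM tends to improve clustering of text embeddings.
    reduced = umap.UMAP(n_neighbors=10, n_components=10, metric="cosine").fit_transform(embeddings)

    # Let BIC choose the number of clusters instead of fixing it up front.
    candidates = list(range(1, min(max_clusters, len(embeddings)) + 1))
    bics = [GaussianMixture(n_components=n, random_state=0).fit(reduced).bic(reduced) for n in candidates]
    best_n = candidates[int(np.argmin(bics))]

    gmm = GaussianMixture(n_components=best_n, random_state=0).fit(reduced)
    probs = gmm.predict_proba(reduced)  # shape: (n_docs, best_n)

    # Thresholding gives soft assignment: one document may belong to several clusters.
    labels = [np.where(p > threshold)[0].tolist() for p in probs]
    return labels, best_n
```

Keeping the assignment soft via the threshold is what lets one page contribute to several cluster summaries, which is the property the Q&A answer above calls out.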

Outlines

00:00

🗣️ Lance on long-context LLMs and the Raptor method

Lance describes using long-context language models (LLMs) in projects such as a code assistant, which uses a long-context LLM to generate answers directly with no retrieval step. He covers considerations for using long-context LLMs, including cost and response time, and whether a local LLM could be swapped in. He then introduces Raptor, a new lightweight retrieval strategy suited to long-context models.

05:01

🔍 Exploring Raptor: a retrieval strategy for long context

Raptor is a new method that embeds and clusters documents, then recursively summarizes them to build a document tree. It can consolidate information across multiple documents, addressing the problem that KNN retrieval may not surface everything a question needs. Raptor retrieves over both the embedded documents and the embedded summaries, showing how information can be integrated when working with long-context models.

10:02

📊 Analyzing the document distribution and applying Raptor

Lance analyzes the distribution of token counts across a set of documents and uses Anthropic's newly released model for summarization (with OpenAI embeddings). He explains the clustering process, including using a Gaussian Mixture Model (GMM) to determine the number of clusters, and how UMAP and probability thresholding improve the clustering. The thresholding allows a document to belong to multiple clusters, which adds flexibility.
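
A rough illustration of the token-count histogram step, assuming tiktoken's cl100k_base encoding and matplotlib as stand-ins for whatever the notebook actually uses:

```python
import tiktoken
import matplotlib.pyplot as plt

def plot_token_counts(pages: list[str]) -> list[int]:
    """Plot a histogram of per-page token counts for a list of documentation pages."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding, not necessarily the video's
    counts = [len(enc.encode(p)) for p in pages]
    plt.hist(counts, bins=20)
    plt.xlabel("Tokens per page")
    plt.ylabel("Number of pages")
    plt.title("Token counts across the documentation pages")
    plt.show()
    return counts
```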

15:04

🛠️ Implementing Raptor and building the index

Lance applies the Raptor method to real documents, building the document tree by iteratively clustering and summarizing. He then creates an index containing the raw documents along with the summaries from every level of the tree, so retrieval can draw on both raw pages and summary pages and serve different kinds of questions.
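
The iterative cluster-and-summarize loop might look roughly like the sketch below. It reuses the cluster_embeddings sketch from the Q&A section above; the prompt wording, the claude-3-opus-20240229 model name, and the three-level cap are illustrative assumptions rather than the notebook's exact values.

```python
import numpy as np
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings()
summarize_chain = (
    ChatPromptTemplate.from_template("Give a detailed summary of the documentation below:\n\n{context}")
    | ChatAnthropic(model="claude-3-opus-20240229")  # assumed model name
    | StrOutputParser()
)

def build_tree(leaf_texts: list[str], n_levels: int = 3) -> dict[int, list[str]]:
    """Level 0 holds the raw pages; each higher level holds one summary per cluster."""
    tree, texts = {0: leaf_texts}, leaf_texts
    for level in range(1, n_levels + 1):
        embeddings = np.array(embedder.embed_documents(texts))
        labels, n_clusters = cluster_embeddings(embeddings)  # sketch defined after the Q&A section
        summaries = []
        for c in range(n_clusters):
            members = [t for t, lab in zip(texts, labels) if c in lab]  # soft assignment may overlap
            summaries.append(summarize_chain.invoke({"context": "\n\n---\n\n".join(members)}))
        tree[level] = summaries
        if n_clusters <= 1:  # stop once everything collapses into a single cluster
            break
        texts = summaries  # recurse on the summaries at the next level
    return tree
```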

20:07

📘 Applicability of Raptor and a look ahead

Lance stresses that although these documents total only about 60,000 tokens and could have been context-stuffed directly in this case, the Raptor approach remains applicable to larger document collections that exceed an LLM's context limit. He encourages people to try the method and notes that all the related code will be made public for experimentation.

Keywords

💡 Retrieval

In the video, retrieval means quickly finding the information relevant to a question within a large set of documents, typically by using specific algorithms or models to locate and extract it. For example, Lance discusses retrieval with long-context LLMs and how the Raptor method optimizes this process.

💡 Long-Context LLMs

Long-context models are AI models that can process and reason over large amounts of text. In the video, Lance uses them to answer questions: in the code assistant project, the model generates answers directly over a large amount of stuffed context, with no separate retrieval step.
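
A minimal sketch of the "no retrieval, just stuff the context" pattern this keyword refers to, assuming LangChain's ChatAnthropic wrapper; the prompt text and model name are illustrative, not the code assistant's actual ones.

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

def answer_with_full_context(docs: list[str], question: str) -> str:
    """Concatenate every page into one prompt and generate the answer directly."""
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the documentation below.\n\n"
        "Documentation:\n{context}\n\nQuestion: {question}"
    )
    chain = prompt | ChatAnthropic(model="claude-3-opus-20240229") | StrOutputParser()  # assumed model
    return chain.invoke({"context": "\n\n".join(docs), "question": question})
```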

💡 Raptor

Raptor is a new method for processing and retrieving information from long texts. It organizes and summarizes information by building a document tree, so that information can be retrieved and consolidated effectively even when the documents exceed the model's context window.

💡 Document Tree

A document tree is a data structure for organizing and representing the relationships between documents. In the Raptor method, the tree is built by recursively clustering and summarizing documents, producing summaries at different levels that help the model understand and retrieve information.

💡 Clustering

Clustering is a statistical process that groups data so that points within the same group (or cluster) are more similar to each other than to points in other groups. In the video, clustering groups documents so their information can be summarized and retrieved more effectively.

💡 Summarization

Summarization extracts the key information from a long text and produces a shorter version. In the Raptor method, summaries form the higher-level nodes of the document tree; each one condenses content from multiple documents, which helps the model pull that information together when it is needed.

💡 Context Window

The context window is the amount of text a model can process at once. For long-context models, the window size determines how much text the model can read and generate over in a single pass. The video discusses how the Raptor method handles document sets that exceed the context window.

💡 K-Nearest Neighbors (KNN)

K-nearest neighbors (KNN) is an instance-based supervised learning algorithm used for classification and regression. In the video, KNN is mentioned as a retrieval strategy: finding the documents in an embedded collection that are most similar to the query.
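
A small sketch of the "embed full documents, no chunking, then KNN" idea raised in the video, assuming LangChain's Chroma and OpenAIEmbeddings wrappers; k=4 is an arbitrary illustrative value.

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

def knn_over_full_documents(docs: list[str], query: str, k: int = 4) -> list[str]:
    """Embed whole pages and return the k most similar ones for a query."""
    vectorstore = Chroma.from_texts(texts=docs, embedding=OpenAIEmbeddings())
    hits = vectorstore.similarity_search(query, k=k)
    return [hit.page_content for hit in hits]
```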

💡 Cost

Here, cost means the resources consumed when using long-context models for retrieval and answer generation, such as compute and spend. Lance notes that across the 20-question evaluation, cost ran roughly $1 to $1.30 per generation.

💡 Performance

Performance here refers to how efficiently and effectively a model handles a given task. Lance discusses the performance of different models and approaches on long documents, including response time and cost.

Highlights

Lance from LangChain discusses retrieval and long context LLMs.

New method called Raptor is introduced for long context LLMs.

Long context LLMs like Gemini and Claude 3 can handle up to a million tokens.

Lance used long context LLMs for a code assistant project.

Long context LLMs can directly perform answer generation without retrieval.

Evaluations were run on 20 questions using long context LLMs.

P50 latency and P99 latency were measured for the evaluations.

Cost per generation for long context LLMs was discussed.

Considerations for using long context LLMs versus RAG systems were highlighted.

Mistral 7B v0.2, a local LLM with a 32,000-token context window, was mentioned.

The idea of indexing at the document level was proposed for retrieval.

Building a document tree for retrieval was suggested to address limitations.

Raptor's approach to clustering and summarizing documents was explained.

Raptor allows for embedding full documents and building a document abstraction tree.

The process of clustering, summarizing, and indexing documents was demonstrated.

Raptor's method can integrate information across different documents for retrieval.

The Raptor approach is applicable for cases where document size exceeds the context limit of LLMs.

The Raptor code and process will be made public for experimentation.
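
Putting the highlights above together, a hedged end-to-end sketch: index the raw pages plus every level of summaries in one Chroma collection, then answer questions with a simple RAG chain. It reuses the build_tree sketch from the Outlines section; the prompt, model name, and example question are illustrative.

```python
from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings

def build_raptor_chain(leaf_texts: list[str]):
    """Index leaves plus summaries together and return a retrieval-augmented answer chain."""
    tree = build_tree(leaf_texts)  # sketch from the Outlines section
    all_texts = [t for level_texts in tree.values() for t in level_texts]
    retriever = Chroma.from_texts(all_texts, OpenAIEmbeddings()).as_retriever()

    prompt = ChatPromptTemplate.from_template(
        "Answer the question using the retrieved context.\n\nContext:\n{context}\n\nQuestion: {question}"
    )

    def format_docs(docs):
        return "\n\n".join(d.page_content for d in docs)

    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | ChatAnthropic(model="claude-3-opus-20240229")  # assumed model name
        | StrOutputParser()
    )

# chain = build_raptor_chain(leaf_texts)
# chain.invoke("How do I compose two runnables with LCEL?")
```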

Transcripts

00:01

hi this is Lance from LangChain I'm

00:04

going to be talking about retrieval and

00:06

long context llms and a new method

00:08

called

00:09

Raptor so over the last few weeks

00:12

there's been a lot of talk about is rag

00:14

Dead with the Advent of new long context

00:16

llms like Gemini a million tokens Claude

00:20

3 now with up to a million tokens it's

00:22

an interesting

00:24

question um I've recently been using

00:26

long Contex llms for certain projects

00:29

like like for example this code

00:31

assistant that I put out last week

00:34

basically used a long context llm to

00:36

answer coding questions about our docs

00:39

on LangChain Expression Language I'll kind of

00:42

zoom in here so you can see it um so

00:44

these are around 60,000 tokens of

00:46

context we take the question we take the

00:50

docs we produce an answer and this is

00:52

really nice no retrieval required just

00:56

context stuff all these docs and perform

00:58

answer generation directly

01:00

so I'm a big fan of using long context

01:02

llms in this

01:04

way but there are some considerations I

01:06

wanted to like to to kind of point out

01:09

here so I ran evaluations and for those

01:12

evaluations I look at 20 questions um so

01:16

basically it's 20

01:17

Generations now look here so this is the

01:20

LangSmith dashboard that I used for

01:23

those evals and you can see something

01:25

kind of interesting the p50 latency

01:28

tells you the 50th percentile latency

01:31

for each of those Generations um so

01:33

again remember there's

01:35

20 so it's around 35 to you know 46

01:39

seconds depending on the the trial this

01:41

is on the same data set same 20

01:44

Questions there's some variance run to

01:46

run so that's kind of expected and again

01:47

the P99 it's up to like okay 420 seconds

01:51

that's really long in that

01:52

case but maybe more interestingly if you

01:55

look at the cost again there's 20

01:57

questions so the cost is ranging from

02:00

maybe like a dollar to you know a dollar

02:02

a dollar like 30 per

02:04

generation so you know cost and latency are

02:07

things to think about when you're

02:08

talking about using really long context

02:10

llms as opposed to like a rag system

02:13

where you're per you're performing

02:15

retrieval of much smaller more directed

02:17

chunks to your

02:19

question now the other thing that came

02:22

up is a lot of people asked hey can you

02:25

swap this out and use a local

02:28

llm and my go-to local llm is Mistral 7B

02:32

V2 which actually has a 32,000 token

02:36

context window but that's still a little

02:38

bit big relative to my docs which are

02:41

around 60,000 tokens so you know I

02:44

couldn't just context stuff them as I

02:47

did here so these three considerations

02:51

kind of led me to think

02:53

about I really like working with long

02:55

context models and it's absolutely going

02:57

to be the continuing thing but are there

03:00

retrieval strategies that are like

03:02

lightweight easy to use with long

03:04

context models um that kind of like

03:08

preserve the ability to utilize a lot of

03:11

context uh but can address some of these

03:14

limitations um in particular this last

03:17

piece was important because this is

03:19

something I want to do kind of in the

03:20

near term and I need kind of like a nice

03:22

lightweight retrieval strategy that

03:25

still uses long context but can operate

03:28

in cases where my documents are maybe

03:31

just a little bit bigger than my context

03:33

window in this case like around

03:35

2x so I kind of put this out on Twitter

03:37

and said hey has anyone come across like

03:40

good like maybe minimalist splitting

03:43

strategies for long context LLMs you know

03:45

like I wanted to do RAG with Mistral 7B with

03:48

a 32,000 token context window but my

03:50

docs are 60,000 tokens I can't just

03:53

context stuff them but I also don't want

03:57

some like very fine scale chunking

03:59

thing like I get it we don't want to

04:01

mess with all that we want something

04:02

simple that just can like kind of work

04:04

across larger

04:06

documents so one point that was raised

04:08

which is a really good one is well just

04:13

um just index at the document level so

04:15

you can take full documents and just

04:17

embed them directly it's a fair point

04:20

and then you do something like KNN on

04:22

those embedded documents so again no

04:24

chunking of any documents no splitting

04:26

of documents you have your set of

04:28

documents embedded one and just retrieve

04:31

at the document level that's a pretty

04:33

good idea that's pretty

04:35

reasonable another idea that came up

04:38

though is this idea of building a

04:40

document tree and part of the reason for

04:43

that is when you talk about something

04:45

like KNN or like you know K nearest

04:48

neighbor retrieval on a set of embedded

04:50

documents it is true that sometimes an

04:52

answer requires maybe two or three

04:54

different documents kind of integrated

04:56

in order to answer it now if you context

04:59

stuff everything that's not a problem

05:01

because it's all there if you're doing

05:03

retrieval well you're setting your K

05:05

parameter to be some value it's kind of

05:08

brittle do you need to be like four or

05:10

five or six to capture all the context

05:12

needed for certain particular questions

05:14

so it's kind of hard to set that so this

05:17

idea of building a document tree is kind

05:19

of an interesting way to potentially

05:21

address this challenge with like basic

05:24

KNN so a paper Raptor came out recently

05:29

on this exact

05:30

idea um and their code recently open

05:33

sourced which led the folks at llama

05:35

index to come out with a llama pack for

05:36

it which is great um and the idea is

05:40

pretty interesting so I wanted to kind

05:42

of lay it out here and talk about how it

05:44

might benefit this exact case of kind of

05:46

long context

05:48

retrieval so the intuition is pretty

05:51

simple First We Take a set of documents

05:55

now note that these documents can be any

05:58

sized so in their case they're just

06:00

chunks so they're like 100 tokens but it

06:04

doesn't matter so we start with a set of

06:06

raw documents now what we do is we embed

06:10

them and then we cluster them so this

06:13

clustering process groups together like

06:15

documents and then we do one important

06:18

thing we summarize information in that

06:20

cluster into what we call kind of like a

06:23

more abstract or higher level summary of

06:25

that

06:26

content and we do that recursively until

06:29

we end up with one cluster that's it so

06:32

what's happening is you're starting with

06:34

the set of what they call leaves or like

06:36

raw documents you do a

06:38

grouping uh via clustering you do a

06:41

summarization steps you're kind of

06:42

compressing and then you do it again and

06:45

the idea is that these kind of midlevel

06:49

or eventually like root level or highest

06:51

level summaries can consolidate

06:53

information from different places in

06:54

your documents now what they do is they

06:58

basically just embed those summaries

07:00

along with the raw leaves and they

07:01

perform retrieval and we'll talk about

07:03

that a little bit later but what they

07:05

show is actually just doing retrieval on

07:08

all of these together like as a whole

07:10

pool performs best um and that's kind of

07:13

a nice result it's pretty easy then to

07:16

basically just index that and and use

07:19

it I will make a note that their paper

07:22

talked about you know these leaves being

07:24

chunks which I didn't love because look

07:26

I want to work with long context models

07:29

and like I don't want to deal with

07:30

chunking at all and I've replied you

07:33

know I replied to Jerry's tweet on this

07:35

and Jerry made a fair point that you

07:36

know this can scale to anything so for

07:40

example those leaves can be full

07:43

documents they don't have to be chunks

07:45

that's completely reasonable Point um so

07:48

again you can kind of think about this

07:49

as if idea one was let's just take each

07:52

document and embed it idea two is well

07:56

let's embed each document like we did

07:58

and we can also build kind of like a a

08:00

document abstraction Tree on top and

08:02

embed those so we have these like higher

08:04

level summaries in our embeddings which

08:07

we can retrieve from if we need an

08:08

answer to consolidate information from like a

08:10

small set of documents right so it's a

08:13

little bit more robust maybe to this

08:15

problem which is that if I'm just doing

08:18

KNN on Raw documents and I need

08:20

information from like two or three

08:21

documents I'm not guaranteed to always

08:24

get that because of this K parameter

08:26

that I set I'm only retrieving k docs

08:30

whereas here I'm building these docks

08:33

that contain information from multiple

08:36

leaves or multiple you know sub-docs so to

08:39

speak um and it can actually just

08:41

capture that information uh in in a in

08:44

kind of a a nice way um such that it can

08:49

it can basically integrate information

08:51

across different individual leaves or

08:53

individual documents so that's the key

08:56

Point um and so we can you can kind of

08:59

see when you think about like working

09:01

long context models of course context

09:03

stuffing is a great option if you can do

09:04

it but there are some other interesting

09:07

ideas one is actually just embedding

09:09

full documents and another is this idea

09:10

of again documents and an abstraction

09:13

tree so let's go ahead and just build

09:16

Raptor because it's pretty interesting

09:18

and to do this I'm actually going to

09:20

going to look at Claude 3 which just came

09:22

out today it's a new set of models from

09:24

anthropic really strong performance and

09:26

should be really good for this use case

09:29

because what I want to do is I want to

09:31

perform summaries of individual

09:33

documents and I don't really want to

09:34

worry about the size of those

09:36

documents um so I'm going to use the

09:39

same set of documents that I previously

09:42

did with the code generation example

09:45

that video came out last week and I have

09:46

an empty notebook here um it we just do

09:50

a few pip installs I'm setting a few

09:52

environment variables for lsmith and now

09:54

I'm just going to say grab my docs so

09:57

that's right here

09:59

and this is going to grab around 33 web

10:02

pages of documentation from for Lang

10:04

chain related to Lang chain expression

10:06

language okay and what I'm going to plot

10:09

here is a histogram of the token counts

10:11

of every page so a bunch are kind of

10:13

small that's find easy to work with so

10:15

less than 2,000 tokens a few are pretty

10:17

big so up to like 12,000

10:20

tokens so that kind of gives you a sense

10:22

of the distribution of pages that we

10:23

want to work with and we're going to

10:25

apply this approach to those pages um

10:28

now I'm going to use anthropics new

10:30

model to do that um and I'll use OpenAI

10:32

embeddings so that's fine I set those

10:35

and now what I'm going to do so this

10:38

code was released uh by the authors of

10:40

the paper and I'm going to explain how

10:42

this works in a little bit but for right

10:45

now I'm just going to copy this over and

10:46

this is all going to be accessible to

10:47

you in the notebook that we're going to

10:48

make public uh so this is all the

10:51

clustering code and we're going to talk

10:52

about what it's doing later I added

10:54

comments and Doc strings to this um so

10:57

it's it's a little bit more

10:58

understandable

10:59

here's some code that I wrote um that

11:01

basically is going to do like

11:03

orchestrate the process of the cluster

11:06

summarize um and then like iteratively

11:08

perform that until you end up with a

11:10

single

11:11

cluster um so there we go I'm going to

11:14

copy this code over and I'm going to

11:16

kick this process off and then I'm going

11:18

to walk through it while this is running

11:20

so that's running now now first I want

11:23

to kind of explain how this clustering

11:24

process works it's it's kind of

11:26

interesting um so the idea actually

11:31

incorporates three important actually

11:33

Four important

11:35

points so it's using this GMM this

11:37

Gaussian mixture model to model the

11:40

distribution of the different clusters

11:43

so what's kind of cool about this

11:45

approach is that you don't actually tell

11:47

it group the data into some number of

11:50

clusters like you do some of approaches

11:52

you kind of set the parameter you want

11:54

like n clusters here um it actually can

11:57

kind of infer or deter determine the

11:59

optimal number of clusters and it uses

12:01

this like BIC again you can dig into

12:03

this in more detail if you want but the

12:05

intuition is that uh this approach will

12:08

kind of guess or attempt to to determine

12:10

the number of clusters automatically for

12:13

you um and it's also modeling the

12:16

distribution of your individual

12:18

documents across the

12:20

clusters um it uses this UMAP or

12:24

basically it's a dimensionality

12:25

reduction approach to improve the

12:27

clustering process so if you want to

12:29

like really read into this that you

12:31

should actually just go and do that um

12:33

the intuition is that this actually

12:35

helps improve

12:36

clustering um it also does clustering

12:39

what they call like local and Global so

12:41

it tries to analyze the data at two

12:43

different scales um like kind of look at

12:46

like patterns kind of within smaller

12:48

groups and then like within the full

12:50

data set to try to improve how you're

12:52

going to group these documents uh

12:54

together and it applies thresholding to

12:59

assign the basically the group

13:01

assignment for every document or the

13:03

cluster assignment for every document so

13:05

this is really the

13:07

idea here's all my documents let's look

13:10

at this one what's happening is it's

13:12

using this GMM to basically assign a

13:15

probability that this document belongs

13:17

to each one of our clusters so like

13:20

here's cluster one here's cluster two

13:22

here's cluster three each cluster will

13:24

get a

13:25

probability and this thresholding then

13:27

is applied to those

13:29

probabilities such that a document can

13:32

actually belong to more than one cluster

13:35

so that's actually really nice cuz in a

13:37

lot of other approaches it's kind of

13:39

mutually exclusive so document can only

13:41

live in one or another cluster but with

13:43

this approach it can actually belong to

13:44

multiple clusters so that's like a nice

13:46

benefit of this

13:48

approach um I think that's kind of all I

13:51

want to say initially about this

13:53

clustering strategy uh but you should

13:56

absolutely have a look at the paper

13:58

which I uh will also ensure that we

14:01

link um so right now let's actually go

14:05

look at the code so we can see that it's

14:07

performing this this cluster

14:09

generation and let's actually look at

14:11

kind of what it's doing so it really

14:14

follows what we just talked

14:16

about we have a set of texts and in this

14:20

case my texts are just that those you

14:22

know those 33 web pages uh that I'm

14:26

passing in so we can actually look at

14:27

that so what I passed in these Leaf text

14:31

Leaf text I Define as my docs again

14:33

let's actually go back and look at our

14:35

diagram so we can like follow

14:37

along uh here we go so these leaves are

14:41

my web pages that's it so here's my leaf

14:44

text and you can see let's look at the

14:46

length there uh okay there's 31 of them

14:50

so that's

14:51

fine um and what's happening is those

14:57

first

14:59

get embedded as

15:01

expected and then so here's the

15:04

embeddings and then they get clustered

15:06

and this perform clustering is taken

15:08

directly from basically the results uh

15:10

or the code provided by the authors of

15:12

the paper so it's doing that process I

15:15

just talked about um of clustering

15:19

basically cluster assignment um and we

15:22

get our cluster labels out we put those

15:24

in a data frame um and so so then we

15:28

have our clusters you can see that here

15:31

and because each document can belong to

15:34

more than one cluster we actually expand

15:36

out the data frame um so that the

15:40

cluster column um Can may contain

15:45

duplicates uh for a single document so

15:48

one document can live in multiple

15:50

clusters and we just flatten it out to

15:51

show that then all we do is we get the

15:55

whole list of clusters here um the

15:58

Define a summarization prompt pretty

16:00

simple and all we do is we have our data

16:02

frame Just Fish Out give me all the

16:05

texts within each cluster and that's all

16:07

we're doing here for each cluster get

16:10

all the text plug it into our

16:12

summarization prompt generate the

16:14

summary here's our summary data frame

16:17

that's really it so again iterate

16:19

through our clusters get the text for

16:22

every cluster summarize it write that

16:25

out to a data frame and that's all we do

16:28

here's our cluster data frame here's our

16:30

summary data frame from that function um

16:33

and this is just orchestrating that

16:35

process of like iteration so we just

16:37

keep doing this until I provide like a

16:41

level or n levels parameter you can say

16:42

do this end times or um you know uh or

16:48

um the number of clusters is is equal to

16:51

one so so basically this is saying

16:54

continue until either we've done n

16:56

levels or like n number of of of

16:59

attempts um in our tree or the number of

17:02

clusters is one keep doing that and

17:04

that's it so you can see we've actually

17:05

run that process we have our results

17:08

Here and Now what we can do is pretty

17:12

simply um we can just put those

17:15

into uh an index like we can use chroma

17:18

as a vector store um so here's just some

17:21

really simple code to do that or just

17:22

iterating through our results we're

17:24

getting all our summaries out so first

17:27

maybe I should make this a little bit more

17:28

clear we take all those raw

17:31

documents and we add we create like we

17:34

add those to our text lists we then get

17:36

all of our summaries from our tree we

17:38

add those and we just index all of them

17:40

so let's do that so these These are

17:43

going to all be added them to chroma and

17:46

very finally we can set up a retrieval

17:48

chain that is

17:50

using this index which contains both our

17:53

leaves so all those raw web pages and

17:56

these higher level summary pages that's

17:58

all that's happening here we pull in a

18:00

rag prompt um here's our retriever

18:04

here's our question so let's give this a

18:05

shot so this is running and I want to

18:08

just bring you back to the diagram so

18:10

again you can kind of see what's going

18:11

on

18:12

here

18:14

um right here so again we took our web

18:19

pages uh again 31 of them we cluster

18:23

them we summarize them we do that

18:25

iteratively um then what we do is we

18:28

take those summaries that we generated

18:31

along with the raw web pages and we

18:33

index all of them that's it and we can

18:36

use that index for retrieval so this is

18:38

like a nice what we might call a long

18:40

context index because it contains just

18:43

raw web pages which vary from 2,000 to

18:45

12,000 tokens and it contains in our

18:48

case these higher level summaries in

18:51

case we need an integration of

18:52

information across those pages um which

18:55

may or may not be captured just using

18:58

KNN retrieval so that's the big idea

19:01

okay this ran we got our answer we can

19:03

check LangSmith and we can see here's

19:07

our

19:07

retriever um and let's see here's the

19:11

raw

19:12

documents so it looks like it retrieved

19:15

some kind of higher level summaries as

19:18

well as some raw leaves so this is like a

19:21

raw web page and then some of these are

19:24

more like summary Pages which looks like

19:26

we produced so what's kind of cool about

19:29

this is you can retrieve from a

19:31

combination of like your raw Pages as

19:35

well as these higher level summaries

19:37

which gives you some robustness and

19:38

decent coverage for different types

19:40

of questions that require like different

19:42

resolutions of of abstraction or

19:44

detailed answer like a really detailed

19:47

code question you might retrieve

19:50

directly from your raw pages but like a

19:52

higher level question that integrates

19:54

information from a bunch of pages you

19:56

might retrieve from these midlevel or

19:58

even top level summaries so it's a cool

20:00

approach it integrates kind of nicely

20:03

with long context models and I know one

20:06

thing that will come up here is well

20:08

look your full context was only 60,000

20:11

tokens you could just stuff all of that

20:13

into one of these models you didn't need

20:15

to do any of this that is absolutely

20:18

true for this case but what I think the

20:20

high level point is that's not true for

20:22

every case for example this exact set of

20:26

documents I want to use with Mistral Mistral

20:30

is only 32,000 tokens so this is a

20:34

really nice approach for that case where

20:36

I can kind of guarantee that I can index

20:39

across all these pages but I won't

20:41

exceed the context limit or and likely

20:43

to exceed the context limit of my llm

20:46

because none of these individual Pages

20:48

exceed 32,000 tokens so you know you can

20:51

see and again this scale is arbitrarily

20:54

large so it is true that this set of

20:55

documents is only 62,000 tokens

20:58

but of course there's much larger

20:59

corpuses which could extend beyond even

21:01

the 200,000 of Claude 3 in which case this

21:05

type of approach of kind of indexing

21:07

across documents um and building these

21:10

like kind of mid-level high level

21:11

summaries can be applicable so it's a

21:14

cool method it's a neat paper um I

21:16

definitely encourage you to experiment

21:18

with it um and all this code will be

21:21

available um for you to to work with and

21:25

um I think that's about it thanks very

21:28

much


Related Tags
Long-document processing · Raptor method · Long-context models · Information retrieval · Document summarization · Machine learning · Natural language processing · Technical optimization · Data compression · Knowledge integration