Can LLMs reason? | Yann LeCun and Lex Fridman

Lex Clips
13 Mar 2024 · 17:54

Summary

TLDR This video explores the reasoning abilities of large language models (LLMs) and their limitations. It points out that LLMs spend a constant amount of computation generating an answer regardless of how complex the question is, unlike humans, who invest more time and effort in harder problems. It suggests that future dialogue systems may use energy-based models that plan their answers in an abstract thought space through an optimization process, rather than simply generating text autoregressively. Such systems would be able to reason and plan more deeply and produce better answers.

Takeaways

  • 🤖 The type of reasoning in LLMs (large language models) is very primitive, because the amount of computation spent per generated token is constant.
  • 🔄 Regardless of how complex a question is, the computation the system can devote to answering it is fixed, proportional to the number of tokens produced.
  • 🧠 Humans reason differently from LLMs: faced with a complex problem, we spend more time thinking it through and answering it.
  • 🔄 Human thinking is predictive, iterative, and hierarchical; LLMs currently cannot carry out this kind of complex reasoning.
  • 🚀 Future dialogue systems may follow a different blueprint, quite unlike autoregressive LLMs, and may include long-term memory and reasoning abilities.
  • 🤔 A world model is needed as the foundation on which higher-level reasoning mechanisms can be built.
  • 🌟 System 1 tasks can be done subconsciously, while System 2 tasks require deliberate planning and thought.
  • 🛠 Future dialogue systems may think about and plan their answers through an optimization process before turning them into text.
  • 🌐 Optimizing in an abstract representation space rather than over concrete token sequences allows answers to be iterated and refined more efficiently.
  • 📈 Training an energy-based model means showing it compatible and incompatible sample pairs and adjusting the network weights so that low energy corresponds to compatible pairs.
  • 🎯 For visual data, the prediction error between representations serves as the energy used to judge an image or video; this approach yields a good compressed representation of visual reality.

Q & A

  • Why is the type of reasoning in LLMs considered primitive?

    -The reasoning in LLMs is considered primitive because the amount of computation spent per generated token is constant. This means the system devotes the same resources to computing an answer regardless of how complex the question is.

  • How does the way humans handle complex problems differ from LLMs?

    -Faced with a complex problem, humans spend more time thinking it through and solving it, with predictive, iterative, and hierarchical elements. Unlike LLMs, humans adjust the time and effort they invest according to the difficulty of the problem.

  • How could LLMs be improved to achieve more advanced reasoning and planning?

    -By building a good world model and adding mechanisms such as persistent long-term memory and reasoning on top of it. Future dialogue systems will be able to think and plan before answering, which will be very different from autoregressive LLMs.

  • What are System 1 and System 2, and what do they represent in human psychology?

    -System 1 covers tasks people can perform without deliberate thought, such as driving or, for an experienced player, playing chess against a weaker opponent. System 2 covers tasks that require planning and thought, such as an experienced chess player planning moves against another experienced player.

  • How will future dialogue systems plan their answers?

    -Future dialogue systems will plan their answers through an optimization process in an abstract representation space, and only then turn the result into text with a simple autoregressive decoder, instead of generating the answer text token by token.

  • What is an energy-based model, and how does it work?

    -An energy-based model is a function that outputs a scalar value indicating how good or bad an answer is for a given prompt. Through an optimization process, the model searches the abstract representation space for an answer that minimizes this value.
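
    To make this concrete, here is a minimal PyTorch-style sketch of an energy-based model scored over a latent answer representation, with gradient-based inference. The names (`EnergyModel`, `plan_answer`), dimensions, and optimizer settings are illustrative assumptions, not the architecture described in the video; the decoder that turns the optimized representation into text is omitted.

```python
# Minimal sketch (not the actual system): an energy model E(prompt, z) scores a
# latent answer representation z, and inference is gradient descent on z,
# not autoregressive token sampling.
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # scores a (prompt representation, answer representation) pair
        self.net = nn.Sequential(nn.Linear(2 * dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, prompt_repr, answer_repr):
        # scalar output: near 0 for a good answer, large for a bad one
        return self.net(torch.cat([prompt_repr, answer_repr], dim=-1)).squeeze(-1)

def plan_answer(energy_model, prompt_repr, dim=256, steps=100, lr=0.1):
    """Gradient-based inference: refine the abstract answer representation z so
    that the energy is minimized, then hand z to a decoder that produces text."""
    z = torch.zeros(dim, requires_grad=True)     # initial guess for the "thought"
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy_model(prompt_repr, z).backward()  # backprop w.r.t. z only
        opt.step()
    return z.detach()                            # pass to a simple autoregressive decoder
```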

  • How do you train an energy-based model?

    -Training an energy-based model typically involves showing it compatible and incompatible pairs of X and Y and adjusting the neural network's parameters to produce the correct output. Contrastive and non-contrastive methods are the two common families of training approaches.
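
    A rough sketch of the contrastive option, assuming a simple hinge loss (the margin value and helper name are ours, not from the video): the energy of a compatible pair is driven toward zero while the energy of a mismatched pair is pushed above a margin.

```python
# Contrastive training sketch: drive the energy of a compatible (x, y) pair
# toward zero and push the energy of a mismatched pair above a margin.
import torch

def contrastive_step(energy_model, optimizer, x, y_good, y_bad, margin=1.0):
    optimizer.zero_grad()
    e_good = energy_model(x, y_good)            # should end up near 0
    e_bad = energy_model(x, y_bad)              # should end up above the margin
    loss = e_good + torch.relu(margin - e_bad)  # simple hinge loss
    loss.backward()
    optimizer.step()
    return loss.item()

# A non-contrastive method would drop y_bad and instead add a regularizer that
# shrinks the volume of (x, y) space that can receive low energy.
```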

  • What makes a good representation of X and Y in an energy-based model?

    -Good representations of X and Y are usually abstract, conceptual representations rather than raw language text. These representations can be run through the optimization process to minimize the output energy and then be translated into a good answer.

  • Why is optimization in a continuous space more efficient than in a discrete space?

    -In a continuous space you can use methods such as gradient descent to iteratively refine an answer and move toward the optimum. In a discrete space you have to generate a large number of hypotheses and select the best one, which is far less computationally efficient.
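
    For contrast, here is a sketch of the inefficient discrete strategy referred to above, with placeholder callables `generate_fn` and `score_fn` (our names): every candidate requires a full generation and a full evaluation, whereas gradient-based refinement improves a single continuous vector in place.

```python
# The wasteful discrete alternative: generate many candidate token sequences,
# score each one, keep the best. Cost grows linearly with the number of
# candidates; gradient-based refinement instead improves one vector in place.
def best_of_n(generate_fn, score_fn, prompt, n=64):
    candidates = [generate_fn(prompt) for _ in range(n)]  # n full generations
    scores = [score_fn(prompt, c) for c in candidates]    # n full evaluations
    return candidates[scores.index(max(scores))]          # pick the best one
```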

  • How is an LLM prevented from giving the same output for every input?

    -By minimizing the cross-entropy during training, an LLM increases the probability it assigns to the correct word while simultaneously decreasing the probability it assigns to incorrect words. This indirect mechanism keeps the model from collapsing to the same answer for every input.
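
    A tiny numerical illustration of why this works: the softmax that produces next-word probabilities normalizes over the whole vocabulary, so raising the correct word's score necessarily lowers every other word's probability. The three-word vocabulary below is just a toy example.

```python
# Toy 3-word vocabulary: raising the correct word's logit necessarily lowers
# every other word's probability, because the softmax outputs sum to 1.
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])
print(F.softmax(logits, dim=-1))   # approx [0.63, 0.23, 0.14]

logits[0] += 1.0                   # training pushes up the correct word's score
print(F.softmax(logits, dim=-1))   # approx [0.82, 0.11, 0.07] -- the others fell
```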

  • How is the energy-based approach applied to visual data?

    -For visual data, the energy is the prediction error between representations: the model predicts the representation of a clean image or video from a corrupted version and measures how far off it is. This provides a compressed representation of visual reality.
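
    A minimal sketch of that idea, with toy stand-ins for the encoder and predictor; real joint-embedding (JEPA-style) training also needs the collapse-prevention machinery discussed above (e.g. a regularizer or a separate target encoder), which is omitted here.

```python
# Sketch: encode a clean image and a corrupted view, predict the clean
# representation from the corrupted one, and use the prediction error as the
# energy (near zero when the two inputs are versions of the same thing).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # toy encoder
predictor = nn.Linear(128, 128)                                     # toy predictor

def energy(clean, corrupted):
    s_clean = encoder(clean)                 # representation of the original
    s_pred = predictor(encoder(corrupted))   # predicted from the corrupted view
    return ((s_pred - s_clean) ** 2).mean()  # prediction error = energy

img = torch.rand(1, 3, 32, 32)
masked = img * (torch.rand_like(img) > 0.5)  # crude corruption by masking
print(energy(img, masked))                   # a trained model gives this low energy
```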

Outlines

00:00

🤖 The type of reasoning in LLMs and its limitations

This segment discusses the type of reasoning that takes place in large language models (LLMs), arguing that it is very primitive. The reason is that the amount of computation spent per generated token is constant, so the system devotes the same computational resources whether a question is simple or complex. This differs from human reasoning, where more time is spent thinking through harder problems. The segment also raises the possibility of building a more capable reasoning system on top of a good world model, augmented with mechanisms such as long-term memory and reasoning.

05:00

🌟 A blueprint for future dialogue systems

This segment lays out a blueprint for future dialogue systems, stressing the importance of thinking and planning before answering. It proposes an energy-based approach: a thought is formed in an abstract representation space through an optimization process and then turned into text by an autoregressive decoder. Such a system would differ from today's autoregressive language models; because it optimizes in a continuous space with methods like gradient descent, it can be far more efficient.

10:03

🧠 The challenge of training an energy-based model

This segment discusses how to train an energy-based model to recognize and produce good answers. An energy-based model is a function whose scalar output measures how good an answer is for a given question. Training it requires showing it many compatible and incompatible sample pairs and adjusting the network weights so it correctly separates good answers from bad ones. Contrastive and non-contrastive methods are discussed, along with a regularizer that keeps the energy high on inputs outside the training set.

15:06

📈 An energy function and training method for visual data

This segment turns to visual data and discusses how an energy function can be used to train models on images and video. By comparing the representation of an original image with a prediction made from a corrupted version, the model learns from the prediction error and ends up with a compressed, useful representation of visual reality. This approach is used in joint-embedding architectures and has proven effective, for example as input to classification systems.

Keywords

💡Reasoning

Reasoning is the process of drawing conclusions from known information. In the video, reasoning in large language models (LLMs) is described as very primitive because the amount of computation spent per generated token is constant, unlike humans, who invest more time thinking through complex problems.

💡Computation

Computation refers to the computational resources required to perform a task. The video discusses how LLMs allocate a constant amount of computation to each generated token regardless of a question's complexity, which limits their ability to handle hard problems.

💡Token

A token is a basic unit in natural language processing, typically a word, subword, or other language element. The video notes that LLMs break questions and answers into sequences of tokens and allocate the same computational resources to each one.

💡Prediction network

The prediction network is the neural network that predicts the next token. In the video, its size (36 layers, 92 layers, or whatever it is) multiplied by the number of tokens produced determines the computation spent generating an answer.

💡World model

A world model is an internal understanding and representation of the real world that helps a system interpret and act on information. The video argues that building a good world model is the key to more advanced reasoning and planning, which differs from how LLMs currently work.

💡Long-term memory

Long-term memory is the ability to store and recall past information, essential for solving complex problems and for deep reasoning. In the video it is mentioned as a mechanism future dialogue systems could add to support deeper thinking and planning.

💡Optimization

Optimization is the process of improving a system or procedure toward its best performance. In the video it describes how the thought held in an abstract representation space is adjusted to produce a better answer.

💡Energy-based model

An energy-based model is a function that outputs a scalar value evaluating how compatible its inputs are with each other. In the video it is proposed as a possible blueprint for future dialogue systems, which would produce good answers by minimizing that value.

💡Abstract representation

An abstract representation expresses concrete concepts, thoughts, or problems in a more general form that does not depend on a specific language or set of symbols. The video presents it as a key component of future dialogue systems, allowing them to think and plan independently of the language used to express the answer.

💡Gradient descent

Gradient descent is an algorithm for optimizing differentiable functions by iteratively adjusting parameters to minimize the function's output. The video mentions it as an efficient way to optimize abstract representations in a continuous space and so produce better answers.

💡Concept space

Concept space is an abstract space for representing and manipulating concepts or thoughts, beyond concrete sensory information. In the video it is the space in which future dialogue systems would optimize the abstract representation of their answers.

Highlights

The type of reasoning in LLMs (large language models) is very primitive, because the amount of computation spent per generated token is constant.

Whether a question is simple, complicated, or impossible to answer, the system devotes a constant amount of computation to it, unlike the way humans handle complex problems.

Humans spend more time solving and answering complex problems, whereas LLMs lack this ability to reason and iteratively adjust their understanding.

Future dialogue systems will be able to plan and optimize before answering, which is very different from autoregressive LLMs.

The system would use a gigantic neural network whose single output is a scalar value measuring how good an answer is for a question.

A thought is formed in an abstract representation space through an optimization process rather than by generating text directly.

Future dialogue systems will think about and plan their answers through optimization before turning them into text.

The objective function of this optimization is defined over an abstract representation space, not over the space of possible text sequences.

With gradient descent and backpropagation, the abstract representation of the answer can be optimized to move toward the best answer.

The system can be trained with contrastive or non-contrastive methods so that it also produces the correct energy for samples outside the training set.

The energy function needs to have low energy on the compatible samples from the training set and higher energy everywhere else.

A regularization term ensures that the energy is higher on samples outside the training set.

In LLMs, minimizing the cross-entropy indirectly gives high probability to good sequences and low probability to bad ones.

For visual data, the prediction error measures how well an image or video matches, yielding a compressed representation of visual reality.

Classification systems that use these representations as input work very well, which shows the approach is effective.

LLMs currently cannot reason or plan deeply, but future systems will be able to think and reason before generating an answer.

Better reasoning and planning will make dialogue systems more efficient and accurate on complex problems.

Building a system capable of deep reasoning requires starting from an abstract concept space rather than relying directly on language.

Transcripts

00:03

the type of reasoning that takes place

00:04

in llm is very very primitive and the

00:07

reason you can tell is primitive is

00:09

because the amount of computation that

00:11

is spent per token produced is constant

00:15

so if you ask a question and that

00:17

question has an answer in a given number

00:20

of token the amount of computation

00:22

devoted to Computing that answer can be

00:24

exactly estimated it's like you know

00:27

it's how it's the the size of the

00:30

prediction Network you know with its 36

00:32

layers or 92 layers or whatever it is uh

00:35

multiply by number of tokens that's it

00:37

and so essentially it doesn't matter if

00:40

the question being asked is is simple to

00:45

answer complicated to answer impossible

00:48

to answer because it's undecidable or

00:50

something um the amount of computation

00:53

the system will be able to devote to

00:55

that to the answer is constant or is

00:57

proportional to the number of token

00:59

produced in the answer right this is not

01:01

the way we work the way we reason is

01:04

that when we're faced with a complex

01:08

problem or complex question we spend

01:10

more time trying to solve it and answer

01:12

it right because it's more difficult

01:15

there's a prediction element there's a

01:17

iterative element where you're like

01:21

uh adjusting your understanding of a

01:23

thing by going over over and over and

01:25

over there's a hierarchical element so

01:27

on does this mean that a fundamental

01:29

flaw of llms or does it mean

01:32

that there's more part to that

01:35

question now you're just behaving like

01:37

an

01:38

llm immediately answer no that that it's

01:43

just the low-level world model on top of

01:46

which we can then build some of these

01:49

kinds of mechanisms like you said

01:51

persistent long-term memory

01:53

or uh reasoning so on but we need that

01:57

world model that comes from language is

02:00

it maybe it is not so difficult to build

02:03

this kind of uh reasoning system on top

02:06

of a well constructed World model OKAY

02:09

whether it's difficult or not the near

02:11

future will will say because a lot of

02:13

people are working on reasoning and

02:15

planning abilities for for dialogue

02:18

systems um I mean if we're even if we

02:20

restrict ourselves to

02:22

language uh just having the ability to

02:25

plan your answer before you

02:27

answer uh in terms that are not

02:29

necessarily linked with the language

02:31

you're going to use to produce the

02:33

answer right so this idea of this mental

02:35

model that allows you to plan what

02:36

you're going to say before you say it MH

02:40

um that is very important I think

02:43

there's going to be a lot of systems

02:45

over the next few years are going to

02:47

have this capability but the blueprint

02:50

of those systems will be extremely

02:52

different from Auto regressive LLMs so

02:57

um it's the same difference as has the

03:00

difference between what psychology is

03:02

called system one and system two in

03:03

humans right so system one is the type

03:06

of task that you can accomplish without

03:08

like deliberately consciously think

03:09

about how you do them you just do them

03:13

you've done them enough that you can

03:15

just do it subconsciously right without

03:17

thinking about them if you're an

03:18

experienced driver you can drive without

03:21

really thinking about it and you can

03:23

talk to someone at the same time or

03:24

listen to the radio right um if you are

03:28

a very experienced chess player you can

03:30

play against a non-experienced chess player

03:32

without really thinking either you just

03:34

recognize the pattern and you play mhm

03:36

right that's system one um so all the

03:40

things that you do instinctively without

03:41

really having to deliberately plan and

03:44

think about it and then there is all

03:45

task what you need to plan so if you are

03:48

a not to experienced uh chess player or

03:51

you are experienced where you play

03:52

against another experienced chess player

03:54

you think about all kinds of options

03:56

right you you think about it for a while

03:58

right and you you you're much better if

04:01

you have time to think about it than you

04:02

are if you are if you play Blitz uh with

04:05

limited time so and um so this type of

04:09

deliberate uh planning which uses your

04:12

internal world model um that's System 2

04:16

this is what LLMs currently cannot do so

04:18

how how do we get them to do this right

04:20

how do we build a system that can do

04:22

this kind of planning that or reasoning

04:26

that devotes more resources to complex

04:29

problems than to simple problems

04:32

and it's not going to be Auto regressive

04:33

prediction of tokens it's going to be

04:36

more something akin to inference of

04:40

latent variables in um you know what

04:44

used to be called probabilistic models or

04:47

graphical models and things of that type

04:49

so basically the principle is like this

04:51

you you know the prompt is like observed

04:55

uh variables mhm and what you're what

04:59

the model

05:00

does is that it's basically a

05:03

measure of it can measure to what extent

05:06

an answer is a good answer for a prompt

05:10

okay so think of it as some gigantic

05:12

neural net but it's got only one output

05:14

and that output is a scalar number which

05:17

is let's say zero if the answer is a

05:19

good answer for the question and a large

05:22

number if the answer is not a good

05:23

answer for the question imagine you had

05:25

this model if you had such a model you

05:28

could use it to produce good answers the

05:30

way you would do

05:32

is you know produce the prompt and then

05:34

search through the space of possible

05:36

answers for one that minimizes that

05:39

number um that's called an energy based

05:42

model but that energy based model would

05:45

need the the model constructed by the

05:49

llm well so uh really what you need to

05:52

do would be to not uh search over

05:55

possible strings of text that minimize

05:57

that uh energy but what you would do is

06:00

do this in abstract representation space

06:02

so in in sort of the space of abstract

06:05

thoughts you would elaborate a thought

06:08

right using this process of minimizing

06:11

the output of your your model okay which

06:14

is just a scalar um it's an optimization

06:17

process right so now the the way the

06:19

system produces its answer is through

06:22

optimization um by you know minimizing

06:25

an objective function basically right uh

06:28

and this is we're talking about

06:28

inference not talking about training

06:30

right the system has been trained

06:32

already so now we have an abstract

06:34

representation of the thought of the

06:36

answer representation of the answer we

06:38

feed that to basically an auto

06:40

regressive decoder uh which can be very

06:42

simple that turns this into a text that

06:45

expresses this thought okay so that that

06:48

in my opinion is the blueprint of future

06:50

dialog systems um they will think about

06:54

their answer plan their answer by

06:56

optimization before turning it into text

07:00

uh and that is Turing complete can you

07:03

explain exactly what the optimization

07:05

problem there is like what's the

07:07

objective function just Linger on it you

07:10

you kind of briefly described it but

07:13

over what space are you optimizing the

07:15

space of

07:16

representations goes abstract

07:18

representation abstract repres so you

07:20

have an abstract representation inside

07:22

the system you have a prompt The Prompt

07:24

goes through an encoder produces a

07:26

representation perhaps goes through a

07:27

predictor that predicts a representation

07:29

of the answer of the proper answer but

07:31

that representation may not be a good

07:35

answer because there might there might

07:36

be some complicated reasoning you need

07:38

to do right so um so then you have

07:41

another process that takes the

07:44

representation of the answers and

07:46

modifies it so as to

07:49

minimize uh a cost function that

07:51

measures to what extent the answer is a

07:53

good answer for the question now we we

07:56

sort of ignore the the fact for I mean

07:59

the the issue for a moment of how you

08:01

train that system to measure whether an

08:05

answer is a good answer for for but

08:07

suppose such a system could be created

08:10

but what's the process this kind of

08:12

search like process it's a optimization

08:15

process you can do this if if the entire

08:17

system is

08:18

differentiable that scalar output is the

08:21

result of you know running through some

08:23

neural net MH uh running the answer the

08:26

representation of the answer to some

08:27

neural net then by GR

08:29

by back propag back propagating

08:31

gradients you can figure out like how to

08:33

modify the representation of the answer

08:35

so as to minimize that so that's still

08:37

gradient based it's gradient based

08:39

inference so now you have a

08:40

representation of the answer in abstract

08:42

space now you can turn it into

08:45

text right and the cool thing about this

08:49

is that the representation now can be

08:52

optimized through gradient descent but

08:54

also is independent of the language in

08:56

which you're going to express the

08:58

answer right so you're operating in the

09:00

abstract representation I mean this

09:02

goes back to the Joint embedding that is

09:04

better to work in the uh in the space of

09:08

I don't know to romanticize the notion

09:10

like space of Concepts versus yeah the

09:13

space of

09:15

concrete sensory information

09:18

right okay but this can can this do

09:21

something like reasoning which is what

09:22

we're talking about well not really in a

09:24

only in a very simple way I mean

09:26

basically you can think of those things

09:27

as doing the kind of optimization I was

09:30

I was talking about except they optimize

09:32

in the discrete space which is the space

09:34

of possible sequences of of tokens and

09:37

they do it they do this optimization in

09:39

a horribly inefficient way which is

09:41

generate a lot of hypothesis and then

09:43

select the best ones and that's

09:46

incredibly wasteful in terms of uh

09:49

computation because you have you run you

09:51

basically have to run your LM for like

09:53

every possible you know generated sequence

09:56

um and it's incredibly wasteful

09:59

um so it's much better to do an

10:03

optimization in continuous space where

10:05

you can do gradient descent as opposed to

10:07

like generate tons of things and then

10:08

select the best you just iteratively

10:11

refine your answer to to go towards the

10:13

best right that's much more efficient

10:15

but you can only do this in continuous

10:17

spaces with differentiable functions

10:19

you're talking about the reasoning like

10:22

ability to think deeply or to reason

10:25

deeply how do you know what

10:29

is an

10:31

answer uh that's better or worse based

10:34

on deep reasoning right so then we're

10:37

asking the question of conceptually how

10:39

do you train an energy based model right

10:41

so energy based model is a function with

10:43

a scalar output just a

10:45

number you give it two inputs X and Y M

10:49

and it tells you whether Y is compatible

10:51

with X or not X You observe let's say

10:53

it's a prompt an image a video whatever

10:56

and why is a proposal for an answer a

10:59

continuation of video um you know

11:03

whatever and it tells you whether Y is

11:05

compatible with X and the way it tells

11:07

you that Y is compatible with X is that

11:09

the output of that function will be zero

11:11

if Y is compatible with X it would be a

11:14

positive number non zero if Y is not

11:17

compatible with X okay how do you train

11:19

a system like this at a completely

11:22

General level is you show it pairs of X

11:26

and Y that are compatible e.g. a question

11:28

and the corresponding answer and you train the

11:31

parameters of the big neural net inside

11:34

um to produce zero M okay now that

11:37

doesn't completely work because the

11:39

system might decide well I'm just going

11:41

to say zero for everything so now you

11:43

have to have a process to make sure that

11:45

for a a wrong y the energy would be

11:48

larger than zero and there you have two

11:51

options one is contrastive Method so

11:53

contrastive method is you show an X and

11:55

a bad

11:56

Y and you tell the system well that's

11:59

you know give a high energy to this like

12:01

push up the energy right change the

12:02

weights in the neural net that computes

12:04

the energy so that it goes

12:06

up um so that's contrastive methods the

12:09

problem with this is if the space of Y

12:12

is large the number of such contrastive

12:15

samples you're going to have to show is

12:19

gigantic but people do this they they do

12:22

this when you train a system with RLHF

12:25

basically what you're training is what's

12:28

called a reward model which is basically

12:30

an objective function that tells you

12:32

whether an answer is good or bad and

12:34

that's basically exactly what what this

12:37

is so we already do this to some extent

12:40

we're just not using it for inference

12:41

we're just using it for training um uh

12:45

there is another set of methods which

12:47

are non-contrastive and I prefer those

12:50

uh and those non-contrastive method

12:52

basically

12:53

say uh okay the energy function needs to

12:58

have low energy on pairs of X and Y that are

13:01

compatible that come from your training

13:03

set how do you make sure that the energy

13:05

is going to be higher everywhere

13:07

else and the way you do this is by um

13:11

having a regularizer a Criterion a term

13:15

in your cost function that basically

13:17

minimizes the volume of space that can

13:21

take low

13:22

energy and the precise way to do this is

13:24

all kinds of different specific ways to

13:26

do this depending on the architecture

13:28

but that's the basic principle so that

13:30

if you push down the energy function for

13:33

particular regions in the XY space it

13:35

will automatically go up in other places

13:37

because there's only a limited volume of

13:40

space that can take low energy okay by

13:43

the construction of the system or by the

13:45

regularizer regularizing function we've

13:48

been talking very generally but what is

13:51

a good X and a good Y what is a good

13:53

representation of X and Y cuz we've been

13:57

talking about language and if you just

13:59

take language directly that presumably

14:02

is not good so there has to be some kind

14:04

of abstract representation of

14:06

ideas yeah so you I mean you can do this

14:09

with language directly um by just you

14:12

know X is a text and Y is the

14:14

continuation of that text yes um or X is

14:17

a question Y is the answer but you're

14:20

you're saying that's not going to take

14:21

it I mean that's going to do what LLMs

14:22

are doing well no it depends on how you

14:26

how the internal structure of the system

14:28

is built if the if the internal

14:29

structure of the system is built in such

14:31

a way that inside of the system there is

14:34

a latent variable that's called Z that

14:37

uh you can manipulate so as to minimize

14:42

the output

14:43

energy then that Z can be viewed as a

14:46

representation of a good answer that you

14:48

can translate into a y that is a good

14:51

answer so this kind of system could be

14:54

trained in a very similar way very

14:56

similar way but you have to have this

14:58

way of preventing collapse of of

15:00

ensuring that you know there is high

15:02

energy for things you don't train it on

15:05

um and and currently it's it's very

15:09

implicit in llm it's done in a way that

15:11

people don't realize it's being done but

15:12

it is being done is is due to the fact

15:15

that when you give a high probability to

15:18

a

15:19

word automatically you give low

15:21

probability to other words because you

15:23

only have a finite amount of probability

15:26

to go around right they sum to one

15:29

um so when you minimize the cross

15:30

entropy or whatever when you train the

15:33

your llm to produce the to predict the

15:35

next word uh you're increasing the

15:38

probability your system will give to the

15:40

correct word but you're also decreasing

15:41

the probability will give to the

15:42

incorrect words now indirectly that

15:46

gives a high

15:49

probability to sequences of words that

15:50

are good and low probability to

15:52

sequences of words that are bad but it's

15:53

very indirect and it's not it's not

15:56

obvious why this actually works at all

15:58

but um because you're not doing it on

16:01

the joint probability of all the symbols

16:03

in a in a sequence you're just doing it

16:05

kind

16:06

of you sort of factorize that

16:08

probability in terms of conditional

16:10

probabilities over successive tokens so

16:13

how do you do this for visual data so

16:15

we've been doing this with JEPA

16:17

architectures basically the joint embedding

16:19

predictive architecture so uh the compatibility

16:23

between two things is uh you know here's

16:25

here's an image or a video here's a

16:28

corrupted shifted or transformed version

16:29

of that image or video or masked okay

16:33

and then uh the energy of the system is

16:36

the prediction error of

16:40

the

16:42

representation uh the the predicted

16:45

representation of the Good Thing versus

16:46

the actual representation of the good

16:48

thing right so so you run the corrupted

16:51

image to the system predict the

16:53

representation of the the good input

16:55

uncorrupted and then compute the

16:57

prediction error that's energy of the

16:59

system so this system will tell you this

17:01

is a

17:04

good you know if this is a good image

17:06

and this is a corrupted version it will

17:08

give you Zero Energy if those two things

17:10

are effectively one of them is a

17:13

corrupted version of the other give you

17:15

a high energy if the if the two images

17:17

are completely different and hopefully

17:18

that whole process gives you a really

17:21

nice compressed representation of of

17:24

reality of visual reality and we know it

17:26

does because then we use those

17:28

representations as input to a

17:30

classification system that

17:31

classification system works really

17:32

nicely

17:52

okay