Can LLMs reason? | Yann LeCun and Lex Fridman
Summary
TLDR: This video segment examines the reasoning abilities of large language models (LLMs) and their limitations. LeCun points out that an LLM spends a constant amount of computation per generated token, regardless of how hard the question is, which is unlike humans, who devote more time and effort to harder problems. He proposes that future dialogue systems may be energy-based models that plan their answers through an optimization process in an abstract space of thoughts rather than simply generating text autoregressively. Such systems would be able to reason and plan more deeply, improving the quality of their answers.
Takeaways
- 🤖 The type of reasoning in LLMs (large language models) is very primitive, because the amount of computation spent per generated token is constant.
- 🔄 Regardless of a question's complexity, the computation the system can devote to answering it is fixed, proportional only to the number of tokens produced.
- 🧠 Humans reason differently from LLMs: faced with a complex problem, we spend more time thinking it through before answering.
- 🔄 Human thinking is predictive, iterative, and hierarchical; LLMs currently cannot perform this kind of complex reasoning.
- 🚀 Future dialogue systems will likely follow a different blueprint from autoregressive LLMs, potentially including persistent long-term memory and reasoning abilities.
- 🤔 A world model is needed as a foundation on which more advanced reasoning mechanisms can be built.
- 🌟 System 1 tasks can be done subconsciously, whereas System 2 tasks require deliberate planning and thought.
- 🛠 Future dialogue systems may think about and plan their answers through an optimization process before turning them into text.
- 🌐 Optimizing in an abstract representation space, rather than over concrete token sequences, allows answers to be refined iteratively and more efficiently.
- 📈 Training an energy-based model requires showing it compatible and incompatible pairs and adjusting the network's weights so that low energy corresponds to compatible pairs.
- 🎯 For visual data, the prediction error of the representation serves as the energy, which yields a good compressed representation of visual reality.
Q & A
Why is the type of reasoning in LLMs considered primitive?
-The reasoning in LLMs is considered primitive because the amount of computation spent per generated token is constant. This means the system devotes the same resources to computing an answer regardless of the question's complexity.
How does the way humans handle complex problems differ from LLMs?
-When facing a complex problem, humans spend more time thinking it through, with predictive, iterative, and hierarchical elements. Unlike LLMs, humans adjust the effort and time they invest according to the difficulty of the problem.
How could LLMs be improved to achieve more advanced reasoning and planning?
-By building a good world model and then adding mechanisms such as persistent long-term memory and reasoning on top of it. Future dialogue systems will be able to think and plan before answering, which will make them very different from autoregressive LLMs.
What are System 1 and System 2, and what do they represent in human psychology?
-System 1 covers tasks you can accomplish without deliberate thought, such as driving or playing chess as an expert against a weaker player. System 2 covers tasks that require planning and deliberation, such as planning moves against an experienced chess player.
How will future dialogue systems plan their answers?
-Future dialogue systems will plan their answers through an optimization process in an abstract representation space; an autoregressive decoder then turns the planned representation into text, rather than text being generated token by token from the start.
What is an energy-based model and how does it work?
-An energy-based model is a function that outputs a single scalar, measuring how good an answer is for a given prompt. Through an optimization process, the model searches an abstract representation space for an answer that minimizes that scalar.
How is an energy-based model trained?
-Training typically involves showing the model compatible and incompatible pairs of X and Y and adjusting the neural network's parameters to produce the correct output: low energy for compatible pairs and higher energy otherwise. Contrastive and non-contrastive methods are the two common families of training approaches.
What makes a good representation of X and Y in an energy-based model?
-Good representations of X and Y are usually abstract conceptual representations rather than raw text. These representations can be run through the optimization process to minimize the output energy and then be decoded into a good answer.
Why is optimization in a continuous space more efficient than in a discrete space?
-In a continuous space you can use gradient descent to iteratively refine the answer toward an optimum. In a discrete space you must generate a large number of hypotheses and select the best one, which is computationally far less efficient.
How do we ensure an LLM does not give the same answer to every input?
-By minimizing the cross-entropy during training, the LLM increases the probability it assigns to the correct word while decreasing the probability assigned to incorrect words. Because probabilities must sum to one, this indirectly prevents the model from collapsing to the same output for every input.
How are energy-based models applied to visual data?
-For visual data, the energy can be the prediction error between representations: given an image or video and a corrupted or transformed version of it, the error in predicting the representation of the clean input measures their compatibility. This yields a compressed representation of visual reality.
Outlines
🤖 The type of reasoning in LLMs and its limitations
This segment discusses the type of reasoning that takes place in large language models (LLMs) and argues that it is very primitive. The reason is that the amount of computation spent per generated token is constant, so the system devotes the same computational resources to an answer whether the question is simple or complicated. This differs from human reasoning, where more time is invested in thinking through complex problems. The segment also raises the possibility of building more advanced reasoning systems on top of a good world model, augmented with mechanisms such as persistent long-term memory and reasoning.
🌟 A blueprint for future dialogue systems
This segment sketches a blueprint for future dialogue systems, emphasizing the importance of thinking and planning before answering. It proposes an energy-based approach in which a thought is elaborated through an optimization process in an abstract representation space and then turned into text by an autoregressive decoder. Such a system would differ from today's autoregressive language models and would be optimized in a continuous space using methods like gradient descent, which is far more efficient than searching over discrete token sequences.
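To make this blueprint concrete, here is a minimal PyTorch-style sketch of the inference loop described above, assuming a prompt encoder, a scalar energy network, and a latent answer vector z. All module names and dimensions are illustrative, not taken from any existing system; the point is only that the answer is planned by running gradient descent on the energy in representation space before any text is produced.

```python
import torch
import torch.nn as nn

# Illustrative components; real encoder/energy networks are unspecified here.
d = 256
prompt_encoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
energy_model = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))  # scalar energy

def plan_answer(prompt_repr: torch.Tensor, steps: int = 50, lr: float = 0.1) -> torch.Tensor:
    """Inference by optimization: refine a latent answer representation z so that
    energy(prompt, z) is minimized, then hand z to a decoder (not shown)."""
    # prompt_repr is assumed to already be a d-dimensional vector; in a real
    # system it would come from tokenized text.
    x = prompt_encoder(prompt_repr).detach()
    z = torch.zeros(d, requires_grad=True)        # initial guess for the abstract "thought"
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = energy_model(torch.cat([x, z]))  # scalar: ~0 = good answer, large = bad
        energy.backward()                         # gradient-based inference, not training
        opt.step()
    return z.detach()                             # would be fed to an autoregressive decoder
```

The decoder that turns the optimized z into text, and the question of how the energy network is trained, are taken up in the next segments.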
🧠 The challenges of training energy-based models
This segment discusses how to train an energy-based model so that it can judge, and help generate, good answers. An energy-based model is a function that outputs a scalar measuring how good an answer is for a given question. Training such a model involves showing it many compatible and incompatible pairs and adjusting the neural network's weights so that it correctly separates good answers from bad ones. The segment also contrasts contrastive and non-contrastive methods, and explains how regularization ensures the energy stays high outside the regions covered by the training data.
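As a rough illustration of the contrastive option mentioned here, a generic margin loss can be used: drive the energy of compatible (X, Y) pairs toward zero and push the energy of mismatched pairs above a margin. This is a textbook-style sketch, not the exact objective of any particular system; the energy_fn and the margin value are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_energy_loss(energy_fn, x, y_good, y_bad, margin: float = 1.0):
    """Push energy down on compatible pairs and up on incompatible ones.

    energy_fn(x, y) is assumed to return a non-negative scalar per pair.
    """
    e_good = energy_fn(x, y_good)           # want this near 0
    e_bad = energy_fn(x, y_bad)             # want this above the margin
    return e_good + F.relu(margin - e_bad)  # hinge on the negative pair
```

The drawback noted in the conversation applies directly: when the space of Y is large, covering it with enough bad examples to push the energy up everywhere becomes prohibitively expensive, which motivates the non-contrastive, regularizer-based alternative.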
📈 Energy functions and training methods for visual data
This segment turns to visual data and discusses how an energy function can be used to train models on images and video. By comparing the representation of an original image with that of a corrupted, shifted, or masked version, the model learns through prediction error, yielding a compressed, useful representation of visual reality. This approach is used in joint-embedding architectures and has proven effective: the learned representations work well as input to classification systems.
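The prediction-error energy described here can be sketched as follows, in the spirit of joint-embedding predictive architectures. The encoder and predictor modules are placeholders, and the detach on the target branch is one common way, among several, to help prevent collapse.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingEnergy(nn.Module):
    """Energy = error in predicting the representation of the clean input
    from the representation of a corrupted/masked/shifted version of it."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.encoder = encoder      # maps an image to a representation vector
        self.predictor = predictor  # maps the corrupted-view representation to a prediction

    def forward(self, clean_img: torch.Tensor, corrupted_img: torch.Tensor) -> torch.Tensor:
        target = self.encoder(clean_img).detach()          # representation of the good input
        pred = self.predictor(self.encoder(corrupted_img))
        return F.mse_loss(pred, target)                    # ~0 if the two views are compatible
```

Low energy means the second input really is a corrupted view of the first; unrelated images should produce a large prediction error.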
Keywords
💡Reasoning
💡Computation
💡Token
💡Prediction network
💡World model
💡Long-term memory
💡Optimization
💡Energy-based model
💡Abstract representation
💡Gradient descent
💡Concept space
Highlights
The type of reasoning in LLMs (large language models) is very primitive, because the amount of computation spent per generated token is constant.
Whether a question is simple, complicated, or impossible to answer, the computation the system can devote to it is constant, unlike the way humans handle complex problems.
Humans spend more time solving and answering complex problems; LLMs lack this ability to reason and iteratively adjust their understanding.
Future dialogue systems will be able to plan and optimize before answering, which makes them very different from autoregressive LLMs.
The system would use a large neural network whose output is a single scalar measuring how good an answer is for a question.
Thoughts are elaborated through an optimization process in an abstract representation space rather than generated directly as text.
Future dialogue systems will think about and plan their answers through optimization before turning them into text.
The objective of the optimization problem is defined over an abstract representation space, not over the space of possible text sequences.
With gradient descent and backpropagation, the abstract representation of the answer can be optimized toward the best answer.
The system can be trained with contrastive or non-contrastive methods, to ensure correct energies even for samples outside the training set.
The energy function must take low values on compatible pairs from the training set and higher values everywhere else.
A regularization term ensures the energy is higher on samples outside the training set.
In LLMs, minimizing the cross-entropy indirectly gives high probability to good sequences and low probability to bad ones.
For visual data, prediction error measures how well an image or video matches another, yielding a compressed representation of visual reality.
Classification systems that use these representations as input work well, demonstrating the effectiveness of the approach.
LLMs currently cannot reason or plan deeply, but future systems will be able to think and reason before producing an answer.
Improved reasoning and planning will make dialogue systems more efficient and accurate on complex problems.
Building a system capable of deep reasoning requires starting from an abstract space of concepts rather than relying directly on representations of language.
Transcripts
The type of reasoning that takes place in an LLM is very, very primitive, and the reason you can tell it's primitive is because the amount of computation that is spent per token produced is constant. So if you ask a question and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated: it's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. So essentially, it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something; the amount of computation the system will be able to devote to the answer is constant, or is proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it, because it's more difficult. There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so on.

Does this mean it's a fundamental flaw of LLMs, or does it mean that there's more to that question? Now you're just behaving like an LLM, answering immediately. No, it's just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said: persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model.

Whether it's difficult or not, the near future will tell, because a lot of people are working on reasoning and planning abilities for dialogue systems. Even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked with the language you're going to use to produce the answer, this idea of a mental model that allows you to plan what you're going to say before you say it, that is very important. I think there are going to be a lot of systems over the next few years that are going to have this capability, but the blueprint of those systems will be extremely different from autoregressive LLMs.
It's the same as the difference between what psychologists call System 1 and System 2 in humans. System 1 is the type of task that you can accomplish without deliberately, consciously thinking about how you do it; you just do it. You've done it enough that you can do it subconsciously, without thinking about it. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio. If you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either; you just recognize the pattern and you play. That's System 1: all the things that you do instinctively without really having to deliberately plan and think about them. And then there are all the tasks where you need to plan. If you are a not-so-experienced chess player, or you are experienced but you play against another experienced chess player, you think about all kinds of options. You think about it for a while, and you're much better if you have time to think about it than if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model, that's System 2, and this is what LLMs currently cannot do. So how do we get them to do this? How do we build a system that can do this kind of planning, or reasoning, that devotes more resources to complex problems than to simple problems? It's not going to be autoregressive prediction of tokens; it's going to be more something akin to inference of latent variables in what used to be called probabilistic models or graphical models and things of that type.
So basically the principle is like this: the prompt is like observed variables, and what the model does is basically measure to what extent an answer is a good answer for the prompt. Think of it as some gigantic neural net, but it's got only one output, and that output is a scalar number which is, let's say, zero if the answer is a good answer for the question and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers: the way you would do it is produce the prompt and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model.

But that energy-based model would need the model constructed by the LLM?

Well, really, what you would need to do would be to not search over possible strings of text that minimize that energy. Instead you would do this in an abstract representation space. So in the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar. It's an optimization process. So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here, not about training; the system has been trained already. So now we have an abstract representation of the thought of the answer, a representation of the answer. We feed that to basically an autoregressive decoder, which can be very simple, that turns this into text that expresses the thought. That, in my opinion, is the blueprint of future dialogue systems: they will think about their answer, plan their answer by optimization, before turning it into text.
And that is Turing complete.

Can you explain exactly what the optimization problem there is? What's the objective function? Just linger on it; you briefly described it, but over what space are you optimizing?

The space of representations. Abstract representations. So you have an abstract representation inside the system. You have a prompt; the prompt goes through an encoder, which produces a representation, and perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. We're ignoring for the moment the issue of how you train that system to measure whether an answer is a good answer for a question.

But suppose such a system could be created. What's the process, this kind of search-like process?

It's an optimization process. You can do this if the entire system is differentiable: that scalar output is the result of running the representation of the answer through some neural net, so by backpropagating gradients you can figure out how to modify the representation of the answer so as to minimize that output.

So that's still gradient-based. It's gradient-based inference. So now you have a representation of the answer in abstract space, and now you can turn it into text. And the cool thing about this is that the representation can now be optimized through gradient descent, but it is also independent of the language in which you're going to express the answer.

So you're operating in the abstract representation. This goes back to joint embedding: that it is better to work in, I don't know, to romanticize the notion, the space of concepts versus, yeah, the space of concrete sensory information.
Right. Okay, but can this do something like reasoning, which is what we're talking about?

Well, not really, only in a very simple way. Basically, you can think of those things as doing the kind of optimization I was talking about, except they optimize in the discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way, which is: generate a lot of hypotheses and then select the best ones. And that's incredibly wasteful in terms of computation, because you basically have to run your LLM for every possible generated sequence. So it's much better to do an optimization in continuous space, where you can do gradient descent, as opposed to generating tons of things and then selecting the best: you just iteratively refine your answer to go towards the best one. That's much more efficient, but you can only do this in continuous spaces with differentiable functions.

You're talking about the ability to think deeply, to reason deeply. How do you know what is an answer that's better or worse based on deep reasoning?
Right. So then we're asking the question of, conceptually, how do you train an energy-based model? An energy-based model is a function with a scalar output, just a number. You give it two inputs, X and Y, and it tells you whether Y is compatible with X or not. X you observe; let's say it's a prompt, an image, a video, whatever. And Y is a proposal for an answer, a continuation of the video, whatever. And the way it tells you that Y is compatible with X is that the output of that function will be zero if Y is compatible with X, and a positive, non-zero number if Y is not compatible with X.

How do you train a system like this? At a completely general level, you show it pairs of X and Y that are compatible, a question and the corresponding answer, and you train the parameters of the big neural net inside to produce zero. Now, that doesn't completely work, because the system might decide, well, I'm just going to say zero for everything. So you have to have a process to make sure that for a wrong Y, the energy will be larger than zero. And there you have two options. One is contrastive methods. A contrastive method is: you show an X and a bad Y, and you tell the system, give a high energy to this, push up the energy, change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods. The problem with this is that if the space of Y is large, the number of such contrastive samples you're going to have to show is gigantic. But people do this. They do this when you train a system with RLHF: basically what you're training is what's called a reward model, which is an objective function that tells you whether an answer is good or bad, and that's basically exactly what this is. So we already do this to some extent; we're just not using it for inference, we're just using it for training.

There is another set of methods which are non-contrastive, and I prefer those. Those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of X and Y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? And the way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. There are all kinds of specific ways to do this depending on the architecture, but that's the basic principle: if you push down the energy function for particular regions in the X-Y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.
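One concrete instance of this non-contrastive, regularizer-based idea is a VICReg-style criterion (variance-invariance-covariance regularization, from LeCun's group): minimize the prediction error between embeddings of compatible pairs while keeping the variance of every embedding dimension above a threshold and decorrelating the dimensions, so the low-energy region cannot collapse onto a single point. The sketch below is a simplified version with illustrative coefficients, not the exact published loss.

```python
import torch
import torch.nn.functional as F

def vicreg_style_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                      sim_w: float = 25.0, var_w: float = 25.0, cov_w: float = 1.0):
    """z_a, z_b: (batch, dim) embeddings of two compatible views.

    Invariance: pull compatible embeddings together (pushes energy down).
    Variance/covariance: regularizers that keep the low-energy region from
    collapsing, so energy is implicitly higher everywhere else.
    """
    n, d = z_a.shape
    invariance = F.mse_loss(z_a, z_b)

    # Variance term: hinge keeping each dimension's standard deviation above 1.
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    variance = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))

    # Covariance term: penalize off-diagonal covariance (decorrelate dimensions).
    z_a_c = z_a - z_a.mean(dim=0)
    z_b_c = z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    covariance = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d

    return sim_w * invariance + var_w * variance + cov_w * covariance
```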
We've been talking very generally, but what is a good X and a good Y? What is a good representation of X and Y? Because we've been talking about language, and if you just take language directly, that presumably is not good; there has to be some kind of abstract representation of ideas.

Yeah, so you can do this with language directly, by just saying X is a text and Y is the continuation of that text, or X is a question and Y is the answer.

But you're saying that's not going to take it far; that's just going to do what LLMs are already doing.

Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside the system there is a latent variable, call it Z, that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer.

So this kind of system could be trained in a very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. Currently this is very implicit in LLMs; it's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word, you automatically give low probability to other words, because you only have a finite amount of probability to go around; it has to sum to one. So when you minimize the cross-entropy, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probabilities it will give to the incorrect words. Indirectly, that gives a high probability to sequences of words that are good and a low probability to sequences of words that are bad, but it's very indirect, and it's not obvious why this actually works at all, because you're not doing it on the joint probability of all the symbols in a sequence; you sort of factorize that probability in terms of conditional probabilities over successive tokens.
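That indirect mechanism, raising the probability of the correct next token and thereby lowering all the others because the softmax must sum to one, is just ordinary next-token cross-entropy. A minimal illustration (toy vocabulary size and target index chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.zeros(vocab_size, requires_grad=True)  # toy "LLM" logits for one position
target = torch.tensor(3)                               # index of the correct next token

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

# The gradient of the cross-entropy w.r.t. the logits is softmax(logits) - one_hot(target):
# negative at the correct token (its probability goes up after a gradient step) and
# positive at every other token (their probabilities go down), because the softmax
# normalizes over the whole vocabulary.
print(logits.grad)
```

This only shapes conditional distributions over successive tokens; it never scores a whole (prompt, answer) pair directly the way an explicit energy function would.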
So how do you do this for visual data? We've been doing this with JEPA architectures, basically the joint embedding predictive architectures. The compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, or transformed version of that image or video, or a masked version. And then the energy of the system is the prediction error of the representation: the predicted representation of the good thing versus the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you, if this is a good image and this is a corrupted version, it will give you zero energy if one of them is effectively a corrupted version of the other, and a high energy if the two images are completely different. And hopefully that whole process gives you a really nice compressed representation of reality, of visual reality. And we know it does, because we then use those representations as input to a classification system, and that classification system works really nicely. Okay.