Can LLMs reason? | Yann LeCun and Lex Fridman
Summary
TLDR: The transcript discusses the limitations of large language models (LLMs) in reasoning and the potential of future AI systems. It highlights that LLMs spend a constant amount of computation per token produced, which does not scale with the complexity of the question. The conversation suggests that future dialogue systems will incorporate planning and reasoning, shifting from autoregressive models to systems that optimize abstract representations before generating text. The process involves training an energy-based model to distinguish good answers from bad ones, using techniques such as contrastive methods and regularizers. The transcript also touches on the concepts of system one and system two in human psychology, drawing parallels with AI's potential development toward more deliberate, complex problem-solving.
Takeaways
- The reasoning in large language models (LLMs) is considered primitive because a constant amount of computation is spent per token produced.
- The computation does not adjust to the complexity of the question, whether it is simple, complicated, or impossible to answer.
- Future dialogue systems may plan and reason before producing an answer, moving away from autoregressive LLMs.
- A well-constructed world model is essential for building systems that can perform complex reasoning and planning.
- Creating such systems may involve an optimization process: searching for an answer that minimizes a cost function representing the quality of the answer.
- Energy-based models are a potential approach, where the system outputs a scalar value indicating how good an answer is for a given prompt.
- Training an energy-based model involves showing it compatible and incompatible pairs of inputs and outputs and adjusting the neural network to produce appropriate energy values.
- Contrastive and non-contrastive methods are two approaches to training, with the latter using a regularizer to ensure high energy for incompatible inputs.
- Latent variables could allow an abstract representation to be manipulated so as to minimize the output energy, yielding a good answer.
- LLMs are currently trained only indirectly: raising the probability of the correct next token necessarily lowers the probability of all other tokens, which implicitly favors good sequences over bad ones (see the sketch after this list).
- For visual data, the energy of a system can be the prediction error between the representation predicted from a corrupted input and the representation of the uncorrupted input.
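The indirect-training point in the last bullet can be made concrete with a small toy sketch (my own illustration in PyTorch, not code from the conversation): because a softmax distributes a fixed budget of probability mass, a cross-entropy step that raises the probability of the observed next token necessarily lowers the probability of every other token.

```python
import torch
import torch.nn.functional as F

# Toy 5-token vocabulary; index 0 plays the role of the "correct" next token.
logits = torch.tensor([1.0, 0.5, 0.2, -0.3, -1.0])
target = torch.tensor([0])

before = F.softmax(logits, dim=-1)

# One gradient step on the next-token cross-entropy loss, taken with respect
# to the logits themselves (a stand-in for updating the network that produces them).
logits.requires_grad_(True)
loss = F.cross_entropy(logits.unsqueeze(0), target)
loss.backward()
after = F.softmax((logits - 1.0 * logits.grad).detach(), dim=-1)

print("p(correct):", before[0].item(), "->", after[0].item())            # increases
print("p(rest)   :", before[1:].sum().item(), "->", after[1:].sum().item())  # decreases
```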
Q & A
What is the main limitation of the reasoning process in large language models (LLMs)?
-The main limitation is that the amount of computation spent per token produced is constant, so the system does not devote more computation to complex questions than it does to simple ones.
How does human reasoning differ from the reasoning process in LLMs?
-Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, while LLMs do not adjust the amount of computation based on the complexity of the question.
What is the significance of a well-constructed world model in developing reasoning and planning abilities for dialogue systems?
-A well-constructed world model allows for the development of mechanisms like persistent long-term memory and more advanced reasoning. It helps the system to plan and optimize its responses before producing them, leading to more efficient and accurate outputs.
How does the proposed blueprint for future dialogue systems differ from autoregressive LLMs?
-The proposed blueprint involves non-autoregressive processes where the system thinks about and plans its answer using an abstract representation of thought before converting it into text, leading to more efficient and deliberate responses.
What is the role of an energy-based model in this context?
-An energy-based model is used to measure the compatibility of a proposed answer with a given prompt. It outputs a scalar value that indicates the 'goodness' of the answer, which can be optimized to produce better responses.
How is the representation of an answer optimized in the abstract space?
-The optimization process involves iteratively refining the abstract representation of the answer to minimize the output of the energy-based model, leading to a more accurate and well-thought-out response.
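As a rough sketch of what this kind of gradient-based refinement could look like (my own illustration; `energy_model`, `prompt_repr`, and `z_init` are hypothetical placeholders rather than anything specified in the conversation):

```python
import torch

def infer_answer_representation(energy_model, prompt_repr, z_init, steps=50, lr=0.1):
    """Inference-time optimization: refine an abstract answer representation z
    so that the scalar energy E(prompt, z) decreases. Any differentiable module
    returning a scalar will do for `energy_model`; its internals are unspecified."""
    prompt_repr = prompt_repr.detach()            # the prompt encoding is held fixed
    z = z_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        energy = energy_model(prompt_repr, z)     # scalar: low means "good answer"
        energy.backward()                         # gradient with respect to z
        optimizer.step()                          # move z toward lower energy
    return z.detach()                             # hand this to a decoder to produce text
```

The point of the sketch is only the control flow: the loop is the "thinking", and text generation happens once, after the loop.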
What are the two main methods for training an energy-based model?
-The two main methods are contrastive methods, where the system is shown compatible and incompatible pairs and adjusts its weights accordingly, and non-contrastive methods, which use a regularizer to ensure higher energy for incompatible pairs.
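For the contrastive option, a minimal sketch of one possible loss is below (a generic margin-based formulation chosen for illustration; the transcript does not commit to a specific loss, and `energy_model` is a placeholder):

```python
import torch
import torch.nn.functional as F

def contrastive_energy_loss(energy_model, x, y_good, y_bad, margin=1.0):
    """Push the energy of a compatible (x, y) pair toward zero and push the
    energy of an incompatible pair above a margin."""
    e_good = energy_model(x, y_good)        # should end up near 0
    e_bad = energy_model(x, y_bad)          # should end up above `margin`
    return e_good + F.relu(margin - e_bad)
```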
How does the concept of system one and system two in human psychology relate to the capabilities of LLMs?
-System one corresponds to tasks that can be accomplished without deliberate thought, similar to the instinctive responses of LLMs. System two involves tasks that require planning and deep thinking, which is what LLMs currently lack and need to develop for more advanced reasoning and problem-solving.
What is the main inefficiency in the current method of generating hypotheses in LLMs?
-The main inefficiency is that LLMs have to generate and evaluate a large number of possible sequences of tokens, which is a wasteful use of computation compared to optimizing in a continuous, differentiable space.
How can the energy function be trained to distinguish between good and bad answers?
-The energy function can be trained by showing it pairs of compatible and incompatible inputs and answers, adjusting the neural network weights to produce lower energy for good answers and higher energy for bad ones, using techniques like contrastive methods and regularizers.
What is an example of how energy-based models are used in visual data processing?
-In visual data processing, the energy of the system is the prediction error between the representation predicted from a corrupted (masked, shifted, or transformed) version of an image or video and the representation of the original, uncorrupted version. This helps in creating a compressed and accurate representation of visual reality.
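That prediction-error energy can be written down compactly (my own sketch; `encoder` and `predictor` are placeholder modules, and the stop-gradient on the clean branch is a common practical choice I am assuming, not something stated in the answer):

```python
import torch

def prediction_error_energy(encoder, predictor, x_clean, x_corrupted):
    """Energy of a (clean, corrupted) pair in a joint-embedding setup:
    the squared error between the representation predicted from the corrupted
    view and the representation of the clean view. A real system would add
    machinery to prevent the representations from collapsing."""
    with torch.no_grad():
        target_repr = encoder(x_clean)                      # representation of the uncorrupted input
    predicted_repr = predictor(encoder(x_corrupted))        # predict it from the corrupted view
    return torch.mean((predicted_repr - target_repr) ** 2)  # low energy = compatible pair
```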
Outlines
Primitive Reasoning in LLMs
This paragraph discusses the limitations of reasoning in large language models (LLMs) due to the constant amount of computation spent per token produced. It highlights that regardless of the complexity of the question, the system devotes a fixed computational effort to generating an answer. The speaker contrasts this with human reasoning, which involves more time and iterative processes for complex problems. The paragraph suggests that future advancements may include building upon the low-level world model with mechanisms like persistent long-term memory and reasoning, which are essential for more advanced dialogue systems.
The Future of Dialog Systems: Energy-Based Models
The speaker envisions the future of dialog systems as energy-based models that measure the quality of an answer for a given prompt. These models would operate on a scalar output, with a low value indicating a good answer and a high value indicating a poor one. The process involves optimization in an abstract representation space rather than searching through possible text strings. The speaker describes a system where an abstract thought is optimized and then fed into an auto-regressive decoder to produce text. This approach allows for more efficient computation and planning of responses, differing from the auto-regressive language models currently in use.
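Read as a pipeline, the blueprint in this paragraph might be wired up roughly as follows (a sketch under the assumption that every module name here, `encoder`, `predictor`, `energy_model`, `decoder`, is a hypothetical placeholder):

```python
import torch

def answer(prompt_tokens, encoder, predictor, energy_model, decoder, steps=50, lr=0.1):
    """Encode the prompt, propose an abstract answer representation, refine it
    by minimizing a scalar energy, and only then decode it into text with a
    simple autoregressive decoder."""
    prompt_repr = encoder(prompt_tokens).detach()              # prompt -> abstract representation
    z = predictor(prompt_repr).detach().requires_grad_(True)   # initial guess at the answer
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):                                     # "plan the answer by optimization"
        opt.zero_grad()
        energy_model(prompt_repr, z).backward()                # scalar energy: low = good answer
        opt.step()
    return decoder(prompt_repr, z.detach())                    # turn the refined thought into text
```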
Training Energy-Based Models and Conceptual Understanding
This paragraph delves into the conceptual framework of training energy-based models, which assess the compatibility between a prompt and a proposed answer. The speaker explains that these models are trained on pairs of compatible inputs and outputs, using a neural network to produce a scalar output that indicates compatibility. To ensure the model doesn't output a zero value for all inputs, contrastive methods and non-contrastive methods are used, with the latter involving a regularizer to ensure higher energy for incompatible pairs. The speaker also discusses the importance of an abstract representation of ideas, rather than direct language input, for effective training and reasoning in these models.
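For the non-contrastive route, the paragraph only says that a regularizer limits how much of the space can take low energy. One concrete way to do that (my choice for illustration, a variance penalty in the spirit of methods such as VICReg, and not necessarily the regularizer the speaker has in mind) is to train only on compatible pairs while keeping each embedding dimension from collapsing to a constant:

```python
import torch

def non_contrastive_loss(z_x, z_y, var_weight=1.0, eps=1e-4):
    """`z_x` and `z_y` are batched embeddings of compatible (x, y) pairs.
    The first term pushes their energy (squared distance) down; the variance
    hinge keeps every embedding dimension spread out, so the trivial solution
    of mapping everything to one point (zero energy everywhere) is penalized."""
    energy = torch.mean((z_x - z_y) ** 2)
    std_x = torch.sqrt(z_x.var(dim=0) + eps)
    std_y = torch.sqrt(z_y.var(dim=0) + eps)
    variance_penalty = torch.mean(torch.relu(1.0 - std_x)) + torch.mean(torch.relu(1.0 - std_y))
    return energy + var_weight * variance_penalty
```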
Visual Data and the Energy Function in JEPA Architectures
The final paragraph explores the application of energy functions in joint embedding predictive architectures (JEPA) for visual data. The energy of the system is defined as the prediction error between the representation predicted from a corrupted input and the representation of the original, uncorrupted input. This method provides a compressed representation of visual reality, which is effective for classification tasks. The speaker contrasts this approach with the indirect probability adjustments in language models, where increasing the probability of the correct word also decreases the probability of incorrect words, and emphasizes the benefits of a direct compatibility measure for visual data.
Keywords
reasoning
computation
token
prediction network
hierarchical element
persistent long-term memory
inference of latent variables
energy-based model
optimization
latent variables
system one and system two
Highlights
The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.
The computation in LLMs does not adjust based on the complexity of the question, whether it's simple, complicated, or impossible to answer.
Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, unlike the constant computation model of LLMs.
There is potential for building mechanisms like persistent long-term memory and reasoning on top of the low-level world model provided by language.
Future dialogue systems may involve planning and optimization before producing an answer, which is different from the current auto-regressive LMs.
The concept of system one and system two in humans is introduced, with system one being tasks accomplished without deliberate thought and system two requiring planning and thought.
LLMs currently lack the ability to use an internal world model for deliberate planning and thought, unlike human system two tasks.
The future of dialogue systems may involve non-auto-regressive prediction and optimization of latent variables in abstract representation spaces.
The idea of an energy-based model is introduced, where the model output is a scalar number representing the quality of an answer for a given prompt.
Optimization processes in continuous spaces are suggested to be more efficient than generating and selecting from many discrete sequences of tokens.
The concept of training an energy-based model with compatible and incompatible pairs of inputs and outputs is discussed.
Contrastive methods and non-contrastive methods are explained as approaches to train energy-based models with different sample requirements.
The importance of an abstract representation of ideas is emphasized for efficient reasoning and planning in dialogue systems.
The indirect method of training LLMs through probability distribution over tokens is highlighted, including its limitations.
The potential application of energy-based models in visual data processing is mentioned, using joint embedding architectures.
The energy function's role in determining the compatibility between inputs and outputs is discussed, with the goal of producing a compressed representation of reality.
Transcripts
The type of reasoning that takes place in LLMs is very, very primitive, and the reason you can tell it's primitive is because the amount of computation that is spent per token produced is constant. So if you ask a question, and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated: it's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. So essentially, it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something: the amount of computation the system will be able to devote to the answer is constant, or is proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it, because it's more difficult.

There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so on. Does this mean it's a fundamental flaw of LLMs, or does it mean that... well, there's more to that question. Now you're just behaving like an LLM, immediately answering. No: it's just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said: persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model?

Okay, whether it's difficult or not, the near future will say, because a lot of people are working on reasoning and planning abilities for dialogue systems. Even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked with the language you're going to use to produce the answer, this idea of a mental model that allows you to plan what you're going to say before you say it, that is very important. I think there are going to be a lot of systems over the next few years that are going to have this capability, but the blueprint of those systems will be extremely different from autoregressive LLMs. It's the same as the difference between what psychology calls system one and system two in humans. System one is the type of task that you can accomplish without deliberately, consciously thinking about how you do it. You've done it enough that you can just do it subconsciously, without thinking about it. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio. If you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either; you just recognize the pattern and you play. That's system one: all the things that you do instinctively without really having to deliberately plan and think about them. And then there are all the tasks where you need to plan. So if you are a not-too-experienced chess player, or you are experienced but you play against another experienced chess player, you think about all kinds of options. You think about it for a while, and you're much better if you have time to think about it than you are if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model (that's system two), is what LLMs currently cannot do. So how do we get them to do this? How do we build a system that can do this kind of planning, or reasoning, that devotes more resources to complex problems than to simple problems? It's not going to be autoregressive prediction of tokens; it's going to be more something akin to inference of latent variables in what used to be called probabilistic models or graphical models and things of that type.

Basically the principle is like this: the prompt is like observed variables, and what the model does is basically measure to what extent an answer is a good answer for a prompt. So think of it as some gigantic neural net, but it's got only one output, and that output is a scalar number, which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers. The way you would do it is: produce the prompt, and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model.

But that energy-based model would need the model constructed by the LLM?

Well, really what you would need to do is not search over possible strings of text that minimize that energy, but do this in abstract representation space. So in the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar. It's an optimization process. So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here, not about training; the system has been trained already. So now we have an abstract representation of the thought of the answer, a representation of the answer. We feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses this thought. That, in my opinion, is the blueprint of future dialogue systems: they will think about their answer, plan their answer by optimization, before turning it into text. And that is Turing complete.

Can you explain exactly what the optimization problem there is? What's the objective function? Just linger on it; you kind of briefly described it, but over what space are you optimizing? The space of representations?

The space of abstract representations. So you have an abstract representation inside the system. You have a prompt; the prompt goes through an encoder, produces a representation, and perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. Now, we sort of ignore for a moment the issue of how you train that system to measure whether an answer is a good answer.

But suppose such a system could be created. What's the process, this kind of search-like process?

It's an optimization process. You can do this if the entire system is differentiable. That scalar output is the result of running the representation of the answer through some neural net, so by backpropagating gradients you can figure out how to modify the representation of the answer so as to minimize that.

So that's still gradient-based?

It's gradient-based inference. So now you have a representation of the answer in abstract space, and now you can turn it into text. And the cool thing about this is that the representation now can be optimized through gradient descent, but is also independent of the language in which you're going to express the answer.

Right, so you're operating in the abstract representation. I mean, this goes back to the joint embedding: that it is better to work in, to romanticize the notion, the space of concepts versus the space of concrete sensory information. Okay, but can this do something like reasoning, which is what we're talking about?

Well, not really, only in a very simple way. Basically you can think of those things as doing the kind of optimization I was talking about, except they optimize in the discrete space, which is the space of possible sequences of tokens. And they do this optimization in a horribly inefficient way, which is: generate a lot of hypotheses and then select the best ones. That's incredibly wasteful in terms of computation, because you basically have to run your LLM for every possible generated sequence. It's incredibly wasteful. So it's much better to do an optimization in continuous space, where you can do gradient descent, as opposed to generating tons of things and then selecting the best: you just iteratively refine your answer to go toward the best one. That's much more efficient. But you can only do this in continuous spaces with differentiable functions.

You're talking about the reasoning, the ability to think deeply, or to reason deeply. How do you know what is an answer that's better or worse based on deep reasoning?

Right, so then we're asking the question of, conceptually, how do you train an energy-based model. An energy-based model is a function with a scalar output, just a number. You give it two inputs, x and y, and it tells you whether y is compatible with x or not. x you observe; let's say it's a prompt, an image, a video, whatever. And y is a proposal for an answer, a continuation of the video, whatever. And it tells you whether y is compatible with x. The way it tells you that y is compatible with x is that the output of that function will be zero if y is compatible with x, and a positive, non-zero number if y is not compatible with x. How do you train a system like this? At a completely general level, you show it pairs of x and y that are compatible, a question and the corresponding answer, and you train the parameters of the big neural net inside to produce zero. Now, that doesn't completely work, because the system might decide, well, I'm just going to say zero for everything. So now you have to have a process to make sure that for a wrong y, the energy would be larger than zero.

And there you have two options. One is contrastive methods. A contrastive method is: you show an x and a bad y, and you tell the system, give a high energy to this, push up the energy, change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods. The problem with this is that if the space of y is large, the number of such contrastive samples you're going to have to show is gigantic. But people do this. They do this when you train a system with RLHF: basically what you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad, and that's basically exactly what this is. So we already do this to some extent; we're just not using it for inference, we're just using it for training.

There is another set of methods which are non-contrastive, and I prefer those. Those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of x, y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? The way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. There are all kinds of different specific ways to do this, depending on the architecture, but that's the basic principle: if you push down the energy function for particular regions in the x, y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.

We've been talking very generally, but what is a good x and a good y? What is a good representation of x and y? Because we've been talking about language, and if you just take language directly, that presumably is not good, so there has to be some kind of abstract representation of ideas.

Yeah, so you can do this with language directly, by just saying x is a text and y is the continuation of that text, or x is a question and y is the answer.

But you're saying that's not going to cut it; I mean, that's what LLMs are doing.

Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside of the system there is a latent variable, call it z, that you can manipulate so as to minimize the output energy, then that z can be viewed as a representation of a good answer that you can translate into a y that is a good answer. So this kind of system could be trained in a very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. Currently this is very implicit in LLMs; it's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word, you automatically give low probability to other words, because you only have a finite amount of probability to go around; they have to sum to one. So when you minimize the cross-entropy, or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Indirectly, that gives a high probability to sequences of words that are good and a low probability to sequences of words that are bad, but it's very indirect, and it's not obvious why this actually works at all, because you're not doing it on the joint probability of all the symbols in a sequence; you sort of factorize that probability in terms of conditional probabilities over successive tokens.

So how do you do this for visual data?

We've been doing this with the JEPA architectures, basically the joint embedding predictive architectures. There, the compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, or transformed version of that image or video, or a masked version. Then the energy of the system is the prediction error of the representation: the predicted representation of the good thing versus the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you, if this is a good image and this is a corrupted version, it will give you zero energy if one of them is effectively a corrupted version of the other, and a high energy if the two images are completely different. And hopefully that whole process gives you a really nice compressed representation of reality, of visual reality. And we know it does, because we then use those representations as input to a classification system, and that classification system works really nicely.