# Can LLMs reason? | Yann LeCun and Lex Fridman

### Summary

TL;DR: The transcript discusses the limitations of large language models (LLMs) in reasoning and the potential of future AI systems. It highlights that LLMs spend a constant amount of computation per token produced, which does not scale with the complexity of the question. The conversation suggests that future dialogue systems will incorporate planning and reasoning, shifting from autoregressive models to systems that optimize abstract representations before generating text. The process involves training an energy-based model to distinguish good answers from bad ones, using techniques such as contrastive methods and regularizers. The transcript also touches on the concepts of system one and system two from human psychology, drawing parallels with AI's potential development toward more deliberate problem-solving.

### Takeaways

- 🧠 The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.
- 🔄 The computation does not adjust based on the complexity of the question, whether it's simple, complicated, or impossible to answer.
- 🚀 The future of dialogue systems may involve planning and reasoning before producing an answer, moving away from autoregressive LLMs.
- 🌐 A well-constructed world model is essential for building systems that can perform complex reasoning and planning.
- 🛠️ The process of creating such systems may involve an optimization process, searching for an answer that minimizes a cost function, representing the quality of the answer.
- 🎯 Energy-based models could be a potential approach, where the system outputs a scalar value indicating the goodness of an answer for a given prompt.
- 🔄 The training of energy-based models involves showing compatible and non-compatible pairs of inputs and outputs, adjusting the neural network to produce appropriate energy values.
- 🌟 Contrastive and non-contrastive methods are two approaches to training, the latter using a regularizer that limits the volume of low-energy space rather than explicit negative examples.
- 📈 The concept of latent variables could allow for the manipulation of an abstract representation to minimize output energy, leading to a good answer.
- 🔢 Today's LLM training enforces this only indirectly, through probability adjustments that raise the probability of correct words and sequences while suppressing incorrect ones.
- 🖼️ For visual data, the energy of a system can be represented by the prediction error between a corrupted input and its uncorrupted representation.

### Q & A

### What is the main limitation of the reasoning process in large language models (LLMs)?

-The main limitation is that the amount of computation spent per token produced is constant, so the system cannot devote more resources to a complex question than to a simple one; total compute scales only with the length of the answer.

### How does human reasoning differ from the reasoning process in LLMs?

-Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, while LLMs do not adjust the amount of computation based on the complexity of the question.

### What is the significance of a well-constructed world model in developing reasoning and planning abilities for dialogue systems?

-A well-constructed world model allows for the development of mechanisms like persistent long-term memory and more advanced reasoning. It helps the system to plan and optimize its responses before producing them, leading to more efficient and accurate outputs.

### How does the proposed blueprint for future dialogue systems differ from autoregressive LLMs?

-The proposed blueprint involves non-autoregressive processes where the system thinks about and plans its answer using an abstract representation of thought before converting it into text, leading to more efficient and deliberate responses.

### What is the role of an energy-based model in this context?

-An energy-based model is used to measure the compatibility of a proposed answer with a given prompt. It outputs a scalar value that indicates the 'goodness' of the answer, which can be optimized to produce better responses.

### How is the representation of an answer optimized in the abstract space?

-The optimization process involves iteratively refining the abstract representation of the answer to minimize the output of the energy-based model, leading to a more accurate and well-thought-out response.
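A minimal sketch of what that inner loop could look like, assuming a differentiable energy network; the module names and sizes below are illustrative, not from the conversation:

```python
import torch

# Hypothetical energy network (illustrative sizes): maps the pair
# (prompt representation, answer representation) to one scalar,
# where a low output means "good answer for this prompt".
energy_net = torch.nn.Sequential(
    torch.nn.Linear(2 * 512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

def refine_answer(prompt_repr, z_init, steps=50, lr=0.1):
    """Gradient-based inference: iteratively refine the abstract answer
    representation z so as to minimize the scalar energy."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        energy = energy_net(torch.cat([prompt_repr, z], dim=-1)).sum()
        opt.zero_grad()
        energy.backward()  # gradients flow back to the input z itself
        opt.step()         # move z toward lower energy
    return z.detach()      # handed to a separate decoder to render as text

prompt_repr = torch.randn(1, 512)  # pretend output of a prompt encoder
z0 = torch.randn(1, 512)           # initial guess at the answer representation
z_star = refine_answer(prompt_repr, z0)
```

The key design point is that the optimizer updates the *representation*, not the network weights: this is inference, not training.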

### What are the two main methods for training an energy-based model?

-The two main methods are contrastive methods, where the system is shown compatible and incompatible pairs and adjusts its weights to push up the energy of the incompatible ones, and non-contrastive methods, which instead use a regularizer that limits the volume of space that can take low energy, so that energy rises automatically everywhere off the training pairs.

### How does the concept of system one and system two in human psychology relate to the capabilities of LLMs?

-System one corresponds to tasks that can be accomplished without deliberate thought, similar to the instinctive responses of LLMs. System two involves tasks that require planning and deep thinking, which is what LLMs currently lack and need to develop for more advanced reasoning and problem-solving.

### What is the main inefficiency in the current method of generating hypotheses in LLMs?

-The main inefficiency is that LLMs have to generate and evaluate a large number of possible sequences of tokens, which is a wasteful use of computation compared to optimizing in a continuous, differentiable space.
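For contrast, this is the generate-and-select pattern the answer calls wasteful, sketched with hypothetical `llm_sample` and `reward_model` stand-ins (not real APIs); every candidate costs its own forward pass, which is exactly what gradient-based refinement avoids:

```python
def best_of_n(llm_sample, reward_model, prompt, n=64):
    """Discrete search: sample n complete answers, score each, keep the
    best. Cost scales linearly with n, since every candidate requires
    its own full forward pass through the LLM."""
    candidates = [llm_sample(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]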

### How can the energy function be trained to distinguish between good and bad answers?

-The energy function can be trained by showing it pairs of compatible and incompatible inputs and answers, adjusting the neural network weights to produce lower energy for good answers and higher energy for bad ones, using techniques like contrastive methods and regularizers.
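A toy version of that training signal might look as follows; the hinge-with-margin form of the contrastive term is one standard choice, assumed here because the conversation states only the principle:

```python
import torch

def contrastive_ebm_loss(energy_net, x, y_good, y_bad, margin=1.0):
    """Pull the energy of a compatible pair toward zero and push the
    energy of a mismatched pair above a margin."""
    e_good = energy_net(x, y_good)   # should approach 0
    e_bad = energy_net(x, y_bad)     # should exceed `margin`
    push_down = e_good.pow(2).mean()
    push_up = torch.relu(margin - e_bad).mean()
    # A non-contrastive method would drop `push_up` and instead add a
    # regularizer limiting the volume of space that can take low energy.
    return push_down + push_up
```

Here `energy_net` is a stand-in for any module taking (X, Y) to a scalar; the collapse-to-zero failure mode is exactly what the `push_up` term (or, in non-contrastive methods, the regularizer) prevents.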

### What is an example of how energy-based models are used in visual data processing?

-In visual data processing, the energy of the system is the prediction error between the representation predicted from a corrupted (shifted, transformed, or masked) version of an image or video and the representation of the original, uncorrupted version. This helps in creating a compressed yet accurate representation of visual reality.
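A minimal sketch of that idea in the spirit of a joint-embedding predictive setup, with placeholder encoder and predictor modules (the exact architecture is not specified in the conversation):

```python
import torch

encoder = torch.nn.Linear(1024, 256)   # placeholder image encoder
predictor = torch.nn.Linear(256, 256)  # predicts the clean representation

def jepa_energy(x_clean, x_corrupted):
    """Energy = prediction error in representation space: how well the
    representation of the corrupted input predicts the representation
    of the original, uncorrupted input."""
    with torch.no_grad():
        target = encoder(x_clean)             # representation of the original
    pred = predictor(encoder(x_corrupted))    # predicted representation
    return (pred - target).pow(2).mean()      # low energy = compatible pair
```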

### Outlines

### 🤖 Primitive Reasoning in LLMs

This paragraph discusses the limitations of reasoning in large language models (LLMs) due to the constant amount of computation spent per token produced. It highlights that regardless of the complexity of the question, the system devotes a fixed computational effort to generating an answer. The speaker contrasts this with human reasoning, which involves more time and iterative processes for complex problems. The paragraph suggests that future advancements may include building upon the low-level world model with mechanisms like persistent long-term memory and reasoning, which are essential for more advanced dialogue systems.

### 🌟 The Future of Dialog Systems: Energy-Based Models

The speaker envisions the future of dialog systems as energy-based models that measure the quality of an answer for a given prompt. These models would operate on a scalar output, with a low value indicating a good answer and a high value indicating a poor one. The process involves optimization in an abstract representation space rather than searching through possible text strings. The speaker describes a system where an abstract thought is optimized and then fed into an auto-regressive decoder to produce text. This approach allows for more efficient computation and planning of responses, differing from the auto-regressive language models currently in use.

### 📈 Training Energy-Based Models and Conceptual Understanding

This paragraph delves into the conceptual framework of training energy-based models, which assess the compatibility between a prompt and a proposed answer. The speaker explains that these models are trained on pairs of compatible inputs and outputs, using a neural network to produce a scalar output that indicates compatibility. To ensure the model doesn't output a zero value for all inputs, contrastive methods and non-contrastive methods are used, with the latter involving a regularizer to ensure higher energy for incompatible pairs. The speaker also discusses the importance of an abstract representation of ideas, rather than direct language input, for effective training and reasoning in these models.

### 🖼️ Visual Data and the Energy Function in JEPA Architectures

The final paragraph explores the application of energy functions in joint embedding predictive architectures (JEPA) for visual data. The energy of the system is defined as the prediction error between the representation predicted from a corrupted input and the representation of the original, uncorrupted input. This method provides a compressed representation of visual reality that is effective for classification tasks. The speaker contrasts this approach with the indirect probability adjustments in language models, where increasing the probability of the correct word also decreases the probability of incorrect words, and emphasizes the benefits of a direct compatibility measure for visual data.

### Keywords

- 💡 reasoning
- 💡 computation
- 💡 token
- 💡 prediction network
- 💡 hierarchical element
- 💡 persistent long-term memory
- 💡 inference of latent variables
- 💡 energy-based model
- 💡 optimization
- 💡 latent variables
- 💡 system one and system two

### Highlights

The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.

The computation in LLMs does not adjust based on the complexity of the question, whether it's simple, complicated, or impossible to answer.

Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, unlike the constant computation model of LLMs.

There is potential for building mechanisms like persistent long-term memory and reasoning on top of the low-level world model provided by language.

Future dialogue systems may involve planning and optimization before producing an answer, which is different from the current auto-regressive LMs.

The concept of system one and system two in humans is introduced, with system one being tasks accomplished without deliberate thought and system two requiring planning and thought.

LLMs currently lack the ability to use an internal world model for deliberate planning and thought, unlike human system two tasks.

The future of dialogue systems may involve non-auto-regressive prediction and optimization of latent variables in abstract representation spaces.

The idea of an energy-based model is introduced, where the model output is a scalar number representing the quality of an answer for a given prompt.

Optimization processes in continuous spaces are suggested to be more efficient than generating and selecting from many discrete sequences of tokens.

The concept of training an energy-based model with compatible and incompatible pairs of inputs and outputs is discussed.

Contrastive methods and non-contrastive methods are explained as approaches to train energy-based models with different sample requirements.

The importance of an abstract representation of ideas is emphasized for efficient reasoning and planning in dialogue systems.

The indirect method of training LLMs through probability distribution over tokens is highlighted, including its limitations.

The potential application of energy-based models in visual data processing is mentioned, using joint embedding architectures.

The energy function's role in determining the compatibility between inputs and outputs is discussed, with the goal of producing a compressed representation of reality.

### Transcripts

The type of reasoning that takes place in LLMs is very, very primitive. The reason you can tell it's primitive is that the amount of computation spent per token produced is constant. So if you ask a question, and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated: it's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. So essentially it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something; the amount of computation the system will be able to devote to the answer is constant, or proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it, because it's more difficult.
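To make the fixed-cost claim concrete, here is a back-of-envelope version of the estimate described above. The 2-FLOPs-per-parameter figure is a common rule of thumb for a transformer forward pass, assumed here rather than taken from the conversation:

```latex
% Rough cost of producing an answer with an autoregressive LLM.
% Assumption (rule of thumb, not from the conversation): one forward
% pass costs about 2 FLOPs per parameter per generated token.
\[
  C_{\text{answer}} \;\approx\;
  \underbrace{2P}_{\text{FLOPs per token, fixed by the network}}
  \times\; N_{\text{tokens}}
\]
% P: parameter count (set by depth/width, e.g. 36 or 92 layers).
% N_tokens: length of the answer. C grows with answer length only,
% never with the difficulty of the question.
```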

There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so on. Does this mean that's a fundamental flaw of LLMs, or does it mean that there's more to that question?

Now you're just behaving like an LLM: immediately answering. No, it's just the low-level world model, on top of which we can then build some of these kinds of mechanisms, like you said: persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model.

Okay, whether it's difficult or not, the near future will tell, because a lot of people are working on reasoning and planning abilities for dialogue systems. Even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked with the language you're going to use to produce the answer, this idea of a mental model that allows you to plan what you're going to say before you say it, that is very important. I think there are going to be a lot of systems over the next few years that have this capability, but the blueprint of those systems will be extremely different from autoregressive LLMs.

It's the same difference as the difference between what psychology calls system one and system two in humans. System one is the type of task that you can accomplish without deliberately, consciously thinking about how you do it; you've done it enough that you can just do it subconsciously, without thinking about it. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio. If you are a very experienced chess player, you can play against an inexperienced chess player without really thinking either; you just recognize the pattern and you play. That's system one: all the things that you do instinctively, without really having to deliberately plan and think about them. And then there are all the tasks where you need to plan. If you are a not-too-experienced chess player, or you are experienced but you play against another experienced chess player, you think about all kinds of options; you think about it for a while, and you're much better if you have time to think about it than if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model, that's system two. This is what LLMs currently cannot do.

So how do we get them to do this? How do we build a system that can do this kind of planning, or reasoning, that devotes more resources to complex problems than to simple problems? It's not going to be autoregressive prediction of tokens; it's going to be something more akin to inference of latent variables in what used to be called probabilistic models or graphical models.

Basically, the principle is like this: the prompt is like observed variables, and what the model does is basically measure to what extent an answer is a good answer for a prompt. Think of it as some gigantic neural net, but with only one output, and that output is a scalar number which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers: you produce the prompt, and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model.

But that energy-based model would need the model constructed by the LLM?

Well, really, what you would need to do is not to search over possible strings of text that minimize that energy; you would do this in abstract representation space. In the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar. It's an optimization process. So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here, not about training; the system has been trained already. So now we have an abstract representation of the thought, of the answer. We feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses this thought. That, in my opinion, is the blueprint of future dialogue systems: they will think about their answer, plan their answer by optimization before turning it into text. And that is Turing complete.

Can you explain exactly what the optimization problem there is? What's the objective function? Just linger on it; you kind of briefly described it, but over what space are you optimizing?

The space of representations.

Abstract representations?

Abstract representations. So you have an abstract representation inside the system. You have a prompt; the prompt goes through an encoder and produces a representation, which perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. We sort of ignore for a moment the issue of how you train that system to measure whether an answer is a good answer; suppose such a system could be created.

But what's the process, this kind of search-like process?

It's an optimization process. You can do this if the entire system is differentiable: that scalar output is the result of running the representation of the answer through some neural net. Then, by back-propagating gradients, you can figure out how to modify the representation of the answer so as to minimize that. So that's still gradient-based; it's gradient-based inference. So now you have a representation of the answer in abstract space, and now you can turn it into text.
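Putting the pieces of this blueprint together, a skeleton of the whole inference path might look like the following. All modules and dimensions are placeholders; only the flow itself, encode, predict, refine by gradient descent, decode, comes from the description above (the refinement step mirrors the loop sketched earlier in the Q & A section):

```python
import torch

dim = 512
encoder = torch.nn.Linear(dim, dim)         # prompt -> representation
predictor = torch.nn.Linear(dim, dim)       # first guess at the answer repr
energy_head = torch.nn.Linear(2 * dim, 1)   # scalar compatibility score
decoder = torch.nn.Linear(dim, dim)         # stand-in for the AR decoder

def answer(prompt_vec, steps=30, lr=0.05):
    h = encoder(prompt_vec).detach()                 # encode the prompt
    z = predictor(h).detach().requires_grad_(True)   # initial "thought"
    for _ in range(steps):                           # gradient-based inference
        e = energy_head(torch.cat([h, z], dim=-1)).sum()
        (g,) = torch.autograd.grad(e, z)             # dE/dz
        z = (z - lr * g).detach().requires_grad_(True)
    return decoder(z)                                # thought -> text

print(answer(torch.randn(1, dim)).shape)  # torch.Size([1, 512])
```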

And the cool thing about this is that the representation can now be optimized through gradient descent, but it is also independent of the language in which you're going to express the answer.

Right, so you're operating in the abstract representation. This goes back to the joint embedding idea: that it is better to work in, to romanticize the notion, the space of concepts, versus the space of concrete sensory information.

Right. Okay, but can this do something like reasoning, which is what we're talking about?

Well, not really, only in a very simple way. Basically, you can think of those systems as doing the kind of optimization I was talking about, except they optimize in the discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way: generate a lot of hypotheses and then select the best ones. That's incredibly wasteful in terms of computation, because you basically have to run your LLM for every possible generated sequence. So it's much better to do optimization in a continuous space, where you can do gradient descent: instead of generating tons of things and then selecting the best, you just iteratively refine your answer to go toward the best. That's much more efficient, but you can only do this in continuous spaces with differentiable functions.

You're talking about the ability to think deeply, to reason deeply. How do you know what is an answer that's better or worse, based on deep reasoning?

Right, so then we're asking the question of, conceptually, how you train an energy-based model. An energy-based model is a function with a scalar output, just a number. You give it two inputs, X and Y, and it tells you whether Y is compatible with X or not. X is what you observe: let's say a prompt, an image, a video, whatever. And Y is a proposal for an answer, a continuation of the video, whatever. And it tells you whether Y is compatible with X. The way it tells you is that the output of that function will be zero if Y is compatible with X, and a positive, non-zero number if Y is not compatible with X.

Okay, how do you train a system like this? At a completely general level, you show it pairs of X and Y that are compatible, e.g., a question and the corresponding answer, and you train the parameters of the big neural net inside to produce zero. Now, that doesn't completely work, because the system might decide, well, I'm just going to say zero for everything. So you have to have a process to make sure that for a wrong Y, the energy would be larger than zero. And there you have two options. One is contrastive methods: you show an X and a bad Y, and you tell the system to give a high energy to this, push up the energy, change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods. The problem with this is that if the space of Y is large, the number of such contrastive samples you're going to have to show is gigantic. But people do this: when you train a system with RLHF, basically what you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad. And that's basically exactly what this is. So we already do this to some extent; we're just not using it for inference, we're just using it for training.

There is another set of methods which are non-contrastive, and I prefer those. Those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of X and Y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? The way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. There are all kinds of different specific ways to do this, depending on the architecture, but that's the basic principle: if you push down the energy function for particular regions in the X-Y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.
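In symbols, the two options described above could be written as follows; the hinge form of the contrastive term is a standard choice rather than something the speaker specifies:

```latex
% Contrastive: push down on compatible pairs (x, y+),
% push up on mismatched pairs (x, y-) until they clear a margin m.
\[
  \mathcal{L}_{\text{contrastive}} =
      E_\theta(x, y^{+})
      + \max\bigl(0,\; m - E_\theta(x, y^{-})\bigr)
\]
% Non-contrastive: push down on training pairs only, while a
% regularizer R limits the total volume of (x, y) space that can
% take low energy, so energy rises everywhere else automatically.
\[
  \mathcal{L}_{\text{non-contrastive}} =
      E_\theta(x, y^{+}) + \lambda\, R(\theta)
\]
```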

We've been talking very generally, but what is a good X and a good Y? What is a good representation of X and Y? Because we've been talking about language, and if you just take language directly, that presumably is not good; there has to be some kind of abstract representation of ideas.

Yeah, so you can do this with language directly, by just saying X is a text and Y is the continuation of that text, or X is a question and Y is the answer.

But you're saying that's not going to cut it; I mean, that's going to do what LLMs are doing.

Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside the system there is a latent variable, call it Z, that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer. So this kind of system could be trained in a very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. Currently this is very implicit in LLMs; it's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word, you automatically give low probability to other words, because you only have a finite amount of probability to go around; they have to sum to one. So when you minimize the cross-entropy, or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Indirectly, that gives high probability to sequences of words that are good and low probability to sequences of words that are bad, but it's very indirect, and it's not obvious why this actually works at all, because you're not doing it on the joint probability of all the symbols in a sequence; you sort of factorize that probability in terms of conditional probabilities over successive tokens.
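The factorization being referred to is the standard autoregressive decomposition, where training with next-token cross-entropy shapes each conditional distribution rather than the joint sequence probability directly:

```latex
% Autoregressive factorization of a sequence y = (y_1, ..., y_T):
\[
  p(y \mid x) = \prod_{t=1}^{T} p\bigl(y_t \mid x, y_{<t}\bigr)
\]
% Next-token cross-entropy: raising p(correct token) necessarily
% lowers the probability of every other token at that position,
% since each conditional distribution sums to 1.
\[
  \mathcal{L} = -\sum_{t=1}^{T} \log p\bigl(y_t \mid x, y_{<t}\bigr)
\]
```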

So how do you do this for visual data? We've been doing this with JEPA architectures, basically the joint embedding predictive architecture. There, the compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, or transformed version of that image or video, or a masked version. And then the energy of the system is the prediction error of the representation: the predicted representation of the good thing versus