Can LLMs reason? | Yann LeCun and Lex Fridman

Lex Clips
13 Mar 2024 · 17:54

Summary

TL;DR: The transcript discusses the limitations of large language models (LLMs) in reasoning and the potential for future AI systems. It highlights that LLMs allocate a constant amount of computation per token produced, which doesn't scale with the complexity of the question. The conversation suggests that future dialogue systems will incorporate planning and reasoning, with a shift from autoregressive models to systems that optimize abstract representations before generating text. The process involves training an energy-based model to distinguish good answers from bad ones, using techniques like contrastive methods and regularizers. The transcript also touches on the concept of system one and system two in human psychology, drawing parallels with AI's potential development towards more complex, deliberate problem-solving.

Takeaways

  • 🧠 The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.
  • 🔄 The computation does not adjust based on the complexity of the question, whether it's simple, complicated, or impossible to answer.
  • 🚀 The future of dialogue systems may involve planning and reasoning before producing an answer, moving away from auto-regressive LMs.
  • 🌐 A well-constructed world model is essential for building systems that can perform complex reasoning and planning.
  • 🛠️ The process of creating such systems may involve an optimization process, searching for an answer that minimizes a cost function, representing the quality of the answer.
  • 🎯 Energy-based models could be a potential approach, where the system outputs a scalar value indicating the goodness of an answer for a given prompt.
  • 🔄 The training of energy-based models involves showing compatible and non-compatible pairs of inputs and outputs, adjusting the neural network to produce appropriate energy values.
  • 🌟 Contrastive methods and non-contrastive methods are two approaches to training, with the latter using regularization to ensure high energy for incompatible inputs.
  • 📈 The concept of latent variables could allow for the manipulation of an abstract representation to minimize output energy, leading to a good answer.
  • 🔢 Current LLM training enforces this contrast only indirectly: raising the probability of the correct word necessarily lowers the probability of every other word, since the distribution must sum to one (see the sketch after this list).
  • 🖼️ For visual data, the energy of a system can be represented by the prediction error between a corrupted input and its uncorrupted representation.
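
As a concrete illustration of that indirect probability adjustment, here is a minimal sketch (not from the talk; the vocabulary size and numbers are invented) showing how a normalized softmax output couples token probabilities: raising the logit of the "correct" token automatically lowers the probability of every other token, because the distribution must sum to one.

    import math

    def softmax(logits):
        # exponentiate and normalize so the probabilities sum to one
        exps = [math.exp(v) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    # hypothetical logits over a tiny three-token vocabulary
    logits = [2.0, 1.0, 0.5]
    print(softmax(logits))   # roughly [0.63, 0.23, 0.14]

    # "training" nudges the logit of the correct token (index 0) upward;
    # the other probabilities drop even though their logits are unchanged
    logits[0] += 1.0
    print(softmax(logits))   # roughly [0.82, 0.11, 0.07]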

Q & A

  • What is the main limitation of the reasoning process in large language models (LLMs)?

    -The main limitation is that the amount of computation spent per token produced is constant, so the system cannot devote more resources to a complex or even unanswerable question than it does to a simple one.

  • How does human reasoning differ from the reasoning process in LLMs?

    -Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, while LLMs do not adjust the amount of computation based on the complexity of the question.

  • What is the significance of a well-constructed world model in developing reasoning and planning abilities for dialogue systems?

    -A well-constructed world model allows for the development of mechanisms like persistent long-term memory and more advanced reasoning. It helps the system to plan and optimize its responses before producing them, leading to more efficient and accurate outputs.

  • How does the proposed blueprint for future dialogue systems differ from autoregressive LLMs?

    -The proposed blueprint involves non-autoregressive processes where the system thinks about and plans its answer using an abstract representation of thought before converting it into text, leading to more efficient and deliberate responses.

  • What is the role of an energy-based model in this context?

    -An energy-based model is used to measure the compatibility of a proposed answer with a given prompt. It outputs a scalar value that indicates the 'goodness' of the answer, which can be optimized to produce better responses.

  • How is the representation of an answer optimized in the abstract space?

    -The optimization process involves iteratively refining the abstract representation of the answer to minimize the output of the energy-based model, leading to a more accurate and well-thought-out response.

  • What are the two main methods for training an energy-based model?

    -The two main methods are contrastive methods, where the system is explicitly shown incompatible pairs and its weights are adjusted to push their energy up, and non-contrastive methods, which instead use a regularizer that limits the volume of space that can take low energy, so that the energy rises automatically everywhere outside the training pairs.

  • How does the concept of system one and system two in human psychology relate to the capabilities of LLMs?

    -System one corresponds to tasks that can be accomplished without deliberate thought, similar to the instinctive responses of LLMs. System two involves tasks that require planning and deep thinking, which is what LLMs currently lack and need to develop for more advanced reasoning and problem-solving.

  • What is the main inefficiency in the current method of generating hypotheses in LLMs?

    -The main inefficiency is that LLMs have to generate and evaluate a large number of possible sequences of tokens, which is a wasteful use of computation compared to optimizing in a continuous, differentiable space.

  • How can the energy function be trained to distinguish between good and bad answers?

    -The energy function can be trained by showing it pairs of compatible and incompatible inputs and answers, adjusting the neural network weights to produce lower energy for good answers and higher energy for bad ones, using techniques like contrastive methods and regularizers.

  • What is an example of how energy-based models are used in visual data processing?

    -In visual data processing, the energy of the system is the prediction error between the representation predicted from a corrupted (masked, shifted, or transformed) version of an image or video and the representation of the original, uncorrupted version. This yields a compressed yet accurate representation of visual reality.

Outlines

00:00

🤖 Primitive Reasoning in LLMs

This paragraph discusses the limitations of reasoning in large language models (LLMs) due to the constant amount of computation spent per token produced. It highlights that regardless of the complexity of the question, the system devotes a fixed computational effort to generating an answer. The speaker contrasts this with human reasoning, which involves more time and iterative processes for complex problems. The paragraph suggests that future advancements may include building upon the low-level world model with mechanisms like persistent long-term memory and reasoning, which are essential for more advanced dialogue systems.
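
To make the "constant computation per token" point concrete, here is a rough back-of-the-envelope sketch (my illustration, not from the talk; the layer count and cost unit are placeholders): with a fixed-depth network, the total work an autoregressive LLM can devote to an answer grows only with the number of tokens it emits, never with the difficulty of the question.

    # rough illustration: compute devoted to an answer scales only with its length
    NUM_LAYERS = 92          # hypothetical depth of the prediction network
    COST_PER_LAYER = 1.0     # arbitrary units of work per layer per token

    def compute_budget(num_answer_tokens: int) -> float:
        # every token gets the same fixed forward pass, regardless of the question
        return NUM_LAYERS * COST_PER_LAYER * num_answer_tokens

    print(compute_budget(10))   # a trivial question answered in 10 tokens
    print(compute_budget(10))   # an undecidable question "answered" in 10 tokens: same budget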

05:00

🌟 The Future of Dialog Systems: Energy-Based Models

The speaker envisions the future of dialog systems as energy-based models that measure the quality of an answer for a given prompt. These models would operate on a scalar output, with a low value indicating a good answer and a high value indicating a poor one. The process involves optimization in an abstract representation space rather than searching through possible text strings. The speaker describes a system where an abstract thought is optimized and then fed into an auto-regressive decoder to produce text. This approach allows for more efficient computation and planning of responses, differing from the auto-regressive language models currently in use.
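
Below is a minimal PyTorch-style sketch of that blueprint, under my own assumptions about module names and sizes (this is not LeCun's implementation): an encoder maps the prompt to a representation, a predictor proposes an abstract answer representation, gradient descent refines that representation to minimize a scalar energy, and only afterwards would an autoregressive decoder turn it into text.

    import torch
    import torch.nn as nn

    DIM = 64  # size of the abstract representation space (made up)

    encoder   = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))   # prompt -> representation
    predictor = nn.Linear(DIM, DIM)                                                   # initial guess of the answer representation
    energy    = nn.Sequential(nn.Linear(2 * DIM, DIM), nn.ReLU(), nn.Linear(DIM, 1))  # scalar: low means good answer

    def plan_answer(prompt_repr, steps=50, lr=0.1):
        # inference by optimization: refine the abstract answer representation
        # to minimize the scalar energy before any text is produced
        x = encoder(prompt_repr).detach()
        z = predictor(x).detach().requires_grad_(True)    # latent answer representation
        opt = torch.optim.SGD([z], lr=lr)
        for _ in range(steps):
            e = energy(torch.cat([x, z], dim=-1)).sum()   # "badness" of the current plan
            opt.zero_grad()
            e.backward()                                  # gradient-based inference, not training
            opt.step()
        return z.detach()  # this would be fed to an autoregressive decoder to produce text

    z_star = plan_answer(torch.randn(1, DIM))             # a random vector stands in for an encoded prompt

The decoder step is omitted: the point is that the iterative, expensive work happens in a continuous representation space, where harder prompts can simply be given more optimization steps.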

10:03

📈 Training Energy-Based Models and Conceptual Understanding

This paragraph delves into the conceptual framework of training energy-based models, which assess the compatibility between a prompt and a proposed answer. The speaker explains that these models are trained on pairs of compatible inputs and outputs, using a neural network to produce a scalar output that indicates compatibility. To prevent the collapse in which the model outputs zero energy for every input, either contrastive methods (explicitly pushing up the energy of incompatible pairs) or non-contrastive methods (a regularizer that limits the volume of space that can take low energy) are used. The speaker also discusses the importance of an abstract representation of ideas, rather than direct language input, for effective training and reasoning in these models.
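
Here is a minimal sketch of the contrastive training signal with made-up data and a toy energy network (my illustration, not the exact losses discussed in the talk): energy is pushed toward zero on compatible (X, Y) pairs and above a margin on mismatched pairs. A non-contrastive method would drop the explicit bad pairs and instead add a regularizer that limits how much of the space can receive low energy; its exact form depends on the architecture, so it is not shown here.

    import torch
    import torch.nn as nn

    # toy energy network E(x, y): concatenate x and y, output one scalar per pair
    energy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

    def contrastive_loss(x, y_good, y_bad, margin=1.0):
        # push energy toward zero on compatible pairs, above a margin on mismatched ones
        e_good = energy(torch.cat([x, y_good], dim=-1))
        e_bad  = energy(torch.cat([x, y_bad],  dim=-1))
        return e_good.pow(2).mean() + torch.relu(margin - e_bad).mean()

    # made-up data: "good" answers are noisy copies of x, "bad" answers are shuffled good ones
    x      = torch.randn(32, 4)
    y_good = x + 0.1 * torch.randn(32, 4)
    y_bad  = y_good[torch.randperm(32)]

    opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
    for _ in range(200):
        loss = contrastive_loss(x, y_good, y_bad)
        opt.zero_grad(); loss.backward(); opt.step()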

15:06

🖼️ Visual Data and Energy Function in JEPA Architectures

The final paragraph explores the application of energy functions in joint embedding predictive architectures (JEPA) for visual data. The energy of the system is defined as the prediction error between the representation predicted from a corrupted input and the representation of the original, uncorrupted input. This method provides a compressed representation of visual reality that works well for classification tasks. The speaker contrasts this direct compatibility measure with the indirect probability adjustments in language models, where increasing the probability of the correct word necessarily decreases the probability of incorrect words.
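
A minimal sketch of that idea with invented module names and shapes (this is not the actual I-JEPA or V-JEPA code): both the original and a corrupted (here, masked) view are encoded, a predictor tries to recover the representation of the original from the corrupted view, and the energy is the prediction error between the two representations. A real JEPA also needs a mechanism to prevent representational collapse, such as a separate target encoder or a regularizer, which is omitted here.

    import torch
    import torch.nn as nn

    DIM = 128
    encoder   = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
    predictor = nn.Linear(DIM, DIM)   # predicts the clean view's representation from the corrupted one

    def energy(clean, corrupted):
        # low energy = the corrupted view lets us predict the clean view's representation well
        target = encoder(clean)
        pred   = predictor(encoder(corrupted))
        return (pred - target).pow(2).mean()

    image  = torch.rand(1, 16, 16)   # toy 16x16 "image"
    masked = image.clone()
    masked[:, :, 8:] = 0.0           # corrupt it by masking the right half

    print(energy(image, masked).item())                 # should be low after training
    print(energy(image, torch.rand(1, 16, 16)).item())  # an unrelated image should score high after training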

Keywords

💡reasoning

Reasoning in the context of the video refers to the cognitive process of drawing inferences, conclusions, or solving problems based on available information. It is a fundamental aspect of human intelligence and decision-making. The video discusses the limitations of current large language models (LLMs) in performing complex reasoning tasks, as they allocate a constant amount of computation per token produced, which does not align with human reasoning that adapts to the complexity of the question at hand.

💡computation

Computation refers to the process of performing mathematical calculations or solving problems using a computer or an algorithm. In the video, it is mentioned that the amount of computation spent per token in LLMs is constant, which limits their ability to handle questions of varying complexity. This contrasts with human reasoning, where more complex problems typically receive more computational effort and time for resolution.

💡token

A token in the context of the video represents a discrete unit of text, such as a word or a character, that is processed by language models. The video highlights that the current LLMs allocate a fixed amount of computation for each token produced, which does not allow for the dynamic adjustment needed for complex problem-solving or reasoning.

💡prediction network

A prediction network in the video refers to the underlying mechanism of an LLM that predicts the next token in a sequence based on the previous tokens. The network's structure, such as having 36 or 92 layers, is multiplied by the number of tokens to determine the amount of computation devoted to generating an answer. This method of operation is contrasted with human reasoning, which is adaptable and can allocate more effort to more complex problems.

💡hierarchical element

The hierarchical element mentioned in the video refers to the multi-layered and structured approach humans use when reasoning, where different levels of abstraction and complexity are considered. This is in contrast to the flat structure of computation in current LLMs, which do not adjust the computational resources based on the complexity of the question being asked.

💡persistent long-term memory

Persistent long-term memory is a concept discussed in the video that refers to the ability of a system to retain and access information over extended periods, which is essential for complex reasoning and planning. Unlike current LLMs, which have limitations in handling long-term dependencies, a system with persistent long-term memory would be better equipped to handle complex tasks that require remembering and integrating information from the past.

💡inference of latent variables

Inference of latent variables is a process in which a system deduces the values of unobserved variables from the observed data. In the context of the video, this refers to a future approach where dialogue systems may use this method to plan and optimize their responses in an abstract representation space before converting them into text. This is contrasted with the auto-regressive prediction of tokens used by current LLMs.

💡energy-based model

An energy-based model, as described in the video, is a type of machine learning model that assigns a scalar value (energy) to input data, indicating the compatibility or goodness of the data. The model is trained to produce low energy values for correct or compatible inputs and high energy values for incorrect or incompatible ones. This concept is proposed as a potential method for future dialogue systems to evaluate and optimize their answers before generating a response.

💡optimization

Optimization in the video refers to the process of adjusting a system or model to achieve the best possible outcome or performance. In the context of dialogue systems, optimization involves fine-tuning the abstract representation of an answer to ensure it is a good response to a given prompt. The video suggests that future systems will use optimization techniques in continuous spaces, which is more efficient than the current method of generating and selecting from many discrete text sequences.
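
To illustrate the contrast drawn here, a small sketch with a made-up scoring function (my own, not from the talk): the discrete route generates many complete candidate sequences, scores each one, and keeps the best, which in a real system would mean one full LLM generation per candidate. The alternative described above instead refines a single continuous representation by gradient descent, as in the planning sketch earlier.

    import random

    def energy_of(answer: str) -> float:
        # stand-in for a learned energy/reward model: lower means better (made up)
        return float(abs(len(answer) - 20))

    def best_of_n(candidates):
        # the "wasteful" discrete route: score every full candidate, keep the best
        return min(candidates, key=energy_of)

    alphabet = "abcdefgh "
    candidates = ["".join(random.choice(alphabet) for _ in range(random.randint(5, 40)))
                  for _ in range(100)]
    print(best_of_n(candidates))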

💡latent variables

Latent variables are factors or features that are not directly observed but can be inferred from the patterns in the data. In the video, latent variables are discussed as part of the potential future of dialogue systems, where the system would manipulate these unobserved variables in an abstract representation space to minimize the output of an energy function, thereby optimizing the quality of the response.

💡system one and system two

System one and system two are psychological concepts from the video that refer to two different modes of human thinking and problem-solving. System one is automatic and subconscious, allowing for tasks to be performed without conscious thought, while system two involves deliberate, conscious thought and planning for complex tasks. The video suggests that future dialogue systems may incorporate a similar distinction, with system one being the auto-regressive LMs and system two being more advanced reasoning and planning mechanisms.

Highlights

The reasoning in large language models (LLMs) is considered primitive due to the constant amount of computation spent per token produced.

The computation in LLMs does not adjust based on the complexity of the question, whether it's simple, complicated, or impossible to answer.

Human reasoning involves spending more time on complex problems, with an iterative and hierarchical approach, unlike the constant computation model of LLMs.

There is potential for building mechanisms like persistent long-term memory and reasoning on top of the low-level world model provided by language.

Future dialogue systems may involve planning and optimization before producing an answer, which is different from the current auto-regressive LMs.

The concept of system one and system two in humans is introduced, with system one being tasks accomplished without deliberate thought and system two requiring planning and thought.

LLMs currently lack the ability to use an internal world model for deliberate planning and thought, unlike human system two tasks.

The future of dialogue systems may involve non-auto-regressive prediction and optimization of latent variables in abstract representation spaces.

The idea of an energy-based model is introduced, where the model output is a scalar number representing the quality of an answer for a given prompt.

Optimization processes in continuous spaces are suggested to be more efficient than generating and selecting from many discrete sequences of tokens.

The concept of training an energy-based model with compatible and incompatible pairs of inputs and outputs is discussed.

Contrastive methods and non-contrastive methods are explained as approaches to train energy-based models with different sample requirements.

The importance of an abstract representation of ideas is emphasized for efficient reasoning and planning in dialogue systems.

The indirect method of training LLMs through probability distribution over tokens is highlighted, including its limitations.

The potential application of energy-based models in visual data processing is mentioned, using joint embedding architectures.

The energy function's role in determining the compatibility between inputs and outputs is discussed, with the goal of producing a compressed representation of reality.

Transcripts

00:03

The type of reasoning that takes place in LLMs is very, very primitive, and the reason you can tell it's primitive is that the amount of computation spent per token produced is constant. So if you ask a question and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it. So essentially it doesn't matter whether the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something: the amount of computation the system will be able to devote to the answer is constant, or proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it, because it's more difficult. There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so on.

01:29

Does this mean that it's a fundamental flaw of LLMs, or does it mean that there's more to that question? Now you're just behaving like an LLM: immediately answering. No, it's just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said, persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model?

02:09

Okay, whether it's difficult or not, the near future will say, because a lot of people are working on reasoning and planning abilities for dialogue systems. Even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked with the language you're going to use to produce the answer, this idea of a mental model that allows you to plan what you're going to say before you say it, that is very important. I think a lot of systems over the next few years are going to have this capability, but the blueprint of those systems will be extremely different from autoregressive LLMs.

02:57

It's the same difference as the difference between what psychologists call system one and system two in humans. System one is the type of task that you can accomplish without deliberately, consciously thinking about how you do it. You just do it; you've done it enough that you can do it subconsciously, without thinking about it. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio. If you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either; you just recognize the pattern and you play. That's system one. So, all the things that you do instinctively without really having to deliberately plan and think about them. And then there are all the tasks where you need to plan. If you are a not-too-experienced chess player, or you are experienced but playing against another experienced chess player, you think about all kinds of options. You think about it for a while, and you're much better if you have time to think about it than if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model, that's system two. This is what LLMs currently cannot do.

04:16

So how do we get them to do this? How do we build a system that can do this kind of planning or reasoning, that devotes more resources to complex problems than to simple problems? It's not going to be autoregressive prediction of tokens; it's going to be more something akin to inference of latent variables in what used to be called probabilistic models or graphical models and things of that type. Basically the principle is this: the prompt is like observed variables, and what the model does is basically measure to what extent an answer is a good answer for a prompt. So think of it as some gigantic neural net, but it's got only one output, and that output is a scalar number which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers. The way you would do it is, you know, produce the prompt and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model. But that energy-based model would need the model constructed by the LLM.

05:49

Well, so really what you would need to do is not search over possible strings of text that minimize that energy; you would do this in abstract representation space. So in sort of the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar. It's an optimization process. So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here, not training; the system has been trained already. So now we have an abstract representation of the thought, of the answer. We feed that to basically an autoregressive decoder, which can be very simple, that turns this into text that expresses this thought. So that, in my opinion, is the blueprint of future dialogue systems: they will think about their answer, plan their answer by optimization, before turning it into text. And that is Turing complete.

07:00

Can you explain exactly what the optimization problem there is? What's the objective function? Just linger on it; you kind of briefly described it, but over what space are you optimizing? The space of representations. Abstract representations. So you have an abstract representation inside the system. You have a prompt. The prompt goes through an encoder, produces a representation, and perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. Now, we sort of ignore for a moment the issue of how you train that system to measure whether an answer is a good answer, but suppose such a system could be created. What's the process, this kind of search-like process? It's an optimization process. You can do this if the entire system is differentiable: that scalar output is the result of running the representation of the answer through some neural net, and by backpropagating gradients you can figure out how to modify the representation of the answer so as to minimize that. So that's still gradient-based; it's gradient-based inference. Now you have a representation of the answer in abstract space, and you can turn it into text. The cool thing about this is that the representation can now be optimized through gradient descent, but it's also independent of the language in which you're going to express the answer. So you're operating in the abstract representation. This goes back to the joint embedding: that it is better to work in, to romanticize the notion, the space of concepts rather than the space of concrete sensory information.

09:18

Okay, but can this do something like reasoning, which is what we're talking about? Well, not really, only in a very simple way. Basically, you can think of those things as doing the kind of optimization I was talking about, except they optimize in the discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way, which is: generate a lot of hypotheses and then select the best ones. That's incredibly wasteful in terms of computation, because you basically have to run your LLM for every possible generated sequence. So it's much better to do an optimization in continuous space, where you can do gradient descent: instead of generating tons of things and then selecting the best, you just iteratively refine your answer to go towards the best one. That's much more efficient, but you can only do this in continuous spaces with differentiable functions.

10:19

You're talking about the reasoning, the ability to think deeply, to reason deeply. How do you know what is an answer that's better or worse, based on deep reasoning? Right, so then we're asking the question of, conceptually, how do you train an energy-based model. An energy-based model is a function with a scalar output, just a number. You give it two inputs, X and Y, and it tells you whether Y is compatible with X or not. X you observe; let's say it's a prompt, an image, a video, whatever. And Y is a proposal for an answer, a continuation of the video, whatever. It tells you whether Y is compatible with X, and the way it tells you that Y is compatible with X is that the output of the function will be zero if Y is compatible with X, and a positive, non-zero number if Y is not compatible with X. How do you train a system like this? At a completely general level, you show it pairs of X and Y that are compatible, a question and the corresponding answer, and you train the parameters of the big neural net inside to produce zero. Now, that doesn't completely work, because the system might decide, well, I'm just going to say zero for everything. So you have to have a process to make sure that for a wrong Y, the energy will be larger than zero. And there you have two options. One is contrastive methods. A contrastive method is: you show an X and a bad Y, and you tell the system to give a high energy to this, to push up the energy, to change the weights in the neural net that computes the energy so that it goes up. The problem with this is that if the space of Y is large, the number of such contrastive samples you're going to have to show is gigantic. But people do this; they do this when you train a system with RLHF. Basically, what you're training is what's called a reward model, which is an objective function that tells you whether an answer is good or bad, and that's basically exactly what this is. So we already do this to some extent; we're just not using it for inference, we're just using it for training. Then there is another set of methods which are non-contrastive, and I prefer those. Those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of X and Y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? The way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. There are all kinds of specific ways to do this, depending on the architecture, but that's the basic principle: if you push down the energy function for particular regions in the X-Y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.

13:48

We've been talking very generally, but what is a good X and a good Y? What is a good representation of X and Y? Because we've been talking about language, and if you just take language directly, that presumably is not good; there has to be some kind of abstract representation of ideas. Yeah, so you can do this with language directly, by just saying X is a text and Y is the continuation of that text, or X is a question and Y is the answer. But you're saying that's not going to cut it; that's going to do what LLMs are doing. Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside the system there is a latent variable, call it Z, that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer. So this kind of system could be trained in a very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. Currently, this is very implicit in LLMs; it's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word, you automatically give low probability to other words, because you only have a finite amount of probability to go around; it has to sum to one. So when you minimize the cross-entropy, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Indirectly, that gives a high probability to sequences of words that are good and a low probability to sequences of words that are bad, but it's very indirect, and it's not obvious why this actually works at all, because you're not doing it on the joint probability of all the symbols in a sequence; you sort of factorize that probability in terms of conditional probabilities over successive tokens.

16:13

So how do you do this for visual data? We've been doing this with JEPA architectures, basically the Joint Embedding Predictive Architecture. There, the compatibility between two things is: here is an image or a video, and here is a corrupted, shifted, or transformed version of that image or video, or a masked version. The energy of the system is the prediction error of the representation: the predicted representation of the good thing versus the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you, if this is a good image and this is a corrupted version, it will give you zero energy if one of them is effectively a corrupted version of the other, and a high energy if the two images are completely different. And hopefully that whole process gives you a really nice compressed representation of reality, of visual reality. And we know it does, because we then use those representations as input to a classification system, and that classification system works really nicely.


Related Tags
AI Reasoning, Dialogue Systems, Computational Models, Cognitive Hierarchy, Deep Learning, Optimization Processes, Abstract Representation, Inference Mechanisms, Neural Networks, Future Technology