Compression for AGI - Jack Rae | Stanford MLSys #76

Stanford MLSys Seminars
27 Feb 2023 · 59:53

Summary

TL;DR: In episode 76 of the Stanford MLSys seminar series, the focus is on the intersection of compression and AGI (Artificial General Intelligence), featuring guest speaker Jack Rae from OpenAI. The talk examines foundation models and their significant role in shaping the future of machine learning, emphasizing the importance of understanding their training objectives, limitations, and potential. Rae presents a detailed exploration of compression as a key to unlocking AGI, discussing generative models as lossless compressors and highlighting the concept of minimum description length. Through this discussion, the seminar sheds light on the dynamics of foundation models, urging the audience to think deeply about their applications and the broader implications for AI research.

Takeaways

  • 😃 Large language models like GPT-3 are state-of-the-art lossless compressors, able to compress data at rates better than traditional algorithms like gzip.
  • 🤔 The minimum description length principle, which aims to find the smallest possible representation of data, has deep philosophical roots and may be key to achieving artificial general intelligence (AGI).
  • 🧐 Training large language models is essentially a process of lossless compression, where the objective is to minimize the number of bits required to encode the training data.
  • 💡 Scaling up model size and training data can lead to better compression and potentially improved generalization, but algorithmic advances beyond just scaling are also important.
  • ⚠️ While compression is a rigorous objective, evaluating models solely on compression metrics may be uninformative, and tracking emergent capabilities is crucial.
  • 🔍 Arithmetic encoding provides a way to losslessly compress data using a language model's predictions, though the process is computationally expensive.
  • ✨ Architectures that can adaptively allocate compute based on input complexity, like the S4 model, may be important for efficiently compressing multi-modal data like images and audio.
  • 🚧 Lossy compression, while related, is distinct from the lossless compression objective and may not lead to better generalization.
  • 🔑 The description length of a model itself (e.g., the code to instantiate it) is typically small compared to the compressed data size, regardless of model scale.
  • 🌱 Future breakthroughs in areas like data efficiency, adaptive compute, and new architectures could lead to further paradigm shifts in compression and generalization capabilities.

Q & A

  • What is the main topic of the talk?

    -The main topic of the talk is compression for artificial general intelligence (AGI), and how techniques like lossless compression using large language models can potentially help in solving perception and generalization problems.

  • Why is the minimum description length principle important according to the speaker?

    -The speaker argues that seeking the minimum description length of data may be an important principle for solving perception and generalizing well; it has philosophical roots going back to Aristotle and William of Ockham, and a rigorous mathematical foundation in Solomonoff's theory of inductive inference.

  • How are large language models related to lossless compression?

    -The speaker explains that large language models are actually state-of-the-art lossless compressors, as training them involves minimizing the negative log-likelihood over the training data, which is equivalent to lossless compression.

  • Can you explain the example of Satya and Sundar used to illustrate lossless compression?

    -The example involves Satya encoding a dataset using a trained language model and arithmetic coding, and sending the encoded transcripts and model code to Sundar. Sundar can then reconstruct the original dataset by running the code and using arithmetic decoding with the predicted token probabilities.

  • What is the potential recipe for solving perception and moving towards AGI according to the speaker?

    -The recipe is to first collect all useful perceptual information, and then learn to compress it as best as possible with a powerful foundation model, through techniques like scaling data and compute, or algorithmic advances.

  • What is the main limitation of the compression approach mentioned by the speaker?

    -One limitation is that modeling and compressing everything at a low level (e.g., pixels for images) may be computationally expensive and inefficient, so some form of filtering or semantic understanding may be needed first.

  • How does the speaker view the role of reinforcement learning in relation to compression?

    -The speaker notes that while compression is important for observable data, reinforcement learning and on-policy behavior are still crucial for gathering useful information that may not be directly observable.

  • What is the speaker's opinion on the Hutter Prize for lossless compression?

    -The speaker believes that while the Hutter Prize aims to promote compression, it has not been fruitful because it focuses on compressing a small, fixed amount of data, underestimating the benefits of scaling data and compute.

  • How does the compression perspective inform the development of new architectures?

    -The speaker suggests that the compression perspective could inspire research into architectures that can adapt their compute and attention based on the information content of the input, similar to how biological systems allocate resources non-uniformly.

  • What is the speaker's overall view on the importance of compression research?

    -The speaker believes that while the compression objective provides a rigorous foundation for generalization, the primary focus should be on evaluating and tracking the emergence of new capabilities in models, as those are ultimately what people care about.

Outlines

00:00

🎉 Introduction to Compression and AGI Seminar

The Stanford MLSys seminar series introduces a talk by Jack Rae from OpenAI, focusing on compression for Artificial General Intelligence (AGI). The seminar highlights the partnership with CS324 on advances in Foundation Models. Participants are encouraged to engage and ask questions via YouTube chat or Discord. The session promises insightful discussions on the training objectives of foundation models, their limitations, and the significance of compression in the context of AGI.

05:01

📊 Foundation Models and Minimum Description Length

This section delves into the concept of minimum description length (MDL) and its relevance to understanding and improving foundation models. Jack Rae discusses the historical and philosophical underpinnings of seeking the minimum description length for data compression and generalization, referencing Solomonoff's theory of inductive inference. The segment also explores generative models as lossless compressors, highlighting how large language models, despite their size, achieve state-of-the-art lossless data compression.

10:01

🔍 Exploring Lossless Compression with Large Language Models

Jack Rae elucidates the mechanics of lossless compression in large language models through a detailed example involving LLaMA models. He demonstrates that larger models, such as the 65 billion parameter version, achieve better compression, suggesting superior generalization capabilities. The talk emphasizes the counterintuitive nature of large language models being efficient lossless compressors and explains the mathematical basis for evaluating the compression efficiency of these models.

15:01

🌐 Arithmetic Encoding and Model Training

The seminar continues with an in-depth discussion of arithmetic encoding as a method for data compression. Through a hypothetical scenario involving two individuals, Satya and Sundar, Rae illustrates how arithmetic encoding and decoding work in tandem with a generative model to achieve lossless compression of a dataset. The process underscores that compression efficiency depends not on the size of the neural network but on the model's ability to predict next tokens accurately.

20:02

📈 Towards AGI: The Importance of Compression

Jack Rae outlines a two-step approach towards achieving AGI: collecting useful perceptual information and compressing it efficiently using powerful foundation models. He argues that any research method improving compression can advance capabilities towards better perception, supporting the idea with examples of how lossless compression aids in understanding and generalization. Rae also addresses common confusions regarding lossy vs. lossless compression and their implications for neural networks.

25:03

🚀 The Future of Compression in AI Research

In the final part of the seminar, Rae explores potential limitations and future directions for compression in AI research. He touches on practical challenges, such as the computational expense of pixel-level image modeling, and the need for novel architectures that adapt to the informational content of inputs. The discussion concludes with reflections on the integral role of compression in driving advancements in AI and the continuous pursuit of algorithmic improvements alongside computational scaling.

Keywords

💡Compression

Compression refers to the process of encoding information using fewer bits or a smaller representation. In the context of this video, compression is presented as a key objective for training large language models and foundation models. The speaker argues that as these models improve, they are essentially providing better lossless compression of the training data, which should in turn enable better generalization to new data. Compression is viewed as a principled approach to achieving artificial general intelligence (AGI) by learning to compress all useful perceptual information.

💡Minimum Description Length

The minimum description length (MDL) principle states that the best model or representation for a given dataset is the one that leads to the most compressed or succinct description of the data. The speaker roots the compression objective in philosophical principles dating back to Aristotle, as well as more recent work by scholars like Solomonoff and Rissanen. MDL is presented as a rigorous foundation for why compression should help create better generalizing agents and improve perception.

💡Generative Models

Generative models are machine learning models that aim to capture the underlying probability distribution of the training data, allowing them to generate new samples that plausibly belong to the same distribution. In this talk, the speaker argues that generative models like large language models are actually state-of-the-art lossless compressors. By accurately modeling the probability distribution of the data, these models can effectively compress the data through techniques like arithmetic coding.

💡Arithmetic Coding

Arithmetic coding is a technique for lossless data compression. It encodes data by representing it as a small numeric value within a fixed interval, where more probable data is assigned a smaller interval. The speaker uses arithmetic coding as a thought experiment to demonstrate how a generative model can be used to losslessly compress and decompress data, with the model providing the probability estimates needed for coding and decoding. This illustrates how language models are optimizing for compression.
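The interval-narrowing mechanics described above can be sketched with a toy coder. This is a minimal illustration, not a production implementation: it uses a fixed three-symbol distribution as a stand-in for a model's per-step predictions, and plain floating point, which only survives short sequences (real coders emit bits incrementally with integer renormalization and condition the distribution on context).

```python
# Toy arithmetic coder: encode a symbol sequence as a single number in
# [0, 1) by repeatedly narrowing an interval in proportion to each
# symbol's probability, then decode by replaying the same narrowing.
PROBS = {"a": 0.5, "b": 0.3, "c": 0.2}  # fixed toy "model"

def _cum(symbol):
    """Cumulative probability interval [lo, hi) assigned to a symbol."""
    lo = 0.0
    for s, p in PROBS.items():
        if s == symbol:
            return lo, lo + p
        lo += p
    raise KeyError(symbol)

def encode(symbols):
    low, high = 0.0, 1.0
    for s in symbols:
        c_lo, c_hi = _cum(s)
        span = high - low
        low, high = low + span * c_lo, low + span * c_hi
    return (low + high) / 2  # any number inside the final interval works

def decode(x, n):
    out = []
    low, high = 0.0, 1.0
    for _ in range(n):
        span = high - low
        target = (x - low) / span  # where x falls in the current interval
        for s in PROBS:
            c_lo, c_hi = _cum(s)
            if c_lo <= target < c_hi:
                out.append(s)
                low, high = low + span * c_lo, low + span * c_hi
                break
    return "".join(out)

msg = "abacba"
assert decode(encode(msg), len(msg)) == msg  # lossless round trip
```

More probable symbols shrink the interval less, so they cost fewer digits (bits) of the final number, which is exactly the "smaller interval for more probable data" behavior the paragraph describes.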

💡Scaling

Scaling refers to the practice of increasing the size or capacity of machine learning models, typically by adding more parameters or increasing the amount of training data and computational resources. The speaker acknowledges that scaling compute, data, and model size has been a major driver of recent progress in large language models and their compression capabilities. However, he also notes that algorithmic advances beyond simply scaling will likely be needed for further paradigm shifts in compression and capabilities.

💡Foundation Models

Foundation models are powerful neural networks, often large language models or other generative models, trained on vast amounts of data to learn generalizable representations and capabilities. The talk focuses on the role of these models as state-of-the-art compressors and their potential for advancing artificial general intelligence (AGI) by learning to compress all useful perceptual information. Examples discussed include models like GPT-3, LLaMA, and BERT.

💡AGI (Artificial General Intelligence)

Artificial general intelligence (AGI) refers to the development of artificial systems with general intelligence comparable to humans, capable of reasoning, learning, and adapting across a wide range of cognitive tasks. The speaker presents compression as a principled approach for working towards AGI, by collecting useful perceptual information and training powerful foundation models to compress and generalize from this data. AGI is portrayed as the long-term goal that compression research could help achieve.

💡Perception

Perception refers to the ability to acquire, interpret, and understand sensory information and stimuli from the environment. In the context of this talk, the speaker discusses the role of compression in solving perception, which is viewed as a key challenge on the path towards artificial general intelligence (AGI). By learning to compress all useful perceptual information, such as text, images, and audio, foundation models could develop a better understanding and generalization of the world.

💡Lossless Compression

Lossless compression refers to data compression techniques that allow the original data to be perfectly reconstructed from the compressed representation. The speaker argues that generative models like large language models are actually performing lossless compression on their training data, in contrast to lossy compression techniques that discard some information. This lossless compression property is presented as a key feature that enables these models to effectively generalize and understand their training data.

💡Retrieval

Retrieval refers to the ability of a model to access and utilize information from its training data or external sources during inference or generation. The speaker notes that unconstrained retrieval over future data not yet seen during training would be considered cheating from a compression perspective, as it would allow the model to achieve perfect performance without actually compressing and generalizing from the data. Appropriate use of retrieval is discussed as a potential enhancement for compression, but only if done in a principled way.

Highlights

Compression has been an objective we are implicitly striving towards as we build better and larger models, which may be counterintuitive given that the models themselves can be very large.

Generative models are actually lossless compressors, and specifically large language models are state-of-the-art lossless compressors, which may be a counterintuitive point to many people.

Solomonoff's theory of inductive inference states that if you have a universe of data generated by an algorithm, and observations of that universe encoded as a dataset, they are best predicted by the smallest executable archive of that dataset, known as the minimum description length.

The size of the lossless compression of a data set can be characterized as the negative log likelihood from a generative model evaluated over the data set, plus the description length of the generative model.
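Plugging rough numbers into this formula shows why the model-description term barely matters at scale. Every figure below is an illustrative assumption (token count, mean loss, bytes per token, code size), not the actual LLaMA statistics:

```python
import math

def description_length_bits(tokens, avg_loss_nats, code_bytes):
    """Total bits = summed training loss (nats converted to bits)
    plus the size of the code that instantiates the model."""
    return tokens * avg_loss_nats / math.log(2) + code_bytes * 8

# Illustrative assumptions only -- not the real LLaMA figures.
tokens = 1.4e12        # tokens seen in a single epoch
avg_loss = 1.8         # mean next-token loss in nats
total = description_length_bits(tokens, avg_loss, 1_000_000)  # ~1 MB of code

raw_bits = tokens * 4 * 8  # assume a token stands for ~4 bytes of raw text
print(f"ratio ≈ {raw_bits / total:.1f}x")  # order-of-magnitude compression
```

With numbers in this regime the negative log-likelihood term is on the order of 10^12 bits, so the ~1 MB of code is negligible, which is why model scale does not meaningfully change the description-length term.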

Generative models like large language models are state-of-the-art lossless compressors, able to compress datasets like the one used to train the 65B parameter LLaMA model by 14x compared to the original data size.

Arithmetic encoding allows mapping a token to a compressed transcript using exactly -log2(p) bits, where p is the model's predicted probability for that token. Arithmetic decoding can recover the original token from the transcript if the probability distribution is known.
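As a sketch of how those -log2(p) costs accumulate over a sequence: `toy_model` below is a hypothetical stand-in for a trained language model, and rather than running a real arithmetic coder we simply sum the ideal code length each token would cost under one.

```python
import math

def toy_model(context):
    """Hypothetical next-token distribution; a stand-in for an LLM."""
    if context and context[-1] == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.7, "b": 0.3}

def transcript_bits(tokens):
    """Bits the encoded transcript would occupy under ideal coding:
    each token costs -log2 p(token | context)."""
    bits, context = 0.0, []
    for t in tokens:
        p = toy_model(context)[t]
        bits += -math.log2(p)  # arithmetic coding approaches this cost
        context.append(t)
    return bits

print(transcript_bits(list("aab")))  # well-predicted tokens cost little
```

Tokens the model predicts confidently (p near 1) cost almost nothing, while surprising tokens cost many bits, which is why a better next-token predictor yields a shorter transcript.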

Larger models trained for more compute steps tend to achieve better compression, explaining their superior generalization performance despite increased model size.

Retrieval-augmented language models that can look ahead at future tokens would be "cheating" from a compression standpoint and may fool performance metrics without true generalization gains.

Model architectures that can dynamically allocate compute based on information content, similar to how human perception works, could improve the inefficiency of current models that spend uniform compute on all inputs.

Pixel-level image and video modeling is very compute-intensive with current architectures but may be viable with architectures that can gracefully process inputs at the appropriate "thinking frequency".

The Hutter prize's small 100MB data limit failed to incentivize meaningful compression research, while the transition to large language models provided a bigger boost.

While compression is a rigorous objective, model capabilities that people fundamentally care about should be continually evaluated alongside compression metrics.

Training for multiple epochs may be justified from a compression perspective if treated as a form of replay, where only predictions on held-out data are scored.

S4 and other architectures that enable longer context lengths and adaptive computation could help model different modalities like audio and images more efficiently.

The pace of innovation in foundation models and their applications is incredibly rapid, with amazing developments expected weekly or bi-weekly in 2023.

Transcripts

00:02

hello everyone and welcome to episode 76

00:06

of the Stanford MLSys seminar series

00:08

um today of course we're or this year

00:10

we're very excited to be partnered with

00:12

cs324 advances in Foundation models

00:15

um today I'm joined by Michael say hi

00:19

and ivonica

00:21

um and today our guest is Jack Ray from

00:24

openai and he's got a very exciting talk

00:26

uh prep for us about compression and AGI

00:30

um so so we're very excited to listen to

00:32

him as always if if you have questions

00:34

you can post them in YouTube chat or if

00:36

you're in the class there's that Discord

00:37

Channel

00:38

um so so to keep the questions coming

00:40

and after his talk we will we'll have a

00:42

great discussion

00:43

um so with that Jack take it away

00:47

okay fantastic thanks a lot

00:52

and right

00:56

okay so

00:58

um today I'm going to talk about

01:00

compression for AGI and the theme of

01:02

this talk is that I want people to kind

01:05

of think deeply about uh Foundation

01:09

models and their training objective and

01:12

think deeply about kind of what are we

01:14

doing why does it make sense what are

01:17

the limitations

01:18

um

01:19

this is quite a important topic at

01:22

present I think there's a huge amount of

01:25

interest in this area in Foundation

01:27

models large language models their

01:28

applications and a lot of it is driven

01:31

very reasonably just from this principle

01:33

that it works and it works so it's

01:34

interesting but if we just kind of sit

01:37

within the kind of it works realm it's

01:40

hard to necessarily predict or have a

01:43

good intuition of why it might work or

01:45

where it might go

01:48

so some takeaways that I want so I hope

01:50

people like people hopefully to take

01:52

from this talk some of them are

01:54

quite pragmatic so I'm going to talk

01:57

about some background on the minimum

01:58

description length and why it's seeking

02:01

the minimum description length of our

02:03

data may be an important role in solving

02:05

perception uh I want to make a

02:08

particular point that generative models

02:10

are actually lossless compressors and

02:12

specifically large language models are

02:15

actually state of the art lossless

02:16

compressors which may be a

02:19

counter-intuitive point to many people

02:20

given that they are very large and use a

02:23

lot of space and I'm going to unpack

02:25

that

02:26

in detail and then I'm also going to

02:29

kind of end on some notes of limitations

02:32

of the approach of compression

02:35

so

02:37

let's start with this background minimum

02:38

description length and why it relates to

02:40

perception so

02:42

even going right back to the kind of

02:44

ultimate goal of learning from data we

02:48

may have some set of observations that

02:50

we've collected some set of data that we

02:52

want to learn about which we consider

02:55

this small red circle

02:57

and we actually have a kind of a

03:00

two-pronged goal we want to learn like

03:02

uh how to kind of predict and understand

03:05

our observed data with the goal of

03:09

understanding and generalizing to a much

03:10

larger set of Universe of possible

03:12

observations so we can think of this as

03:16

if we wanted to learn from dialogue data

03:19

for example we may have a collection of

03:21

dialogue transcripts but we don't

03:23

actually care about only learning about

03:25

those particular dialogue transcripts we

03:27

want to then be able to generalize to

03:29

the superset of all possible valid

03:31

conversations that a model may come

03:33

across right so

03:36

what is an approach what is a very like

03:38

rigorous approach to trying to learn to

03:41

generalize well I mean this has been a

03:43

philosophical question for multiple

03:45

thousands of years

03:47

um

03:48

and even actually kind of full Century

03:51

BC uh there's like some pretty good

03:53

um principles that philosophers are

03:56

thinking about so Aristotle had this

03:59

notion of

04:00

um

04:02

assuming the super superiority of the

04:04

demonstration which derives from fewer

04:06

postulates or hypotheses so this notion

04:09

of uh we have some

04:11

um

04:12

um simple set of hypotheses

04:15

um

04:16

then this is probably going to be a

04:18

superior description of a demonstration

04:21

now this kind of General kind of simpler

04:23

is better

04:25

um

04:26

theme is more recently attributed to

04:29

William of Ockham 14th century Occam's Razor this

04:33

is something many people may have

04:34

encountered during a machine learning or

04:36

computer science class

04:38

he is essentially continuing on this

04:40

kind of philosophical theme the simplest

04:42

of several competing explanations is

04:44

always likely likely to be the correct

04:46

one

04:47

um now I think we can go even further

04:50

than this within machine learning I

04:52

think right now Occam's razor is almost

04:54

used to defend almost every possible

04:56

angle of research but I think one

04:58

actually very rigorous incarnation of

05:00

Occam's Razor is from Solomonoff's

05:04

theory of inductive inference 1964. so

05:06

we're almost at the present day and he

05:08

says something quite concrete and

05:09

actually mathematically proven which is

05:11

that if you have a universe of data

05:13

which is generated by an algorithm and

05:15

observations of that universe so this is

05:17

the small red circle

05:19

encoded as a data set are best predicted

05:21

by the smallest executable Archive of

05:23

that data set so that says the smallest

05:25

lossless prediction or otherwise known

05:28

as the minimum description length so I

05:30

feel like that final one is actually

05:31

putting into mathematical and quite

05:33

concrete terms

05:34

um these kind of Notions that existed

05:37

through time immemorial

05:38

and it kind of we could even relate this

05:40

to a pretty I feel like that is a quite

05:43

a concrete and actionable retort to this

05:46

kind of

05:47

um quite

05:48

um murky original philosophical question

05:51

but if we even apply this to a

05:52

well-known philosophical problem Searle's

05:54

Chinese room thought experiment where there's

05:57

this notion of a computer program or

05:58

even a person kind of with it within a

06:01

room that is going to perform

06:02

translation from English English to

06:05

Chinese and they're going to

06:07

specifically use a complete rulebook of

06:10

all possible

06:12

inputs or possible say English phrases

06:15

they receive and then and then the

06:16

corresponding say Chinese translation

06:18

and the original question is does this

06:20

person kind of understand how to perform

06:22

translation uh and I think actually this

06:24

compression argument this race on this

06:26

compression argument is going to give us

06:28

something quite concrete here so uh this

06:31

is kind of going back to the small red

06:32

circle large white circle if if we have

06:35

all possible translations and then we're

06:38

just following the rule book this is

06:39

kind of the least possible understanding

06:41

we can have of translation if we have

06:42

such a giant book of all possible

06:44

translations and it's quite intuitive if

06:46

we all we have to do is coin a new word

06:49

or have a new phrase or anything which

06:50

just doesn't actually fit in the

06:52

original book this system will

06:54

completely fail to translate because it

06:56

has the least possible understanding of

06:58

translation and it has the least

06:59

understandable version of translation

07:02

because that's the largest possible

07:03

representation of the the task the data

07:06

set however if we could make this

07:08

smaller maybe we kind of distill

07:12

sorry we distill this to a smaller set

07:13

of rules some grammar some basic

07:15

vocabulary and then we can execute this

07:17

program maybe such a system has a better

07:19

understanding of translation so we can

07:21

kind of grade it based on how compressed

07:23

this rulebook is and actually if we

07:24

could kind of compress it down to the

07:27

kind of minimum description like the

07:28

most compressed format the task we may

07:30

even argue such a system has the best

07:32

possible understanding of translation

07:35

um now for foundation models we

07:38

typically are in the realm where we're

07:39

talking about a generative model one that

07:40

places probability on natural data and

07:43

what is quite nice is we can actually

07:44

characterize the lossless compression of

07:46

a data set using a generative model in a

07:48

very precise mathematical format so

07:51

Solomonoff says we should try and find

07:53

the minimum description length well we

07:55

can actually try and do this practically

07:57

with a generative model so the size of the

08:00

lossless compression of our data set D

08:02

can be characterized as the negative log

08:05

likelihood from a generative model

08:06

evaluated over D plus the description

08:09

length of this generator model so for a

08:14

neural network we can think of this as

08:15

the amount of code to initialize the

08:17

neural network

08:18

that might actually be quite small

08:21

this is not actually something that

08:23

would be influenced by the size of the

08:24

neural network this would just be the

08:26

code to actually instantiate it so it

08:29

might be a couple hundred kilobytes to

08:31

actually Implement a code base which

08:32

trains a transformer for example and

08:35

actually this is quite a surprising fact

08:37

so what does this equation tell us does

08:40

it tell us anything new well I think it

08:42

tells us something quite profound the

08:44

first thing is we want to minimize this

08:46

general property and we can do it by two

08:48

ways one is via having a generative

08:51

model which has better and better

08:52

performance of our data set that is a

08:54

lower and lower negative log likelihood

08:55

but also we are going to account for the

08:58

prior information that we inject into F

09:01

which is that we can't stuff F full of

09:04

priors such that maybe it gets better

09:06

performance but overall it does not get

09:08

a better compression

09:10

um so

09:12

on that note yeah compression is a a

09:15

cool way of thinking about

09:17

how we should best model our data and

09:19

it's actually kind of a non-gameable

09:21

objective so contamination is a big

09:24

problem within uh machine learning and

09:27

trying to evaluate progress is often

09:29

hampered by Notions of whether or not

09:31

test sets are leaked into training sets

09:33

well with compression this is actually

09:36

not not something we can game so imagine

09:39

we pre-trained F on a whole data set D

09:42

such that it perfectly memorizes the

09:44

data set

09:45

AKA such that the probability of D is

09:48

one log probability is zero in such a

09:51

case if we go back to this formula the

09:53

first term will zip to zero

09:56

however now essentially by doing that by

09:58

injecting and pre-training our model on

10:01

this whole data set we have to add that

10:03

to the description length of our

10:04

generative model so now F not only

10:06

contains the code to train it Etc but it

10:08

also contains essentially a description

10:10

length of d

10:11

so in this setting essentially a

10:12

pre-contaminating f it does not help us

10:15

optimize the compression

10:18

and this contrasts to regular test set

10:20

benchmarking where we may be just

10:22

measuring test set performance and

10:24

hoping that measures generalization and

10:26

is essentially a proxy for compression

10:27

and it can be but also we can find lots

10:30

and lots of scenarios where we

10:31

essentially have variations of the test

10:33

set that have slipped through the net in

10:35

our training set and actually even right

10:37

now within Labs comparing large language

10:40

models this notion of contamination

10:42

affecting evals remains a continual

10:45

kind of thorn um in in the side of

10:48

kind of clarity

10:49

Okay so we've talked about philosophical

10:52

backing of the minimum description

10:54

length and maybe why it's a sensible

10:56

objective

10:58

and now I'm going to talk about it

10:59

concretely for large language models and

11:01

we can kind of map this to any uh

11:04

generative model but I'm just going to

11:06

kind of ground it specifically in the

11:07

marsh language model so if we think

11:10

about what is the log prob of our

11:11

data D well it's the sum of our next

11:14

token prediction of tokens over our data

11:18

set

11:19

um

11:20

so this is something that's essentially

11:22

our training objective if we think of

11:24

our data set D

11:25

um and we have one Epoch then this is

11:28

the sum of all of our training loss so

11:30

it's pretty tangible term it's a real

11:31

thing we can measure and F is the

11:33

description length of our

11:35

Transformer language model uh and

11:38

actually there are people that have

11:39

implemented a Transformer and a training

11:41

regime just without any external

11:43

libraries in about I think 100 to 200

11:45

kilobytes so this is actually something

11:47

that's very small

11:49

um and and as I said I just want to

11:51

enunciate this this is something where

11:53

it's not dependent on the size of our

11:55

neural network so if a piece of code can

11:57

instantiate a 10 layer Transformer the

12:00

same piece of code you can just change a

12:02

few numbers in the code it can

12:03

instantiate a 1000 layer Transformer

12:05

actually the description length of our

12:07

initial Transformer is unaffected really

12:10

by how large the actual neural network

12:13

is we're going to go through an example

12:15

of actually using a language model to

12:16

losslessly compress where we're going to

12:18

see why this is the case
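As a rough illustration of this point (the numbers below are hypothetical, not from the talk), the two-part description length is just the size of the training code plus the summed next-token log loss, and the parameter count of the network appears nowhere in it:

```python
import math

# Hypothetical sketch of the two-part (MDL) description length: the size of
# the training code F plus the summed next-token log loss over the data set.
# Note the parameter count of the network never enters this formula.
def description_length_bits(code_size_bytes, per_token_losses_nats):
    loss_bits = sum(l / math.log(2) for l in per_token_losses_nats)  # nats -> bits
    return code_size_bytes * 8 + loss_bits

# toy numbers: a ~1 MB training script and a few per-token training losses
bits = description_length_bits(1_000_000, [2.0, 1.5, 0.7])
```

Changing the code from instantiating a 10-layer to a 1000-layer Transformer changes only a couple of constants, so `code_size_bytes`, and hence the first term, stays essentially fixed.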

12:21

okay so let's just give like a specific

12:23

example and try and ground this out

12:25

further so okay llama it was a very cool

12:28

paper that came out from FAIR just like

12:29

late last week I was looking at the

12:32

paper here's some training curves

12:34

um now forgetting the smaller two models

12:37

there are the two largest models are

12:39

trained on one Epoch of their data set

12:41

so actually we could sum their training

12:43

losses uh AKA this quantity

12:47

and we can also roughly approximate the

12:50

size of um the code base that

12:53

was used to train them

12:56

um and therefore we can see like okay

12:58

which of these two models the 33b or the

13:00

65b is the better compressor and

13:01

therefore which would we expect to be

13:03

the better model at generalizing and

13:05

having greater set of capabilities so

13:09

it's pretty it's going to be pretty

13:11

obvious it's 65b I'll tell you why firstly

13:13

just to drum this point home these

13:16

models all have the same description

13:17

length they have different number of

13:18

parameters but the code that's used to

13:20

generate them is actually of same of the

13:23

same complexity however they don't have

13:25

the same integral of the training loss

13:28

65b has a smaller integral of its

13:31

training loss

13:32

and therefore if we sum these

13:35

two terms we would find that 65b

13:36

essentially creates the more concise

13:39

description of its training data set

13:42

okay so that might seem a little bit

13:43

weird I'm going to even plug some actual

13:44

numbers in let's say we assume it's

13:46

about one megabyte for the code to

13:48

instantiate and train the Transformer

13:50

and then if we actually just calculate

13:53

this roughly it looks to be about say

13:55

400 gigabytes

13:57

um

13:58

you have the sum of your log loss

13:59

converting into bits and then bytes it's

14:02

going to be something like 400 gigabytes

14:03

and this is from an original data set

14:06

which is about 5.6 terabytes of raw text

14:08

so 1.4 trillion tokens times four is

14:11

about 5.6 terabytes so that's a

14:13

compression rate of 14x

14:15

um the best text compressor on the

14:17

Hutter Prize is 8.7x so the takeaway of

14:20

this point is

14:21

um actually as we're scaling up and

14:24

we're creating more powerful models and

14:25

we're training them on more data we're

14:27

actually creating something which

14:29

actually is providing a lower and lower

14:31

lossless compression of our data even

14:34

though the intermediate model itself may

14:36

be very large
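The back-of-envelope LLaMA numbers above can be checked in a few lines (all quantities here are the talk's rough estimates, not measured values):

```python
# Rough estimates from the talk: ~1.4 trillion tokens at ~4 bytes of raw text
# per token, a summed training loss of roughly 400 GB once converted from
# nats to bits to bytes, and ~1 MB of training code.
raw_bytes = 1.4e12 * 4                  # ~5.6 TB of raw text
compressed_bytes = 400e9 + 1e6          # summed log loss + training code
ratio = raw_bytes / compressed_bytes    # ~14x lossless compression
```

The ~1 MB of code is negligible next to the 400 GB transcript, which is why the size of the model's implementation, let alone its parameters, barely moves the ratio.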

14:40

okay so now I've talked a bit about how

14:43

large language models are state of the

14:45

art lossless compressors but I just want

14:47

to maybe go through the mechanics of how

14:49

do we actually get a something like a

14:51

generative model literally losslessly

14:53

compress this may be something that's

14:55

quite mysterious like what is happening

14:57

like

14:57

when you actually losslessly compress

14:59

this thing is it the weights or is it

15:01

something else

15:02

so I'm going to give us a hypothetical

15:04

kind of scenario we have two people

15:07

Satya and Sundar and Satya wants to send a

15:09

data set of the world's knowledge

15:10

encoded in D to Sundar they both have

15:13

access to very powerful supercomputers

15:15

but there's a low bandwidth connection

15:17

we are going to use a trick called

15:19

arithmetic encoding as a way of

15:22

communicating the data set so say we

15:24

have a token x at timestep t from some

15:27

vocab and a probability distribution p

15:29

over tokens

15:31

arithmetic encoding without going into

15:33

the nuts and bolts is a way of allowing

15:35

us to map our token x given our

15:38

probability distribution over tokens to

15:41

some Z

15:43

where Z is essentially our compressed

15:46

transcripts of data and Z is going to

15:49

use exactly -log2 p_t(x_t) bits so

15:54

the point of this step is like

15:58

arithmetic encoding actually Maps it to

16:00

some kind of like floating Point number

16:01

as it turns out and it's a real

16:04

algorithm this is like something that

16:05

exists in the real world it does require

16:08

technically infinite precision to use

16:10

exactly these number of bits and

16:12

otherwise you maybe you're going to pay

16:14

a small cost for implementation but it's

16:16

roughly approximately optimal in terms

16:19

of the encoding and we can use

16:20

arithmetic decoding

16:22

um to take this encoded transcript and

16:25

as long as we have our probability

16:26

distribution of tokens we can then

16:28

recover the original token so we can

16:30

think about probability probability

16:32

distribution as kind of like a key it

16:34

can allow us to kind of lock in a

16:36

compressed copy of our token and then

16:38

unlock it

16:39

so if p is uniform so there's no

16:42

information about our tokens then p

16:45

is just one over the size of V

16:47

so we can use log2 V

16:49

bits of space uh that is just

16:52

essentially the same as naively storing

16:53

our XT token in binary uh if p is an

16:58

oracle so it knows like exactly what the

17:00

token was going to be

17:01

so P of x equals one then log 2p equals

17:05

zero and this uses zero space so these

17:08

are the two extremes and obviously what

17:10

we want is a generative model which

17:11

better and better models our data and

17:13

therefore it uses less space
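A minimal arithmetic coder can be sketched with exact rational arithmetic (this is the "infinite precision" idealization mentioned above, a toy for illustration; real coders use fixed-precision integers and pay a small overhead):

```python
from fractions import Fraction
from math import ceil, log2

# Toy arithmetic coder with exact rationals. `dist` maps tokens to Fraction
# probabilities summing to 1; the SAME dist must be used to encode and decode.

def encode(tokens, dist):
    low, high = Fraction(0), Fraction(1)
    for tok in tokens:  # narrow [low, high) by each token's probability slice
        width = high - low
        c = Fraction(0)
        for t, p in dist.items():
            if t == tok:
                low, high = low + c * width, low + (c + p) * width
                break
            c += p
    # pick a binary fraction z inside [low, high); its length k is within
    # ~2 bits of -log2(high - low) = sum over tokens of -log2 p(token)
    k = ceil(-log2(high - low)) + 1
    z = Fraction(ceil(low * 2**k), 2**k)
    return z, k  # z is the compressed transcript, k its size in bits

def decode(z, n, dist):
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):  # find which probability slice z falls into, n times
        width = high - low
        c = Fraction(0)
        for t, p in dist.items():
            lo, hi = low + c * width, low + (c + p) * width
            if lo <= z < hi:
                out.append(t)
                low, high = lo, hi
                break
            c += p
    return out

dist = {"a": Fraction(3, 4), "b": Fraction(1, 4)}   # a fairly confident model
msg = list("aaaaaaaabaaaaaaa")                      # 15 a's and one b
z, k = encode(msg, dist)
assert decode(z, len(msg), dist) == msg
assert k == 10  # vs 16 bits for naive 1-bit-per-token storage
```

The distribution really does act like a key: `decode` only recovers the tokens because it is handed the same `dist` used to encode, and the more confident the model, the shorter the transcript.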

17:15

so what would actually happen in

17:17

practice if Satya can take his data set

17:20

of tokens train a Transformer and get a

17:23

subsequent set of probabilities uh over

17:27

the tokens like so next token prediction

17:29

and then use arithmetic encoding to map

17:32

it to this list of transcripts and this

17:34

is going to be of size sum of negative

17:37

log likelihood of your Transformer over

17:39

the data set

17:40

and he's also going

17:42

to send that list of transcripts and

17:44

some code that can deterministically

17:46

train a larger Transformer

17:48

and so

17:49

he sends those two things what does that

17:52

equal in practice the size of f the size

17:54

of your generative model description plus

17:57

the sum of your negative

17:59

log likelihood of your data set so as

18:02

you can see it doesn't matter whether

18:04

the Transformer was one billion

18:06

parameters one trillion parameters

18:09

plus plus he's not actually sending the

18:12

neural network he's sending the

18:13

transcript of encoded logits plus the

18:17

code

18:18

and then on the other side Sundar can

18:20

run this code which is deterministic and

18:22

the model is going to run the neural

18:24

network it gives a probability

18:25

distribution to the first token he's

18:27

going to use arithmetic decoding with

18:29

that to get his first token you can

18:31

either train on that or whatever the

18:32

code does so then continue on

18:35

predict the next token etc etc and

18:37

essentially

18:39

iteratively go through and recover the

18:41

whole data set

18:42

um so this is kind of like almost a

18:44

thought experiment because in practice to

18:46

send this data at 14x

18:48

compression say if we're talking about

18:50

the Llama model uh it's a bit

18:52

more compressed than gzip but this is

18:54

requiring a huge amount of intermediate

18:56

compute which is to train a large

18:58

language model which feels prohibitive

19:00

but this thought experiment is really

19:02

derived not because we actually might

19:04

want to send data on a smaller and

19:07

smaller bandwidth it's also just derived

19:09

to kind of explain and prove why we can

19:12

actually losslessly compress with

19:14

language models and why that is their

19:16

actual objective

19:18

um and if this kind of setup feels a

19:21

little bit contrived well the fun fact

19:23

is this is the exact setup that Claude

19:25

Shannon was thinking about

19:26

um when he kind of proposed language

19:28

models in the 40s he was thinking about

19:30

having a discrete set of data and how

19:33

can we better communicate

19:35

over a low bandwidth Channel and

19:37

language models and entropy coding

19:39

essentially was the topic that he was

19:41

thinking about at Bell Labs

19:46

Okay so we've talked mechanically about

19:48

well we've talked about the philosophy

19:50

of kind of why do why why be interested

19:53

in description length relating it to

19:55

generalization talks about why

19:57

generative models are lossless

19:59

compressors talked about why our current

20:02

large language models are actually

20:03

state-of-the-art lossless compressors

20:05

and are providing some of the most

20:07

compressed representations of our source

20:09

data so let's just think about solving

20:12

perception and moving towards AGI what's

20:14

the recipe well it's kind of a two-step

20:16

process one is collect all useful

20:19

perceptual information that we want to

20:21

understand and the second is learn to

20:23

compress it as best as possible with a

20:25

powerful Foundation model

20:26

so the nice thing about this is it's not

20:29

constrained to a particular angle for

20:32

example you can use any research method

20:34

that improves compression and I would

20:36

posit that this will further Advance our

20:38

capabilities towards perception based on

20:41

this rigorous foundation so that might

20:43

be a better architecture it may be

20:45

further scaling of data and compute

20:48

this is in fact something that's almost

20:49

become a meme people say scale is all

20:52

you need but truly I think scale is only

20:56

going to benefit as long as it is

20:57

continuing to significantly improve

21:00

compression but you could use any

21:02

other technique and this doesn't have to

21:04

be just a regular generative model it

21:06

we could even maybe spend a

21:08

few more bits on the description length

21:10

of F and add in some tools add in things

21:12

like a calculator allow it to make use

21:15

of tools to better predict its data

21:16

allow it to retrieve over the past use

21:19

its own synthetic data to generate and

21:21

then learn better there's many many

21:22

angles we could think about that are

21:25

within the scope of a model

21:27

better compressing its source data

21:29

to generalize over the universe of

21:30

possible observations

21:33

I just want to remark at this point on a

21:36

very common point of confusion on this

21:38

topic which is about lossy compression

21:40

so I think it's a very reasonable

21:43

um

21:44

thought to maybe confuse what a neural

21:47

network is doing with lossy compression

21:49

especially because

21:51

information naturally seeps in from the

21:54

source training data into the weights of

21:56

a neural network and neural network can

21:58

often memorize it often does memorize

21:59

and can repeat many things that it's

22:01

seen but it doesn't repeat everything

22:03

perfectly so it's lossy and it's also

22:05

kind of a terrible lossy compression

22:07

algorithm so in the lossy

22:09

compression case you would actually be

22:12

transmitting the weights of the

22:14

parameters of a neural network and they

22:16

can often actually be larger than your

22:17

Source data so I think there's a very

22:19

interesting New Yorker article about

22:21

about this kind of Topic in general kind

22:23

of thinking about you know what are what

22:25

are language models doing what are

22:26

Foundation models doing and I think

22:28

there's a lot of confusion in this

22:30

article specifically on this topic where

22:32

from the perspective of lossy

22:35

compression

22:36

and neural network feels very kind of

22:38

sub-optimal it's losing information

22:40

right so it doesn't even do reconstruction

22:42

very well and it's potentially bloated

22:44

and larger and has all these other

22:46

properties

22:47

I just wanted to take this kind of

22:49

point to reflect

22:51

on the original goal which is we really

22:53

care about understanding and

22:55

generalizing to the space of the

22:57

universe of possible observations so we

22:59

don't care and we don't train towards

23:01

reconstructing our original data

23:04

um I think if we did then this article

23:08

basically concludes like if we did just

23:10

care about reconstructing this original

23:11

data like why do we even train over it

23:13

why not just keep the original data as

23:15

it is and I think that's a very valid

23:16

point uh but if we care instead about

23:19

loss like a lossless compression of this

23:22

then essentially this talk is about

23:25

linking that to this wider problem of

23:27

generalizing to many many different

23:29

types of unseen data

23:34

great so I've talked about

23:37

the mechanics of compression with

23:40

language models and linking it to this

23:42

confusion with lossy compression what

23:45

are some limitations that I think are

23:46

pretty valid

23:48

um so I think

23:50

there's one concern with this approach

23:52

which is that it may be just the right

23:55

thing to do or like an unbiased kind of

23:58

attempt at solving perception but maybe

24:00

it's just not very pragmatic and

24:03

actually trying to kind of model

24:04

everything and compress everything it

24:06

may be kind of correct but very

24:07

inefficient so I think Image level

24:09

modeling is a good example of this where

24:12

modeling a whole image at the pixel

24:14

level has often kind of been

24:16

prohibitively expensive to like work

24:18

incredibly well and therefore people

24:21

have changed the objective or or ended

24:23

up modeling a slightly

24:25

more semantic level

24:28

um and I think even if it maybe seems

24:31

plausible now we can go back to pixel

24:32

level image modeling and maybe we just

24:34

need to tweak the architecture if we

24:35

turn this to video modeling every pixel

24:37

of every frame it really feels

24:39

preemptively crazy and expensive so one

24:42

limitation is you know maybe we do need

24:44

to kind of first filter like what are

24:46

what are all the pieces of information

24:47

that we know we definitely are still

24:49

keeping and we want to model but then

24:51

try and have some way like filtering out

24:53

the extraneous computation

24:55

the the kind of bits of information we

24:57

just don't need and then maybe we can

24:59

then filter out to a much smaller subset

25:01

and then and then we losslessly compress

25:03

that

25:04

um

25:05

another very valid point is I think this

25:08

is often framed uh to people that maybe

25:11

are thinking that this is like the only

25:13

ingredient for AGI is that crucially

25:15

there's lots of just very useful

25:17

information in the world that is not

25:18

observable and therefore we can't just

25:21

expect to compress all observable

25:24

observations and achieve AGI because

25:26

there'll just be lots of things we're

25:27

missing out

25:28

um so I think a good example of this

25:30

would be something like Alpha zero so

25:33

playing the game of Go

25:35

um

25:36

I think if you just observe the limited

25:38

number of human games that have ever

25:40

existed one thing that you're missing is

25:42

all of the intermediate search trees of

25:44

all of these expert players and one nice

25:46

thing about something like Alpha zero

25:47

with its kind of self-play mechanism is

25:49

you essentially get to collect lots of

25:51

data of intermediate search trees of

25:53

many many different types of games

25:55

um so that kind of on policy behavior of

25:57

like actually having an agent that can

25:59

act and then Source out the kind of data

26:00

that it needs I think is still very

26:02

important so and in no way kind of

26:04

diminishing uh the importance of RL or

26:06

on policy kind of behavior

26:09

um but I think yeah for for everything

26:11

that we can observe

26:13

um that this is kind of like the

26:15

compression story ideally applies

26:19

great so going to conclusions

26:22

um

26:24

so compression has been an objective

26:28

that actually we are generally striving

26:30

towards as we build better and larger

26:32

models which may be counter-intuitive

26:34

given the models themselves can be very

26:36

large

26:37

um

26:38

the most known entity right now the one

26:41

on a lot of people's minds for better

26:43

compression is actually scale scaling

26:45

compute

26:46

um and and maybe even scaling memory but

26:49

scale isn't all you need there are many

26:51

algorithmic advances out there that I

26:54

think are very interesting research problems

26:55

and

26:57

and if we look back uh basically all of

27:00

the major language modeling advances

27:02

have been synonymous with far greater

27:04

text compression so even going back from

27:07

uh the creation of n-gram models on pen

27:10

and paper and then kind of bringing them

27:12

into computers and then having like kind

27:14

of computerized huge tables of n-gram

27:16

statistics of language this kind of

27:18

opened up the ability for us to do

27:21

um things like speech to text with a

27:23

reasonable accuracy

27:25

um bringing that system to uh deep

27:29

learning via rnns has allowed us to have

27:32

much more fluent text that can span

27:34

paragraphs and then actually be

27:35

applicable to tasks like translation and

27:39

then in the recent era of large-scale

27:41

Transformers we're able to further

27:43

extend the context and extend the model

27:46

capabilities via compute such that we

27:50

are now in this place where we're able

27:52

to use

27:53

language models and Foundation models in

27:55

general

27:57

um to understand very very long spans of

27:59

text and to be able to create incredibly

28:01

useful or incredibly tailored incredibly

28:03

interesting

28:04

um Generations so I think this is going

28:07

to extend but it's a big and interesting

28:10

open problem uh what are going to be the

28:12

advances to kind of give us further

28:15

Paradigm shifts in this kind of

28:16

compression uh improved compression

28:21

right so

28:22

um yeah this talk is generally just a

28:24

rehash for the message of

28:26

former and current colleagues of mine

28:27

especially Marcus Hutter Alex Graves Joel

28:30

Veness so I just want to acknowledge

28:32

them and uh thanks a lot for listening

28:34

I'm looking forward to uh chatting about

28:36

some questions

28:38

great thanks so much Jack

28:41

um I'm actually going to ask you to keep

28:42

your slides on the screen because I

28:44

think we had some uh questions about uh

28:48

just kind of uh understanding the

28:51

um some some of the mathematical

28:53

statements in the talk so I think it

28:55

would be helpful to to kind of go go

28:56

back over some of the slides yeah I

29:00

think uh some people were confused a bit

29:02

by the arithmetic decoding

29:05

um so in particular uh maybe it'll be

29:07

useful to to go back to discussion of

29:09

the arithmetic decoding and uh I think

29:11

people are a bit confused about

29:13

um how is it possible for the receiver

29:16

to decode the message and get the

29:19

original data set back without having

29:21

access to the trained model

29:23

yeah

29:25

um well okay

29:27

um I'll do in two steps so one let's

29:30

just imagine they don't have the fully

29:32

trained model that they have a partially

29:33

trained model

29:35

and so they are able to get a next token

29:37

prediction

29:38

and then

29:40

um

29:40

they have the the receiver also has some

29:44

of the encoded transcripts at T this

29:46

allows them I guess maybe here in the

29:49

case of language modeling this would

29:51

look like XT plus one say if it was like

29:52

PT Plus one but anyway

29:54

um this may allow them to recover the

29:57

next token and then they're going to

29:59

build it up in this way so maybe I'll

30:01

just delay on this particular Slide the

30:04

idea it would look like is we we the

30:06

receiver does not receive the neural

30:08

network it just receives the code to

30:09

instantiate kind of the fresh neural

30:11

network and run the identical training

30:14

setup that it saw before and obviously

30:16

the training setup as it saw before

30:18

we're going to imagine like batch size

30:19

of one one token at a time just for

30:21

Simplicity so uh and let's just imagine

30:24

maybe there's like a beginning of text

30:27

token here first so

30:29

so the receiver so now he just has to

30:31

run the code at first there's nothing to

30:33

decode yet there's no tokens and there's

30:35

a fresh neural network uh that's going

30:37

to give us like a probability

30:39

distribution for the first token and so

30:41

he's got this probability distribution

30:43

for the first token and he's got the

30:44

transcript

30:46

um of what that token should be and you

30:48

can use arithmetic decoding to actually

30:49

recover that first token

30:51

and then let's imagine for Simplicity we

30:54

actually like train like one SGD step on

30:56

one token at a time so we take our SGD

30:58

step and then we have the model that's

31:01

like was used to predict the next token

31:03

so we can get that P2 we have Z2 and

31:06

then we can recover X2 so now we've

31:09

recovered two tokens and we can

31:10

essentially do this iteratively

31:12

essentially reproduce this whole

31:15

training procedure on the receiving side

31:17

and as we reproduce the whole

31:19

training procedure we actually recover

31:21

the whole data set
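The decode-while-retraining loop just described can be sketched end to end with a toy stand-in for the model: an adaptive unigram counter plays the role of the deterministically trained Transformer, and each per-token transcript entry is a rational number inside that token's probability interval (which arithmetic coding would store in about -log2 p_t(x_t) bits). The sender/receiver names follow the talk's Satya/Sundar example; everything below is an illustrative assumption, not the talk's actual setup.

```python
from fractions import Fraction

VOCAB = ["a", "b"]

class ToyModel:
    """Stand-in for the Transformer: deterministic, updated one token at a time."""
    def __init__(self):
        self.counts = {t: 1 for t in VOCAB}       # Laplace-smoothed counts
    def predict(self):
        tot = sum(self.counts.values())
        return {t: Fraction(c, tot) for t, c in self.counts.items()}
    def update(self, tok):                        # stands in for one SGD step
        self.counts[tok] += 1

def interval(dist, tok):
    """The probability slice [lo, hi) that `dist` assigns to `tok`."""
    c = Fraction(0)
    for t, p in dist.items():
        if t == tok:
            return c, c + p
        c += p

def send(data):
    model, transcript = ToyModel(), []
    for tok in data:
        lo, hi = interval(model.predict(), tok)
        transcript.append((lo + hi) / 2)  # any point in the slice identifies tok
        model.update(tok)
    return transcript  # this list plus the (tiny) training code is all we send

def receive(transcript):
    model, data = ToyModel(), []
    for z in transcript:
        dist = model.predict()
        tok = next(t for t in dist
                   if interval(dist, t)[0] <= z < interval(dist, t)[1])
        data.append(tok)
        model.update(tok)  # identical deterministic "training" on this side
    return data

data = list("abbabbba")
assert receive(send(data)) == data  # lossless: receiver retrains and recovers D
```

Because both sides run the identical deterministic update, they compute identical distributions at every step, so the weights themselves never need to be transmitted.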

31:23

yeah so it's a crazy expensive way of

31:27

actually uh compressing

31:30

data and it might feel once again like

31:32

oh but since we're not going to

31:34

literally do that it's too expensive why

31:36

do I need to learn about it and this

31:38

really is just a way of it's like a

31:41

proof by Construction in case

31:44

um you were like you know is this

31:46

actually true like is the lossless

31:48

compressed D actually equal to this and

31:50

it's like yeah like here's how we

31:51

literally can do it and it's just the

31:53

reason we don't do it in practice is

31:54

because it would be very expensive but

31:56

there's nothing actually stopping us

31:57

it's not like completely theoretical

31:59

idea yeah

32:02

okay so all right so to kind of maybe

32:06

I'll try to explain it back to you and

32:08

then um if people on the chat and the uh

32:12

Discord still have questions

32:14

um they they can ask and then we can we

32:16

can get some clarifications so basically

32:18

you're saying you initialize a model

32:21

um you have it do like some beginning of

32:24

token thing and it'll predict what what

32:26

it thinks the first uh what the first

32:29

token should be

32:30

um and then you use arithmetic encoding

32:33

to somehow say okay here's the here's

32:35

the prediction and then we're going to

32:37

correct it to the the actual what the

32:39

actual token is so that Z1 has enough

32:42

information to figure out what that

32:44

actual first token is yeah and then you

32:46

use that first token run one step of SGD

32:49

predict you know get the probability

32:51

distribution for the second one now you

32:54

have enough information to decode uh the

32:57

the second thing like maybe

32:59

you know uh yeah uh it's like take the

33:03

arg max but you know take the third

33:05

arg max or something like that

33:08

um and then so you're saying that that

33:10

is enough information to reconstruct the

33:13

the data set D exactly yeah

33:17

okay great great so uh yeah so I I

33:21

personally you know I understand a bit

33:23

better now and that that also makes

33:24

sense why the model

33:26

um you know the the model weights and

33:28

the the size of the model are not uh

33:31

actually part of that that compression

33:34

um one question that that I also had

33:36

while

33:38

um you know uh talking through that

33:40

explanation so how does that you know

33:43

compression now go back and uh how's

33:46

that related to the loss curve that you

33:48

get

33:49

um at the end of training is it that the

33:52

better your model is by the end of

33:53

training then you need to communicate

33:54

less information just like I don't know

33:56

take arg max or something like that so I

33:58

just want to say yeah like this is a

34:00

Formula if we look at this this is

34:02

basically pretty much the size of your

34:04

arithmetic encoded transcript

34:07

and this is like your

34:09

negative log likelihood of your next

34:10

token prediction at every step so let's

34:13

just imagine this was batch size one

34:15

this is literally the sum

34:18

of every single training loss point

34:20

because summing under a curve

34:23

this is like the integral under the curve

34:26

so this

34:27

this value equals this and I did I did

34:30

it just by summing under this curve so

34:32

it's like a completely real quantity you

34:34

get you actually even are getting from

34:37

your training curve

34:38

so it's a little bit different to just

34:40

the final training loss it's the

34:42

integral during the whole training

34:43

procedure
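A tiny sketch of that point, with made-up loss curves: in the one-epoch, batch-size-one setting the compressed size is the area under the training curve, so a model that learns faster compresses better even if both reach the same final loss.

```python
import math

def compressed_bits(loss_curve_nats):
    # sum of per-step training losses (nats -> bits) = area under the curve
    return sum(l / math.log(2) for l in loss_curve_nats)

fast_learner = [2.0, 1.0, 0.5, 0.4, 0.4]   # hypothetical loss curves
slow_learner = [2.0, 1.8, 1.5, 0.9, 0.4]   # same final loss, bigger integral
assert compressed_bits(fast_learner) < compressed_bits(slow_learner)
```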

34:46

great so okay and then yeah

34:49

we can think of during training we're

34:51

going along and let's imagine we're in

34:53

the one Epoch scenario we're going along

34:55

and then every single step we're

34:56

essentially get a new kind of uh

34:59

out of sample like a new

35:02

sequence to try and predict and then all

35:04

we care about is trying to predict that

35:06

as best as possible and then continuing

35:08

that process and actually what we care

35:10

about is essentially all predictions

35:12

equally and trying to get the whole

35:14

thing to learn like either faster

35:15

initially and then to a lower value or

35:18

however we want we just want to minimize

35:20

this integral and basically what this

35:22

formula says it can minimize this

35:23

integral we should get something that's

35:24

essentially better and better

35:26

understands uh the data or at least

35:28

generalizes better and better

35:31

gotcha okay cool

35:34

um all right so uh let me see I think

35:36

now is a good time to end the screen

35:38

share

35:39

great okay cool

35:41

um and now uh we can go to to some more

35:44

questions uh in the in the class so

35:47

there there were a couple questions

35:48

around

35:50

um kind of uh what does this compression

35:53

uh Viewpoint allow you to do so there's

35:56

a couple questions on so has this mdl

35:59

perspective kind of

36:01

um informed the ways that you would that

36:03

we train models now or any of the

36:05

architectures that we've done now yeah

36:07

I think the most like immediate

36:09

one is that it clarifies a long-standing

36:12

point of confusion even within the

36:14

academic Community which is

36:16

um people don't really understand why a

36:19

larger model that seems to even

36:22

um

36:22

like why should it not be the case

36:25

that a smaller neural network with fewer

36:26

parameters generalizes better I think

36:28

people have taken

36:30

um

36:31

like principles from like when they

36:33

study linear models and they were

36:34

regularized to have like less parameters

36:36

and there was some bounds like VC bounds

36:39

on

36:40

um

36:41

generalization and there was this

36:43

General notion of like less parameters

36:44

is what Occam's razor refers to

36:47

um one perspective this helps is a like

36:50

I think it frees up our mind of like

36:51

what is the actual objective that we

36:53

should expect to optimize towards that

36:56

will actually get us the thing we want

36:57

which is better generalization so for me

37:00

that's the most important one even on

37:02

Twitter I see it like professors in

37:05

machine learning occasionally you'll see

37:07

like they'll say some like smaller

37:08

models are more intelligent than larger

37:10

models kind of it's kind of almost like

37:12

a weird

37:13

um

37:14

um Motif that is not very rigorous so I

37:17

think one thing that's useful about this

37:19

argument is there's a pretty like

37:22

like strong like mathematical link all

37:24

the way down it goes like it starts at

37:26

Solomonoff's theory of induction which is

37:28

proven and then we have like a actual

37:31

mathematical link to an objective and

37:35

then

37:36

yeah it kind of like to lossless

37:38

compression and then it all kind of

37:39

links up so

37:41

um yeah I think another example would

37:43

even be like this this very I think it's

37:45

a great article but like the Ted Chiang

37:46

article on uh lossy compression which

37:49

people haven't read I still recommend

37:50

reading I think

37:52

once you're not quite in a world where

37:54

like you have like a well-justified uh

37:57

motivation for doing something then

37:59

there's like lots of kind of confusion

38:01

about whether or not this whole approach

38:03

is even reasonable

38:04

um yeah so I think for me a lot of it's

38:07

about guidance but then on a more

38:09

practical level

38:10

um there are things that you can do that

38:11

would essentially kind of break uh you

38:13

would stop doing compression and you

38:15

might not notice it and then I think

38:17

this also guides you to like not do that

38:19

and I'll give you one example which is

38:21

something I've worked on personally

38:22

which is retrieval so for retrieval

38:24

augmented language models you can maybe

38:26

retrieve your whole training set and

38:28

then use that to try and improve your

38:30

predictions as you're going through now

38:32

if we think about compression one thing

38:34

that you can't do one thing that would

38:35

essentially cheating would be allow

38:37

yourself to retrieve over like future

38:39

tokens that you have not seen yet

38:41

um if you do that it's obvious like um

38:44

it might not be obvious immediately

38:45

because it was a tricky setup but in my

38:47

kind of like Satya Sundar encoding

38:50

decoding setup if you had some system

38:52

which can look to the Future that just

38:53

like won't work with that encoding

38:55

decoding setup and it also essentially

38:58

is cheating and

39:00

um

39:01

yeah so I think

39:02

essentially it's something which would

39:04

it could help your like test set

39:06

performance it might even make your

39:07

training loss look smaller but it

39:09

actually didn't improve your compression

39:11

and potentially you could fool yourself

39:12

into

39:14

um into like expecting a much larger

39:16

performance Improvement than you end up

39:17

getting in practice so I think sometimes

39:20

like you can help yourself

39:22

try and like set yourself up for

39:23

something that should actually

39:24

generalize better and do better on

39:25

Downstream evals than

39:28

um by kind of like thinking about this

39:31

kind of training objective
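A minimal sketch of the constraint being described: a model's log-loss only counts as lossless compression if each prediction conditions strictly on past tokens (via arithmetic coding, the codelength is the sum of negative log-probabilities). The toy model and its probabilities below are invented purely for illustration.

```python
import math

def codelength_bits(tokens, prob_fn):
    """Total bits to encode `tokens` when the i-th prediction may only
    condition on tokens[:i] -- the causal constraint that makes log-loss
    equal a lossless codelength under arithmetic coding."""
    total = 0.0
    for i in range(len(tokens)):
        p = prob_fn(tokens[:i], tokens[i])  # context is strictly the past
        total += -math.log2(p)
    return total

# Made-up toy model over the alphabet {a, b, c, d}: it bets on a repeat
# of the last symbol (0.7) and spreads 0.1 on each other symbol.
def toy_prob(past, nxt):
    if not past:
        return 0.25
    return 0.7 if nxt == past[-1] else 0.1

bits = codelength_bits(list("aabbab"), toy_prob)
```

If `prob_fn` were allowed to peek at `tokens[i:]` (the retrieval-over-future-tokens trap), the number would shrink, but a decoder could never reproduce those probabilities, so it would no longer be a compression measurement.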

39:33

I see it also probably informs the type

39:37

of architectures you want to try because

39:39

if you're uh I think that comment

39:42

about like the size of the code being

39:43

important was really interesting

39:45

because if you need you know 17

39:47

different layers and uh

39:50

a different module in every

39:52

layer or something that kind of

39:54

increases the amount of information that

39:56

you need to communicate over

39:59

um yeah yeah

40:02

um it can be I could imagine on that

40:04

note like right now our setup is

40:07

essentially the prior information we put

40:09

into neural networks it's actually kind

40:10

of minuscule really and obviously

40:13

um with biological beings we have like

40:16

DNA we have like priors as like kind of

40:18

stored information which is is at least

40:20

larger than really what um the kind of

40:23

priors that we put into um

40:25

our neural networks I mean one thing

40:27

when I was first going through this I

40:29

was thinking maybe there should be more

40:31

kind of learned information that we

40:32

transfer between neural networks more of

40:35

a kind of like DNA

40:37

um and maybe like I mean we initialize

40:39

neural networks right now essentially

40:40

like gaussian noise with some a few

40:42

properties but like maybe if there was

40:44

some kind of like learned initialization

40:45

that we distill over many many different

40:46

types of ways of training neural

40:47

networks that wouldn't add to our size

40:50

of f too much but it might like mean

40:51

learning is just much faster so yeah

40:53

hopefully also the perspective might

40:55

like kind of spring out kind of

40:56

different and unique and creative like

40:58

themes of research

41:01

okay

41:02

um there there's another interesting

41:04

question from the class about the uses

41:06

of this kind of compression angle

41:09

um and the question is uh could could

41:12

the compression be good in some way by

41:14

allowing us to gain like what sorts of

41:16

higher level understanding or Focus

41:18

um on the important signal in the data

41:20

might we be able to get from the

41:23

um uh from from the lossy compression so

41:26

if we could like for example better

41:28

control the information being lost would

41:30

that allow us to gain any sort of higher

41:32

level understanding

41:34

um about kind of what what's important

41:35

in the data

41:38

um

41:40

so I think

41:44

that there is like a theme of research

41:46

trying to

41:48

um use essentially just like

41:51

the compressibility of data as at least

41:54

as a proxy for like quality

41:56

so that's one like very concrete theme

41:58

uh like

42:00

I mean this is pretty standard

42:02

pre-processing trick but

42:04

if your like data is just incompressible

42:06

with a very simple text compressor like

42:08

gzip as a data preprocessing tool then

42:11

maybe it's just like kind of random

42:12

noise and maybe you don't want to spend

42:13

any compute training or a large

42:15

Foundation model over it similarly I

42:18

think there's been

42:19

pieces of work there's a paper from 2010

42:22

that was like intelligent selection of

42:23

language model training data or

42:25

something by Moore and Lewis and in that

42:27

one they look at

42:29

um they're trying to like select

42:30

training data that will be maximally

42:32

useful

42:33

um

42:33

for some Downstream tasks and

42:35

essentially what they do is they look at

42:37

like what data is best compressed

42:41

um when going from just like a regular

42:44

pre-trained language model to one that's

42:46

been specialized on that Downstream task

42:48

and they use that as a metric for data

42:49

selection they found that's like a very

42:50

good way of like selecting your data if

42:53

you just care about

42:56

training on a subset of your

42:57

pre-training data for a given Downstream

42:59

task so I think there's been some yeah

43:02

some kind of

43:03

sign of life in that area
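Both ideas just mentioned — compressibility as a cheap noise filter, and Moore-and-Lewis-style cross-entropy-difference selection — can be sketched with toy stand-ins. The corpora and thresholds below are made up for illustration; the 2010 paper uses real n-gram language models rather than these smoothed unigram counts.

```python
import math
import os
import zlib
from collections import Counter

def compression_ratio(data: bytes) -> float:
    """Compressed size / raw size under DEFLATE (what gzip uses); a ratio
    near or above 1.0 suggests noise-like data not worth training on."""
    return len(zlib.compress(data, 9)) / len(data)

def unigram_bits_per_token(tokens, counts, vocab_size, alpha=1.0):
    """Average codelength in bits under an add-alpha smoothed unigram model."""
    total = sum(counts.values())
    return sum(-math.log2((counts[t] + alpha) / (total + alpha * vocab_size))
               for t in tokens) / len(tokens)

# Hypothetical corpora standing in for a general LM and a domain-adapted one.
general = Counter("the quick brown fox jumps over the lazy dog".split())
in_domain = Counter("gradient descent minimizes the training loss".split())
vocab = len(set(general) | set(in_domain))

def moore_lewis_score(sentence: str) -> float:
    """Cross-entropy difference: positive when the in-domain model
    compresses the sentence better, i.e. it is worth selecting."""
    toks = sentence.split()
    return (unigram_bits_per_token(toks, general, vocab)
            - unigram_bits_per_token(toks, in_domain, vocab))

# gzip-style filter: repetitive text passes, random bytes do not.
natural = b"the cat sat on the mat. " * 40
noise = os.urandom(1024)
keep = [d for d in (natural, noise) if compression_ratio(d) < 0.9]
```

The filter keeps `natural` and drops `noise`, and in-domain sentences get positive Moore-Lewis scores while generic ones go negative.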

43:08

um so uh one interesting question from

43:12

uh from the class

43:14

um so uh kind of related to uh I guess

43:18

how we code the models versus how

43:21

they're actually executed yeah

43:23

um so uh so obviously when we write our

43:26

python code especially you know in

43:27

PyTorch it all gets compiled down to

43:29

like CUDA kernels and whatnot

43:32

um so how does that kind of like affect

43:34

uh your your understanding of how like

43:39

how much information is actually like in

43:41

in this code like do you have to

43:43

take into account like the 17 different

43:45

CUDA kernels that you're running through

43:46

throughout the run yeah

43:48

this is a great question uh so um I

43:50

actually oh yeah I've got to mention

43:52

that in the talk but basically I do have

43:54

a link in the slides if the slides

43:55

eventually get shared there is a link

43:56

but I am basing

43:59

um what was quite convenient was there

44:01

is a Transformer codebase called NNCP

44:04

which is like no dependencies on

44:05

anything it's just like a I think a

44:08

single C++

44:10

self-contained Library which builds a

44:12

Transformer and trains it and has a few

44:14

tricks in it like it has drop out has

44:16

like data shuffling things and that is

44:18

like 200 kilobytes like whole

44:20

self-contained so that is a good like

44:23

I'm using that as a bit of a proxy

44:25

obviously the size of f is kind of

44:28

hard to

44:29

know for sure

44:31

um it's easy to overestimate like if you

44:34

um packaged up your like python code

44:37

like and you're using pi torch or

44:39

tensorflow it's going to import all

44:40

these libraries which aren't actually

44:41

relevant you'll you might have like

44:43

something really big you might have like

44:44

hundreds of megabytes for a gigabyte of

44:46

all this like packaged stuff together

44:48

and you might think oh therefore the

44:51

description of my Transformer is actually

44:52

like you know hundreds of megabytes so

44:54

I'm just it was convenient that someone

44:56

specifically tried to

44:59

um find out how small we can make this

45:00

and they did it by building it

45:03

um from scratch eventually
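The NNCP figure is a proxy for the two-part description length: bits to send a self-contained training program f, plus the model's total log-loss on the data. A sketch of that arithmetic, where the program text and the corpus codelength are hypothetical placeholders, shows why a ~200 KB trainer barely matters:

```python
import zlib

def description_length_bits(program_source: bytes,
                            data_codelength_bits: float) -> float:
    # Two-part MDL cost: bits for the program f (approximated here by its
    # DEFLATE-compressed size) plus the model's log-loss on the dataset.
    return 8 * len(zlib.compress(program_source, 9)) + data_codelength_bits

# Hypothetical stand-ins: a tiny trainer source, and a corpus codelength of
# 8e11 bits (e.g. roughly 1 bit/byte over ~100 GB of text).
source = b"def train(data): ...\n" * 100   # placeholder for a real trainer
total = description_length_bits(source, data_codelength_bits=8e11)
program_share = (total - 8e11) / total
```

Under these assumed numbers the program's share of the total description length is vanishingly small, which is why a few hundred kilobytes of code (or even a few megabytes of imports) barely moves the compression argument.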

45:05

cool

45:07

um we also had a question about the

45:09

Hutter Prize

45:10

um which I believe you you had something

45:12

in your side so the question is uh so it

45:15

appears that our largest language models

45:17

can now compress things better than

45:19

um than the best Hutter

45:21

Prize entries so your question is is this

45:23

challenge still relevant

45:25

um yeah could you actually use the

45:27

algorithm that you suggest

45:29

um for the Hutter Prize yeah I'll

45:31

tell you exactly

45:32

um I mean this is something I've talked

45:33

with Marcus Hutter about the Hutter

45:35

Prize is like actually asking people

45:37

to do exactly the right thing but the

45:38

main issue was they it was focused on

45:41

compressing quite a small amount of data

45:43

and that amount of data was

45:45

fixed 100 megabytes now a lot of this

45:48

kind of perceptual roadmap is like

45:49

there's been a huge amount of benefit in

45:51

increasing

45:53

the amount of data and compute

45:55

simultaneously

45:57

um and that and and by doing that we're

46:00

able to like continue like this training

46:02

loss curve is like getting lower you're

46:03

like

46:04

um your compression rates improving so

46:08

I would say the prize itself has not

46:11

um has just not been fruitful in like

46:14

actually promoting compression and

46:17

instead what ended up being the

46:18

breakthrough was kind of like BERT slash

46:21

GPT-2 which I think

46:24

um it's steered people to the benefit of

46:26

simultaneously essentially adopting this

46:28

workflow without necessarily naming it

46:30

compression

46:32

um I think yeah I think the Benchmark

46:33

just due to the compute limitations it

46:36

also requires it's very like outdated

46:37

something that needs like a maximum of

46:40

100 CPUs or something for like 48 hours

46:43

so I think essentially it didn't end up

46:45

creating an amazing like AI algorithm

46:47

but it was just because it really

46:49

underestimated the benefit of compute

46:51

like compute memory all that stuff it

46:54

turns out that's a big part of the story

46:56

of building powerful models so does that

47:00

reveal something about our current large

47:02

data sets that you kind of need to see

47:04

all this data before you can start

47:06

compressing the rest of it well yeah I

47:09

think well the cool thing is like

47:12

because the compression is the integral

47:14

in theory if you could have some

47:15

algorithm which could learn faster

47:16

like initially that would actually have

47:19

better compression and it would be

47:21

something that you would expect as a

47:24

result therefore that would suggest it

47:25

would kind of be a more intelligent

47:26

system and yeah I think like having

47:28

better data efficiency

47:30

is something we should really think

47:33

about strongly and I think there's

47:34

actually quite a lot of potential core

47:36

research to try and learn more from less

47:39

data uh and right now we're in

47:42

especially a lot of the big Labs I mean

47:44

there's a lot of data out there to to

47:46

kind of collect so I think maybe people

47:48

have just prioritized for now like oh it

47:50

feels like it's almost kind of

47:51

like an endless well of data so we just

47:53

keep adding more data but then I think

47:55

there's without a doubt going to be a

47:56

lot more research focused on making more

47:58

of the data that we have

48:00

right

48:01

I wonder if you can speculate a little

48:03

bit about what this starts to look like

48:05

in I guess images and video I think you

48:09

had a slide or two at the end where

48:12

um well like as you mentioned that uh if

48:15

your data is not super g-zippable

48:18

um then that maybe there's a lot of

48:20

noise and uh I believe

48:22

um and and my intuition may be wrong but

48:24

I believe that images and or certainly

48:28

images they appear to be a lot

48:30

larger than

48:33

um than text so that doesn't have

48:36

these properties I've got a few useful

48:38

thoughts on this okay so one is we

48:41

currently have a huge limitation in our

48:42

architecture which is a Transformer or

48:45

even just like a deep ConvNet and that

48:47

is that the architecture does not adapt

48:50

in any way to the information content of

48:53

its inputs so what I mean by that is if

48:56

you have

48:57


48:58

um

48:59

even if we have a byte level sequence of

49:03

text data but we just represent it as

49:04

the bytes of UTF-8 and then instead we

49:07

have a BPE tokenized sequence and it

49:09

contains the exact same information but

49:11

it's just 4X shorter sequence length uh

49:14

the Transformer will just spend four

49:17

times more compute on the byte level

49:19

sequence if it was fed it and it'll

49:21

spend four times less on the BPE

49:23

sequence if it was fed that even though

49:24

they have the same information content

49:26

so we don't have some kind of algorithm

49:28

which could like kind of fan out and

49:31

then just like process the byte level

49:33

sequence with the same amount of

49:35

approximate compute

49:37

and I think that really hurts images

49:39

like if we had some kind of architecture

49:40

that could quite gracefully try and like

49:43

think at the frequency that's useful

49:46

for it uh no matter whether it's looking at

49:49

high definition image or quite a low

49:50

definition image or it's looking at 24

49:52

kilohertz audio or 16 kilohertz audio

49:55

just like we do I think we're very

49:57

graceful with things like that we have

50:00

kind of

50:01

like very like selective attention-based

50:03

Vision we are able to like process audio

50:06

and kind of we're able to like have a

50:09

kind of our own internal kind of

50:10

thinking frequency that works for us and

50:13

this is just something that's like a

50:14

clear limitation in our architecture so

50:17

yeah right now if you just model pixel

50:19

level with a Transformer it's very wasteful

50:20

and it's not something

50:22

um that's like the optimal thing to do

50:24

right now but given there's a clear

50:26

limitation on our architecture it's

50:28

possible it's still the right thing to

50:29

do it's just we need to figure out how

50:30

to do it efficiently
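A rough sketch of the compute asymmetry being described, using illustrative constants for a Transformer forward pass (a term linear in sequence length for the projections and MLPs, and a quadratic term for attention; the shapes and the 4x token ratio are assumptions, not measurements):

```python
def transformer_flops(seq_len: int, d_model: int, n_layers: int) -> float:
    # Rough per-forward-pass cost: linear-in-length projections/MLP term
    # plus quadratic attention term (constants are illustrative only).
    mlp = n_layers * seq_len * 8 * d_model ** 2
    attn = n_layers * seq_len ** 2 * d_model
    return mlp + attn

# Same text, assumed ~4x more tokens at byte level than after BPE.
f_bpe = transformer_flops(512, d_model=1024, n_layers=12)
f_byte = transformer_flops(2048, d_model=1024, n_layers=12)
ratio = f_byte / f_bpe
```

With these assumed shapes the byte-level pass costs a bit under 5x the BPE pass; as attention's share of compute grows with context length, the penalty heads toward 16x, even though the information content is identical.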

50:33

so does that suggest that a model that

50:35

could

50:36

um you know switch between different

50:37

resolutions uh like at the one token and

50:41

time resolution that's important for

50:42

text versus the

50:44

um I don't know I think you mentioned

50:45

you know the 24 kilohertz of audio does

50:48

that suggest that a module that a model

50:50

like that would uh be able to compress

50:53

like different modalities better

50:56

um and have you know higher sensory yeah

51:00

that's I think it's it would be crazy to

51:02

write it off at this stage anyway I

51:04

think a lot of people assume like oh

51:06

pixel level modeling it just doesn't

51:08

make sense on some fundamental level but

51:10

it's hard to know that whilst we still

51:12

have a big uh kind of fundamental

51:15

blocker with our best architecture so

51:18

yeah I think it's I wouldn't write it

51:20

off anyway

51:22

so Michael is slacking me he wants me to

51:24

ask if you follow the S4 line of work

51:26

yeah

51:27

yeah I think that's a really important

51:29

architecture

51:30

sorry go on

51:32

yeah I I was just uh so S4 uh so okay so

51:36

I guess for those

51:37

listening S4 has a property where

51:40

um it's it was designed explicitly for

51:41

long sequences

51:43

um and one of the uh early uh set of uh

51:48

you know driving applications was this

51:50

pixel by pixel image

51:52

classification

51:54

um sequential CIFAR uh that they called

51:56

it

51:57

um and uh one of the interesting things

52:00

that S4 can do is actually switch from

52:03

um the these different uh resolutions by

52:07

um uh by changing essentially

52:11

the parameterization a little bit

52:13

um

52:14

so does that suggest you that like

52:17

something like S4 or something with a

52:20

different

52:21

um you know encoding would uh would have

52:25

these like implications for I don't know

52:27

being more intelligent or or being a

52:29

better compressor of these other

52:31

modalities or something like that yeah

52:33

so like on a broad brushstroke like S4

52:36

allows you to maybe have a much longer

52:38

context uh than attention without paying

52:41

the quadratic compute cost uh there are

52:43

still other I don't think it solves

52:45

everything but I think it seems like a

52:47

very very promising like piece of

52:50

architecture development
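The resolution-switching property mentioned in the question can be sketched with a scalar state-space model. S4 itself uses structured state matrices and a more careful discretization, so this only shows the underlying idea: the sampling rate enters as a discretization step `dt`, and changing `dt` re-targets the same continuous system to a different resolution.

```python
import math

def discretize(a: float, b: float, dt: float):
    """Zero-order-hold discretization of the scalar SSM x' = a*x + b*u:
    returns (a_bar, b_bar) for x[k+1] = a_bar*x[k] + b_bar*u[k]."""
    a_bar = math.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def run_ssm(us, a: float, b: float, c: float, dt: float):
    """Run the discretized SSM over inputs `us`, emitting y = c*x."""
    a_bar, b_bar = discretize(a, b, dt)
    x, ys = 0.0, []
    for u in us:
        x = a_bar * x + b_bar * u
        ys.append(c * x)
    return ys

# The same 2-second constant signal sampled at two rates: halving dt while
# doubling the sample count describes the same continuous system, so the
# final output matches -- resolution is just a parameter.
a, b, c = -1.0, 1.0, 1.0
coarse = run_ssm([1.0] * 8, a, b, c, dt=0.25)    # 8 steps of 0.25s
fine = run_ssm([1.0] * 16, a, b, c, dt=0.125)    # 16 steps of 0.125s
```

Both runs converge on the analytic value 1 - e^-2 at t = 2, which is the sense in which an SSM can "switch resolutions by changing the parameterization" where a vanilla Transformer cannot.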

52:52

um I think other parts are like even

52:54

within your MLP like linears in your

52:56

MLPs which are actually for a large

52:58

language model most of your compute

53:00

um you really want to be spending well

53:03

I'm saying I don't know this for sure

53:05

but it feels like there should be a very

53:07

non-uniform allocation of compute uh

53:09

depending on what is easy to think about

53:11

what it's hard to think about

53:12

um and so yeah if there's a more natural

53:15

way of

53:16

there was a cool paper called CALM which

53:19

uh it was about early exiting like

53:22

essentially when neural network or some

53:24

intermediate layer feels like it's

53:26

it's done enough compute and it can now

53:28

just like skip all the way to the end

53:30

that was kind of an idea in that regime

53:32

but like this kind of adaptive compute

53:33

theme I think it could be a really

53:35

really big

53:36


53:37

um

53:38

like

53:39

breakthrough towards this if we think of

53:41

our own thoughts it's like very it's

53:43

very sparse very non-uniform

53:45

and uh you know maybe some of that stuff

53:47

is written in From Evolution but but

53:49

yeah having like this incredibly

53:51

homogenous uniform compute for every

53:53

token uh it doesn't quite feel right so

53:56

yeah I think S4 is very cool I think it

53:57

could be could help in this direction

53:59

for sure
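A simplified sketch of the early-exit idea discussed here: walk the intermediate layers' predictions and stop as soon as one is confident enough. The confidence rule and the logits are invented for illustration; CALM's actual exit criteria and calibration are more involved.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_predict(layer_logits, threshold=0.9):
    """Return (depth used, predicted index): stop at the first layer whose
    top softmax probability clears `threshold`, else use the last layer.
    A toy confidence criterion in the spirit of CALM, not its actual one."""
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return depth, best      # skip the remaining layers' compute
    return len(layer_logits), best  # fell through: last layer's answer

# Hypothetical intermediate logits from a 3-layer stack: the model is
# already confident at layer 2, so layer 3 is skipped for this token.
layers = [[0.1, 0.2, 0.0], [0.2, 5.0, 0.1], [0.1, 6.0, 0.2]]
depth, token = early_exit_predict(layers)
```

Easy tokens exit shallow and hard tokens use the full stack, which is one concrete form of the non-uniform, adaptive compute allocation being argued for.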

54:00

interesting uh we did get one more

54:03

question from the class that I wanted to

54:05

get your opinion on so the question is

54:07

do you think compression research for

54:09

the sake of compression uh is important

54:11

for these I guess for these like

54:13

intelligence implications

54:16

um reacting a little bit to the comments

54:17

on the Hutter Prize

54:19

um and it sounds like the compression

54:21

capabilities of the foundation models

54:22

are kind of byproducts instead of the

54:25

primary goal when training them

54:27

yeah so this is what I think I think um

54:30

the compression objective is the only

54:32

training objective that I know right now

54:34

uh which is completely non-gameable and

54:37

has a very rigorous Foundation of why it

54:40

should help us create better and better

54:42

generalizing agents and a better

54:43

perceptual system

54:45

however we should be continually

54:48

evaluating models based on their

54:51

capabilities which is fundamentally what

54:52

we care about and so the compression

54:55

like metric itself is one of the most

54:58

like harsh alien metrics you can look at

54:59

it's just a number that means almost

55:01

nothing to us and actually just as that

55:04

number goes down like say or should I

55:07

say the compression rate goes up or the

55:09

kind of bits per character say go down

55:11

it's very unobvious what's going to

55:13

happen

55:14

um so you have to have other evals where

55:16

we can try and like predict the

55:17

emergence of new capabilities or track

55:19

them because those are the things that

55:20

fundamentally people care about uh but I

55:23

think people that either do research in

55:26

this area or study at a

55:28

university as prestigious as Stanford

55:30

should have a good understanding of why

55:33

all of this makes sense

55:35

um but I still but I do think yeah that

55:37

doesn't necessarily means it needs to

55:39

completely govern everything about this

55:41

every piece of research and doing research for

55:42

the compression itself I don't think

55:44

it's necessarily the right way to think

55:46

about it

55:47

um yeah hopefully that answers that

55:49

question

55:50

I wonder if

55:52

um that has implications for things like

55:54

training for more than one Epoch uh I

55:57

think somehow the field recently has

56:01

um uh arrived at the idea that you

56:03

should only you know see all your

56:04

training data once

56:06

um yeah I've got response to that so

56:08

actually training for more than one

56:09

Epoch is not um it's not like if you do

56:13

it literally yeah then it doesn't really

56:15

make sense from a compression

56:16

perspective because once you've finished

56:18

your epoch you can't count the log loss

56:21

of your second Epoch towards your

56:22

compression objective because a very

56:24

powerful model by that point if you did

56:26

it could just like say use

56:28

retrieval to memorize everything

56:29

you've seen and then it's just going to

56:31

get perfect performance from then on

56:32

that obviously is not creating a more

56:34

intelligent system but it might like

56:36

it'll minimize your training loss and make

56:38

you feel good about yourself

56:40

um so at the same time yeah training is

56:42

more than one Epoch can give you better

56:43

generalization what's happening

56:45

um

56:46

I think the way to think about it is the

56:49

ideal setup would be like in RL you have

56:51

this experience replay so you're going

56:52

through you're going through your Epoch

56:54

in theory like all you can count towards

56:56

your like compression score is your

56:58

prediction for the next held out piece

56:59

of training data but there's no reason

57:01

why you couldn't then actually go back

57:03

and like spend more SGD steps on like

57:06

past data so I think in the compression

57:08

setup multi Epoch just looks like replay

57:10

essentially now in practice I think just

57:13

pragmatically it's easier to just train

57:14

with multiple epochs

57:16

um you know so yeah I think I just want

57:19

to clear up like compression does not

57:21

it's not actually synonymous with only

57:22

training for one Epoch because you can

57:24

still do replay and essentially see your

57:25

data multiple times but it basically

57:27

says you can only like

57:29

score yourself for all of the

57:31

predictions which will let your next

57:32

batch of data be held-out data that's

57:35

the only thing that's the fair thing

57:36

to score yourself on

57:38

hopefully
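The replay-versus-scoring distinction just described can be sketched in a few lines: every batch is scored before the model trains on it (prequential evaluation), and replayed past batches update the model without ever adding to the score. The count-based model below is a hypothetical stand-in for SGD on a network.

```python
import math
from collections import Counter

class CountModel:
    """Add-one-smoothed unigram over a fixed vocab, updated online."""
    def __init__(self, vocab):
        self.vocab = set(vocab)
        self.counts = Counter()
    def prob(self, tok):
        total = sum(self.counts.values())
        return (self.counts[tok] + 1) / (total + len(self.vocab))
    def update(self, toks):
        self.counts.update(toks)

def prequential_bits(batches, model, replay=True):
    """Score each batch *before* training on it -- only those held-out
    predictions count toward compression -- then optionally replay all
    past batches: replay changes the model but never the score."""
    seen, bits = [], 0.0
    for batch in batches:
        bits += sum(-math.log2(model.prob(t)) for t in batch)
        model.update(batch)          # now the model may learn from it
        seen.append(batch)
        if replay:                   # multi-epoch-style extra passes
            for old in seen:
                model.update(old)
    return bits
```

Replay alters what the model predicts on later held-out batches (so it can change the score indirectly, for better or worse), but second-epoch log-loss on already-seen data is never itself counted, which is the distinction between legitimate replay and the memorization shortcut.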

57:40

so we're nearing the end of the hour so

57:43

I wanted to just give you a chance uh if

57:45

there's anything

57:47

um you know that you're excited about

57:48

coming out uh anything in the pipeline

57:50

that that you wanted to talk about and

57:52

just wanted to give you a chance to kind

57:54

of give a preview of what may be next in

57:56

this area uh and kind of uh what's

57:59

coming up and exciting for you

58:04

um

58:05

um okay

58:06

well

58:09

I think 2023

58:12

it doesn't need me to really sell it

58:14

very much I think it's going to be

58:15

pretty much like every week something

58:16

amazing is going to happen so

58:19

um if not every week then every two

58:21

weeks the pace of innovation right now

58:23

I'm sure as you're very aware is pretty

58:25

incredible I think there's going to be

58:27

lots of stuff

58:28

amazing stuff coming out from companies

58:31

in the Bay Area such as open AI uh and

58:34

around the world in in Foundation models

58:37

both in the development of stronger ones

58:40

but also this incredible amount of

58:42

Downstream research that there's just

58:44

such a huge community of people using

58:46

these things now tinkering with them

58:47

exploring capabilities so yeah I feel

58:50

like we're kind of in a in a cycle of

58:53

mass

58:55

um Innovation so I think yeah it's just

58:58

strap in and try not to get too

59:00

overwhelmed

59:02

yeah

59:03

it's a bit it's looking to be a very

59:06

exciting year

59:07

absolutely

59:09

right

59:10

um yeah so that brings us to the end of

59:13

the hour so I wanted to thank you Jack

59:14

again for coming on it was a very

59:16

interesting talk uh thanks of course

59:18

everybody who's listening online and in

59:20

the class for for your great questions

59:22

um this Wednesday we're gonna have

59:24

Susan Zhang from Meta she's going to be

59:25

talking a little bit about the trials of

59:27

training uh OPT 175 billion so that

59:30

would be very interesting for us to uh

59:33

to talk to her and hear about

59:35

um if you want to you can go to our

59:36

website

59:38

mlsys.stanford.edu to see the rest of

59:40

our schedule I believe we only have one

59:42

more week left

59:43

um so it's been it's been an exciting

59:45

quarter thank you of course everyone for

59:47

participating

59:49

um and with that we will uh wave

59:51

goodbye and say goodbye to YouTube