Compression for AGI - Jack Rae | Stanford MLSys #76

Stanford MLSys Seminars
27 Feb 202359:53

Summary

TLDR: In episode 76 of the Stanford MLSys seminar series, the focus is on the intriguing intersection of compression and AGI (Artificial General Intelligence), featuring guest speaker Jack Rae from OpenAI. The talk delves into foundation models and their significant role in shaping the future of machine learning, emphasizing the importance of understanding their training objectives, limitations, and potential. Jack Rae presents a detailed exploration of compression as a key to unlocking AGI, discussing generative models as lossless compressors and highlighting the concept of minimum description length. Through this discussion, the seminar sheds light on the dynamics of foundation models, urging the audience to think deeply about their applications and the broader implications for AI research.

Takeaways

  • 😃 Large language models like GPT-3 are state-of-the-art lossless compressors, able to compress data at rates better than traditional algorithms like gzip.
  • 🤔 The minimum description length principle, which aims to find the smallest possible representation of data, has deep philosophical roots and may be key to achieving artificial general intelligence (AGI).
  • 🧐 Training large language models is essentially a process of lossless compression, where the objective is to minimize the number of bits required to encode the training data.
  • 💡 Scaling up model size and training data can lead to better compression and potentially improved generalization, but algorithmic advances beyond just scaling are also important.
  • ⚠️ While compression is a rigorous objective, evaluating models solely on compression metrics may be uninformative, and tracking emergent capabilities is crucial.
  • 🔍 Arithmetic encoding provides a way to losslessly compress data using a language model's predictions, though the process is computationally expensive.
  • ✨ Architectures that can adaptively allocate compute based on input complexity, like the S4 model, may be important for efficiently compressing multi-modal data like images and audio.
  • 🚧 Lossy compression, while related, is distinct from the lossless compression objective and may not lead to better generalization.
  • 🔑 The description length of a model itself (e.g., the code to instantiate it) is typically small compared to the compressed data size, regardless of model scale.
  • 🌱 Future breakthroughs in areas like data efficiency, adaptive compute, and new architectures could lead to further paradigm shifts in compression and generalization capabilities.

Q & A

  • What is the main topic of the talk?

    -The main topic of the talk is compression for artificial general intelligence (AGI), and how techniques like lossless compression using large language models can potentially help in solving perception and generalization problems.

  • Why is the minimum description length principle important according to the speaker?

    -The speaker argues that seeking the minimum description length of data may be an important principle in solving perception and generalizing well; the idea has philosophical roots going back to Aristotle and William of Ockham and a rigorous mathematical formulation in Solomonoff's theory of inductive inference.

  • How are large language models related to lossless compression?

    -The speaker explains that large language models are actually state-of-the-art lossless compressors, as training them involves minimizing the negative log-likelihood over the training data, which is equivalent to lossless compression.

  • Can you explain the example of Satya and Sundar used to illustrate lossless compression?

    -The example involves Satya encoding a dataset using a trained language model and arithmetic coding, and sending the encoded transcripts and model code to Sundar. Sundar can then reconstruct the original dataset by running the code and using arithmetic decoding with the predicted token probabilities.

  • What is the potential recipe for solving perception and moving towards AGI according to the speaker?

    -The recipe is to first collect all useful perceptual information, and then learn to compress it as best as possible with a powerful foundation model, through techniques like scaling data and compute, or algorithmic advances.

  • What is the main limitation of the compression approach mentioned by the speaker?

    -One limitation is that modeling and compressing everything at a low level (e.g., pixels for images) may be computationally expensive and inefficient, so some form of filtering or semantic understanding may be needed first.

  • How does the speaker view the role of reinforcement learning in relation to compression?

    -The speaker notes that while compression is important for observable data, reinforcement learning and on-policy behavior are still crucial for gathering useful information that may not be directly observable.

  • What is the speaker's opinion on the Hutter Prize for lossless compression?

    -The speaker believes that while the Hutter Prize aims to promote compression, it has not been fruitful because it focuses on compressing a small, fixed amount of data, underestimating the benefits of scaling data and compute.

  • How does the compression perspective inform the development of new architectures?

    -The speaker suggests that the compression perspective could inspire research into architectures that can adapt their compute and attention based on the information content of the input, similar to how biological systems allocate resources non-uniformly.

  • What is the speaker's overall view on the importance of compression research?

    -The speaker believes that while the compression objective provides a rigorous foundation for generalization, the primary focus should be on evaluating and tracking the emergence of new capabilities in models, as those are ultimately what people care about.

Outlines

00:00

🎉 Introduction to Compression and AGI Seminar

The Stanford MLSys seminar series introduces a talk by Jack Rae from OpenAI, focusing on compression for Artificial General Intelligence (AGI). The seminar highlights the partnership with CS324 on advances in Foundation Models. Participants are encouraged to engage and ask questions via YouTube chat or Discord. The session promises insightful discussions on the training objectives of foundation models, their limitations, and the significance of compression in the context of AGI.

05:01

📊 Foundation Models and Minimum Description Length

This section delves into the concept of minimum description length (MDL) and its relevance to understanding and improving foundation models. Jack Rae discusses the historical and philosophical underpinnings of seeking the minimum description length for data compression and generalization, referencing Solomonoff's theory of inductive inference. The segment also explores generative models as lossless compressors, highlighting how large language models, despite their size, achieve state-of-the-art lossless data compression.

10:01

🔍 Exploring Lossless Compression with Large Language Models

Jack Rae elucidates the mechanics of lossless compression in large language models through a detailed example involving the LLaMA models. He demonstrates that larger models, such as the 65-billion-parameter version, achieve better compression, suggesting superior generalization capabilities. The talk emphasizes the counterintuitive nature of large language models being efficient lossless compressors and explains the mathematical basis for evaluating their compression efficiency.

15:01

🌐 Arithmetic Encoding and Model Training

The seminar continues with an in-depth discussion of arithmetic encoding as a method for data compression. Through a hypothetical scenario involving two individuals, Satya and Sundar, Rae illustrates how arithmetic encoding and decoding work in tandem with a generative model to achieve lossless compression of a dataset. The example underlines that the compressed size depends not on the size of the neural network but on how accurately the model predicts the next token.

20:02

📈 Towards AGI: The Importance of Compression

Jack Rae outlines a two-step approach towards achieving AGI: collecting useful perceptual information and compressing it efficiently using powerful foundation models. He argues that any research method that improves compression can advance capabilities towards better perception, supporting the idea with examples of how lossless compression aids understanding and generalization. Rae also addresses common confusions regarding lossy versus lossless compression and their implications for neural networks.

25:03

🚀 The Future of Compression in AI Research

In the final part of the seminar, Rae explores potential limitations and future directions for compression in AI research. He touches on practical challenges, such as the computational expense of pixel-level image modeling, and the need for novel architectures that adapt to the informational content of inputs. The discussion concludes with reflections on the integral role of compression in driving advancements in AI and the continued pursuit of algorithmic improvements alongside computational scaling.

Keywords

💡Compression

Compression refers to the process of encoding information using fewer bits or a smaller representation. In the context of this video, compression is presented as a key objective for training large language models and foundation models. The speaker argues that as these models improve, they are essentially providing better lossless compression of the training data, which should in turn enable better generalization to new data. Compression is viewed as a principled approach to achieving artificial general intelligence (AGI) by learning to compress all useful perceptual information.

💡Minimum Description Length

The minimum description length (MDL) principle states that the best model or representation for a given dataset is the one that leads to the most compressed or succinct description of the data. The speaker roots the compression objective in philosophical principles dating back to Aristotle, as well as more recent work by scholars like Solomonoff and Rissanen. MDL is presented as a rigorous foundation for why compression should help create better generalizing agents and improve perception.

💡Generative Models

Generative models are machine learning models that aim to capture the underlying probability distribution of the training data, allowing them to generate new samples that plausibly belong to the same distribution. In this talk, the speaker argues that generative models like large language models are actually state-of-the-art lossless compressors. By accurately modeling the probability distribution of the data, these models can effectively compress the data through techniques like arithmetic coding.

💡Arithmetic Coding

Arithmetic coding is a technique for lossless data compression. It encodes a sequence as a single numeric value inside the unit interval, with more probable symbols assigned larger sub-intervals and therefore requiring fewer bits. The speaker uses arithmetic coding as a thought experiment to demonstrate how a generative model can be used to losslessly compress and decompress data, with the model providing the probability estimates needed for encoding and decoding. This illustrates how language models are optimizing for compression.
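
To make this concrete, here is a minimal sketch of arithmetic coding driven by a predictive model, in the spirit of the Satya-and-Sundar thought experiment from the Q&A above. Everything below is illustrative rather than taken from the talk: the intervals/encode/decode helpers and the toy fixed-probability "model" are assumptions, and the exact-fraction arithmetic is chosen for clarity (a practical coder would use finite-precision integer arithmetic and stream bits incrementally).

```python
import math
from fractions import Fraction

def intervals(probs):
    """Split [0, 1) into sub-intervals, one per token, sized by its probability."""
    ranges, low = {}, Fraction(0)
    for token, p in probs.items():
        ranges[token] = (low, low + p)
        low += p
    return ranges

def encode(tokens, model):
    """Narrow [0, 1) once per token using the model's predicted distribution,
    then emit the shortest binary fraction lying inside the final interval."""
    low, high = Fraction(0), Fraction(1)
    for i, token in enumerate(tokens):
        a, b = intervals(model(tokens[:i]))[token]
        width = high - low
        low, high = low + width * a, low + width * b
    bits = 0
    while True:  # final interval has width prod(p), so this takes ~sum(-log2 p) bits
        bits += 1
        n = math.ceil(low * 2**bits)
        if Fraction(n, 2**bits) < high:
            return n, bits

def decode(n, bits, length, model):
    """Replay the same interval narrowing. This only works because the receiver
    runs an identical, deterministic model, which is the point of the thought experiment."""
    x = Fraction(n, 2**bits)
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        width = high - low
        for token, (a, b) in intervals(model(out)).items():
            t_low, t_high = low + width * a, low + width * b
            if t_low <= x < t_high:
                out.append(token)
                low, high = t_low, t_high
                break
    return out

# Toy "language model": a fixed distribution (a real LM would condition on the prefix).
VOCAB = {"the": Fraction(1, 2), "cat": Fraction(1, 4), "sat": Fraction(1, 4)}
model = lambda prefix: VOCAB

message = ["the", "cat", "sat", "the"]
n, bits = encode(message, model)
assert decode(n, bits, len(message), model) == message
print(bits)  # 5 here, close to the idealized cost sum(-log2 p) = 6 bits
```

The compressed size depends only on the probabilities the model assigns along the way, so a better next-token predictor yields a shorter transcript, and the receiver needs nothing beyond the same deterministic model code to decode it.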

💡Scaling

Scaling refers to the practice of increasing the size or capacity of machine learning models, typically by adding more parameters or increasing the amount of training data and computational resources. The speaker acknowledges that scaling compute, data, and model size has been a major driver of recent progress in large language models and their compression capabilities. However, he also notes that algorithmic advances beyond simply scaling will likely be needed for further paradigm shifts in compression and capabilities.

💡Foundation Models

Foundation models, also known as large language models or generative models, are powerful neural networks trained on vast amounts of data to learn generalizable representations and capabilities. The talk focuses on the role of these models as state-of-the-art compressors and their potential for advancing artificial general intelligence (AGI) by learning to compress all useful perceptual information. Examples discussed include models like GPT-3, LLaMA, and BERT.

💡AGI (Artificial General Intelligence)

Artificial general intelligence (AGI) refers to the development of artificial systems with general intelligence comparable to humans, capable of reasoning, learning, and adapting across a wide range of cognitive tasks. The speaker presents compression as a principled approach for working towards AGI, by collecting useful perceptual information and training powerful foundation models to compress and generalize from this data. AGI is portrayed as the long-term goal that compression research could help achieve.

💡Perception

Perception refers to the ability to acquire, interpret, and understand sensory information and stimuli from the environment. In the context of this talk, the speaker discusses the role of compression in solving perception, which is viewed as a key challenge on the path towards artificial general intelligence (AGI). By learning to compress all useful perceptual information, such as text, images, and audio, foundation models could develop a better understanding and generalization of the world.

💡Lossless Compression

Lossless compression refers to data compression techniques that allow the original data to be perfectly reconstructed from the compressed representation. The speaker argues that generative models like large language models are actually performing lossless compression on their training data, in contrast to lossy compression techniques that discard some information. This lossless compression property is presented as a key feature that enables these models to effectively generalize and understand their training data.

💡Retrieval

Retrieval refers to the ability of a model to access and utilize information from its training data or external sources during inference or generation. The speaker notes that unconstrained retrieval over future data not yet seen during training would be considered cheating from a compression perspective, as it would allow the model to achieve perfect performance without actually compressing and generalizing from the data. Appropriate use of retrieval is discussed as a potential enhancement for compression, but only if done in a principled way.

Highlights

Compression has been an objective that we are generally striving towards as we build better and larger models, which may be counterintuitive given that the models themselves can be very large.

Generative models are actually lossless compressors, and large language models in particular are state-of-the-art lossless compressors, which may be a counterintuitive point to many people.

Ray Solomonoff's theory of inductive inference states that if you have a universe of data generated by an algorithm, and observations of that universe encoded as a dataset, they are best predicted by the smallest executable archive of that dataset, known as the minimum description length.

The size of the lossless compression of a data set can be characterized as the negative log likelihood from a generative model evaluated over the data set, plus the description length of the generative model.
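
Written out symbolically (the notation below is mine, not the talk's), for a dataset D = (x_1, ..., x_T) of tokens and a generative model f whose training code has description length |f|, the quantity described above is:

```latex
|C_f(D)| \;=\; \underbrace{-\log_2 p_f(D)}_{\text{negative log-likelihood, in bits}} \;+\; \underbrace{|f|}_{\text{description length of the model}}
         \;=\; \sum_{t=1}^{T} -\log_2 p_f\!\left(x_t \mid x_{<t}\right) \;+\; |f|
```

Here |f| is the size of the code (and settings) needed to reproduce the model, not its parameter count, which is why it stays small regardless of model scale.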

Generative models like large language models are state-of-the-art lossless compressors, able to compress datasets like the one used to train the 65B parameter LLaMA model by 14x compared to the original data size.
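
As a rough sanity check on that figure, here is a back-of-the-envelope calculation in the spirit of the talk. The token count, the ~4 bytes per token, and the ~1 MB of training code are the approximate numbers quoted in the talk; the average per-token loss is a hypothetical value chosen only so that the arithmetic lands near the ~400 GB compressed size mentioned there, not a number reported for LLaMA.

```python
import math

tokens = 1.4e12                 # ~1.4 trillion training tokens (one epoch)
bytes_per_token = 4             # rough size of one token as raw text
raw_size = tokens * bytes_per_token            # ~5.6e12 bytes, i.e. ~5.6 TB

avg_loss_nats = 1.58            # hypothetical average training loss per token
total_bits = tokens * avg_loss_nats / math.log(2)   # summed log-loss converted to bits
model_code = 1e6                # ~1 MB to describe the training code, not the weights

compressed_size = total_bits / 8 + model_code  # ~4e11 bytes, i.e. ~400 GB
print(raw_size / compressed_size)              # ~14x compression ratio
```

The weights never enter this count: only the summed training loss and the (small) training code do.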

Arithmetic encoding allows mapping a token to a compressed transcript using exactly -log2(p) bits, where p is the model's predicted probability for that token. Arithmetic decoding can recover the original token from the transcript if the probability distribution is known.

Larger models trained for more compute steps tend to achieve better compression, explaining their superior generalization performance despite increased model size.

Retrieval-augmented language models that can look ahead at future tokens would be "cheating" from a compression standpoint and may fool performance metrics without true generalization gains.

Model architectures that can dynamically allocate compute based on information content, similar to how human perception works, could address the inefficiency of current models that spend uniform compute on all inputs.

Pixel-level image and video modeling is very compute-intensive with current architectures but may be viable with architectures that can gracefully process inputs at the appropriate "thinking frequency".

The Hutter prize's small 100MB data limit failed to incentivize meaningful compression research, while the transition to large language models provided a bigger boost.

While compression is a rigorous objective, model capabilities that people fundamentally care about should be continually evaluated alongside compression metrics.

Training for multiple epochs may be justified from a compression perspective if treated as a form of replay, where only predictions on held-out data are scored.
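
One way to read that point: from a compression standpoint, each piece of data is paid for the first time the model predicts it, and replaying it in later epochs is just part of the training procedure. Below is a minimal sketch of that "score first, then train with replay" loop; the `loss_bits` and `train` methods are a hypothetical model interface invented for illustration, not anything described in the talk.

```python
def prequential_code_length(chunks, model, replay_epochs=2):
    """Compress a data stream with a model that is trained as it goes.

    Each chunk is scored (in bits) before the model has ever trained on it,
    so the running total is a valid lossless code length. Re-training on
    earlier chunks for several epochs ("replay") is free from this
    perspective, because those chunks have already been paid for.
    """
    total_bits = 0.0
    seen = []
    for chunk in chunks:
        total_bits += model.loss_bits(chunk)   # pay only for data not yet seen
        seen.append(chunk)
        for _ in range(replay_epochs):         # multi-epoch replay of past data
            model.train(seen)
    return total_bits  # add the size of the model/training code for the full total
```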

S4 and other architectures that enable longer context lengths and adaptive computation could help model different modalities like audio and images more efficiently.

The pace of innovation in foundation models and their applications is incredibly rapid, with amazing developments expected weekly or bi-weekly in 2023.

Transcripts

00:02

hello everyone and welcome to episode 76

00:06

of the Stanford MLSys seminar series

00:08

um today of course we're or this year

00:10

we're very excited to be partnered with

00:12

cs324 advances in Foundation models

00:15

um today I'm joined by Michael say hi

00:19

and Avanika

00:21

um and today our guest is Jack Rae from

00:24

openai and he's got a very exciting talk

00:26

uh prep for us about compression and AGI

00:30

um so so we're very excited to listen to

00:32

him as always if if you have questions

00:34

you can post them in YouTube chat or if

00:36

you're in the class there's that Discord

00:37

Channel

00:38

um so so to keep the questions coming

00:40

and after his talk we will we'll have a

00:42

great discussion

00:43

um so with that Jack take it away

00:47

okay fantastic thanks a lot

00:52

and right

00:56

okay so

00:58

um today I'm going to talk about

01:00

compression for AGI and the theme of

01:02

this talk is that I want people to kind

01:05

of think deeply about uh Foundation

01:09

models and their training objective and

01:12

think deeply about kind of what are we

01:14

doing why does it make sense what are

01:17

the limitations

01:18

um

01:19

this is quite a important topic at

01:22

present I think there's a huge amount of

01:25

interest in this area in Foundation

01:27

models large language models their

01:28

applications and a lot of it is driven

01:31

very reasonably just from this principle

01:33

that it works and it works so it's

01:34

interesting but if we just kind of sit

01:37

within the kind of it works realm it's

01:40

hard to necessarily predict or have a

01:43

good intuition of why it might work or

01:45

where it might go

01:48

so some takeaways that I want so I hope

01:50

people like people hopefully to take

01:52

from this talk some of them are

01:54

quite pragmatic so I'm going to talk

01:57

about some background on the minimum

01:58

description length and why it's seeking

02:01

the minimum description length of our

02:03

data may be an important role in solving

02:05

perception uh I want to make a

02:08

particular point that generative models

02:10

are actually lossless compressors and

02:12

specifically large language models are

02:15

actually state of the art lossless

02:16

compressors which may be a

02:19

counter-intuitive point to many people

02:20

given that they are very large and use a

02:23

lot of space and I'm going to unpack

02:25

that

02:26

in detail and then I'm also going to

02:29

kind of end on some notes of limitations

02:32

of the approach of compression

02:35

so

02:37

let's start with this background minimum

02:38

description length and why it relates to

02:40

perception so

02:42

even going right back to the kind of

02:44

ultimate goal of learning from data we

02:48

may have some set of observations that

02:50

we've collected some set of data that we

02:52

want to learn about which we consider

02:55

this small red circle

02:57

and we actually have a kind of a

03:00

two-pronged goal we want to learn like

03:02

uh how to kind of predict and understand

03:05

our observed data with the goal of

03:09

understanding and generalizing to a much

03:10

larger set of Universe of possible

03:12

observations so we can think of this as

03:16

if we wanted to learn from dialogue data

03:19

for example we may have a collection of

03:21

dialogue transcripts but we don't

03:23

actually care about only learning about

03:25

those particular dialogue transcripts we

03:27

want to then be able to generalize to

03:29

the superset of all possible valid

03:31

conversations that a model may come

03:33

across right so

03:36

what is an approach what is a very like

03:38

rigorous approach to trying to learn to

03:41

generalize well I mean this has been a

03:43

philosophical question for multiple

03:45

thousands of years

03:47

um

03:48

and even actually kind of fourth century

03:51

BC uh there's like some pretty good

03:53

um principles that philosophers are

03:56

thinking about so Aristotle had this

03:59

notion of

04:00

um

04:02

assuming the super superiority of the

04:04

demonstration which derives from fewer

04:06

postulates or hypotheses so this notion

04:09

of uh we have some

04:11

um

04:12

um simple set of hypotheses

04:15

um

04:16

then this is probably going to be a

04:18

superior description of a demonstration

04:21

now this kind of General kind of simpler

04:23

is better

04:25

um

04:26

theme is more recently attributed to

04:29

William of Ockham 14th century Occam's razor this

04:33

is something many people may have

04:34

encountered during a machine learning or

04:36

computer science class

04:38

he is essentially continuing on this

04:40

kind of philosophical theme the simplest

04:42

of several competing explanations is

04:44

always likely likely to be the correct

04:46

one

04:47

um now I think we can go even further

04:50

than this within machine learning I

04:52

think right now Occam's razor is almost

04:54

used to defend almost every possible

04:56

angle of research but I think one

04:58

actually very rigorous incarnation of

05:00

what comes Razer is from race Island's

05:04

theory of inductive inference 1964. so

05:06

we're almost at the present day and he

05:08

says something quite concrete and

05:09

actually mathematically proven which is

05:11

that if you have a universe of data

05:13

which is generated by an algorithm and

05:15

observations of that universe so this is

05:17

the small red circle

05:19

encoded as a data set are best predicted

05:21

by the smallest executable Archive of

05:23

that data set so that says the smallest

05:25

lossless prediction or otherwise known

05:28

as the minimum description length so I

05:30

feel like that final one is actually

05:31

putting into mathematical and quite

05:33

concrete terms

05:34

um these kind of Notions that existed

05:37

through time in philosophy

05:38

and it kind of we could even relate this

05:40

to a pretty I feel like that is a quite

05:43

a concrete and actionable retort to this

05:46

kind of

05:47

um quite

05:48

um murky original philosophical question

05:51

but if we even apply this to a

05:52

well-known philosophical problem Searle's

05:54

Chinese room thought experiment where there's

05:57

this notion of a computer program or

05:58

even a person kind of with it within a

06:01

room that is going to perform

06:02

translation from English English to

06:05

Chinese and they're going to

06:07

specifically use a complete rulebook of

06:10

all possible

06:12

inputs or possible say English phrases

06:15

they receive and then and then the

06:16

corresponding say Chinese translation

06:18

and the original question is does this

06:20

person kind of understand how to perform

06:22

translation uh and I think actually this

06:24

compression argument this race on this

06:26

compression argument is going to give us

06:28

something quite concrete here so uh this

06:31

is kind of going back to the small red

06:32

circle large white circle if if we have

06:35

all possible translations and then we're

06:38

just following the rule book this is

06:39

kind of the least possible understanding

06:41

we can have of translation if we have

06:42

such a giant book of all possible

06:44

translations and it's quite intuitive if

06:46

we all we have to do is coin a new word

06:49

or have a new phrase or anything which

06:50

just doesn't actually fit in the

06:52

original book this system will

06:54

completely fail to translate because it

06:56

has the least possible understanding of

06:58

translation and it has the least

06:59

understandable version of translation

07:02

because that's the largest possible

07:03

representation of the the task the data

07:06

set however if we could make this

07:08

smaller maybe we kind of distill

07:12

sorry we distill this to a smaller set

07:13

of rules some grammar some basic

07:15

vocabulary and then we can execute this

07:17

program maybe such a system has a better

07:19

understanding of translation so we can

07:21

kind of grade it based on how compressed

07:23

this rulebook is and actually if we

07:24

could kind of compress it down to the

07:27

kind of minimum description like the

07:28

most compressed format the task we may

07:30

even argue such a system has the best

07:32

possible understanding of translation

07:35

um now for foundation models we

07:38

typically are in the realm where we're

07:39

talking about a generative model one that

07:40

places probability on natural data and

07:43

what is quite nice is we can actually

07:44

characterize the lossless compression of

07:46

a data set using a generative model in a

07:48

very precise mathematical format so Ray

07:51

Solomonoff says we should try and find

07:53

the minimum description length well we

07:55

can actually try and do this practically

07:57

with a generator model so the size the

08:00

lossless compression of our data set D

08:02

can be characterized as the negative log

08:05

likelihood from a generative model

08:06

evaluated over D plus the description

08:09

length of this generative model so for a

08:14

neural network we can think of this as

08:15

the amount of code to initialize the

08:17

neural network

08:18

that might actually be quite small

08:21

this is not actually something that

08:23

would be influenced by the size of the

08:24

neural network this would just be the

08:26

code to actually instantiate it so it

08:29

might be a couple hundred kilobytes to

08:31

actually Implement a code base which

08:32

trains a transformer for example and

08:35

actually this is quite a surprising fact

08:37

so what does this equation tell us does

08:40

it tell us anything new well I think it

08:42

tells us something quite profound the

08:44

first thing is we want to minimize this

08:46

general property and we can do it by two

08:48

ways one is via having a generative

08:51

model which has better and better

08:52

performance of our data set that is a

08:54

lower and lower negative log likelihood

08:55

but also we are going to account for the

08:58

prior information that we inject into F

09:01

which is that we can't stuff F full of

09:04

priors such that maybe it gets better

09:06

performance but overall it does not get

09:08

a better compression

09:10

um so

09:12

on that note yeah compression is a a

09:15

cool way of thinking about

09:17

how we should best model our data and

09:19

it's actually kind of a non-gameable

09:21

objective so contamination is a big

09:24

problem within uh machine learning and

09:27

trying to evaluate progress is often

09:29

hampered by Notions of whether or not

09:31

test sets are leaked into training sets

09:33

well with compression this is actually

09:36

not not something we can game so imagine

09:39

we pre-trained F on a whole data set D

09:42

such that it perfectly memorizes the

09:44

data set

09:45

AKA such that the probability of D is

09:48

one log probability is zero in such a

09:51

case if we go back to this formula the

09:53

first term will zip to zero

09:56

however now essentially by doing that by

09:58

injecting and pre-training our model on

10:01

this whole data set we have to add that

10:03

to the description length of our

10:04

generative model so now F not only

10:06

contains the code to train it Etc but it

10:08

also contains essentially a description

10:10

length of d

10:11

so in this setting essentially a

10:12

pre-contaminating f it does not help us

10:15

optimize the compression

10:18

and this contrasts to regular test set

10:20

benchmarking where we may be just

10:22

measuring test set performance and

10:24

hoping that measures generalization and

10:26

is essentially a proxy for compression

10:27

and it can be but also we can find lots

10:30

and lots of scenarios where we

10:31

essentially have variations of the test

10:33

set that have slipped through the net in

10:35

our training set and actually even right

10:37

now within Labs comparing large language

10:40

models this notion of contamination

10:42

affecting evals resurfaces as a continual

10:45

kind of thorn um in in the side of

10:48

kind of clarity

10:49

Okay so we've talked about philosophical

10:52

backing of the minimum description

10:54

length and maybe why it's a sensible

10:56

objective

10:58

and now I'm going to talk about it

10:59

concretely for large language models and

11:01

we can kind of map this to any uh

11:04

generative model but I'm just going to

11:06

kind of ground it specifically in the

11:07

large language model so if we think

11:10

about what is the log prob of our

11:11

data D well it's the sum of our next

11:14

token prediction of tokens over our data

11:18

set

11:19

um

11:20

so this is something that's essentially

11:22

our training objective if we think of

11:24

our data set D

11:25

um and we have one Epoch then this is

11:28

the sum of all of our training loss so

11:30

it's pretty tangible term it's a real

11:31

thing we can measure and F is the

11:33

description length of our

11:35

Transformer language model uh and

11:38

actually there are people that have

11:39

implemented a Transformer and a training

11:41

regime just without any external

11:43

libraries in about I think 100 to 200

11:45

kilobytes so this is actually something

11:47

that's very small

11:49

um and and as I said I just want to

11:51

enunciate this this is something where

11:53

it's not dependent on the size of our

11:55

neural network so if a piece of code can

11:57

instantiate a 10 layer Transformer the

12:00

same piece of code you can just change a

12:02

few numbers in the code it can

12:03

instantiate a 1000 layer Transformer

12:05

actually the description length of our

12:07

initial Transformer is unaffected really

12:10

by how large the actual neural network

12:13

is we're going to go through an example

12:15

of actually using a language model to

12:16

losslessly compress where we're going to

12:18

see why this is the case

12:21

okay so let's just give like a specific

12:23

example and try and ground this out

12:25

further so okay llama it was a very cool

12:28

paper that came out from fair just like

12:29

late last week I was looking at the

12:32

paper here's some training curves

12:34

um now forgetting the smaller two models

12:37

there are the two largest models are

12:39

trained on one Epoch of their data set

12:41

so actually we could sum their training

12:43

losses uh AKA this quantity

12:47

and we can also roughly approximate the

12:50

size of of the um of the code base that

12:53

was used to train them

12:56

um and therefore we can see like okay

12:58

which of these two models the 33b or the

13:00

65b is the better compressor and

13:01

therefore which would we expect to be

13:03

the better model at generalizing and

13:05

having greater set of capabilities so

13:09

it's pretty it's going to be pretty

13:11

obvious at 65b I'll tell you why firstly

13:13

just to drum this point home these

13:16

models all have the same description

13:17

length they have different number of

13:18

parameters but the code that's used to

13:20

generate them is actually of same of the

13:23

same complexity however they don't have

13:25

the same integral of the training loss

13:28

65b has a smaller integral of its

13:31

training loss

13:32

and therefore if we plug if we sum these

13:35

two terms we would find that 65b

13:36

essentially creates the more concise

13:39

description of its training data set

13:42

okay so that might seem a little bit

13:43

weird I'm going to even plug some actual

13:44

numbers in let's say we assume it's

13:46

about one megabyte for the code to

13:48

instantiate and train the Transformer

13:50

and then if we actually just calculate

13:53

this roughly it looks to be about say

13:55

400 gigabytes

13:57

um

13:58

you have some of your log loss

13:59

converting into bits and then bytes it's

14:02

going to be something like 400 gigabytes

14:03

and this is from an original data set

14:06

which is about 5.6 terabytes of raw text

14:08

so 1.4 trillion tokens times four is

14:11

about 5.6 terabytes so that's a

14:13

compression rate of 14x

14:15

um the best text compressor on the

14:17

Hutter Prize is 8.7x so the takeaway of

14:20

this point is

14:21

um actually as we're scaling up and

14:24

we're creating more powerful models and

14:25

we're training them on more data we're

14:27

actually creating something which

14:29

actually is providing a lower and lower

14:31

lossless compression of our data even

14:34

though the intermediate model itself may

14:36

be very large

14:40

okay so now I've talked a bit about how

14:43

large language models are state of the