# Compression for AGI - Jack Rae | Stanford MLSys #76

### Summary

TL;DR: In episode 76 of the Stanford MLSys seminar series, the focus is on the intersection of compression and AGI (Artificial General Intelligence), featuring guest speaker Jack Rae from OpenAI. The talk delves into foundation models and their role in shaping the future of machine learning, emphasizing the importance of understanding their training objectives, limitations, and potential. Jack Rae presents a detailed exploration of compression as a key to unlocking AGI, discussing generative models as lossless compressors and highlighting the concept of minimum description length. The seminar sheds light on the dynamics of foundation models, urging the audience to think deeply about their applications and the broader implications for AI research.

### Takeaways

- 😃 Large language models like GPT-3 are state-of-the-art lossless compressors, able to compress data at rates better than traditional algorithms like gzip.
- 🤔 The minimum description length principle, which aims to find the smallest possible representation of data, has deep philosophical roots and may be key to achieving artificial general intelligence (AGI).
- 🧐 Training large language models is essentially a process of lossless compression, where the objective is to minimize the number of bits required to encode the training data (see the sketch after this list).
- 💡 Scaling up model size and training data can lead to better compression and potentially improved generalization, but algorithmic advances beyond just scaling are also important.
- ⚠️ While compression is a rigorous objective, evaluating models solely on compression metrics may be uninformative, and tracking emergent capabilities is crucial.
- 🔍 Arithmetic encoding provides a way to losslessly compress data using a language model's predictions, though the process is computationally expensive.
- ✨ Architectures that can adaptively allocate compute based on input complexity, like the S4 model, may be important for efficiently compressing multi-modal data like images and audio.
- 🚧 Lossy compression, while related, is distinct from the lossless compression objective and may not lead to better generalization.
- 🔑 The description length of a model itself (e.g., the code to instantiate it) is typically small compared to the compressed data size, regardless of model scale.
- 🌱 Future breakthroughs in areas like data efficiency, adaptive compute, and new architectures could lead to further paradigm shifts in compression and generalization capabilities.
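
To make the "training is lossless compression" takeaway concrete, here is the sketch promised above: a minimal Python illustration of the bookkeeping, with hypothetical loss and code-size numbers. The two steps, converting summed next-token loss from nats to bits and adding the size of the training code itself, are the two-part code the talk describes.

```python
import math

def lossless_size_bytes(sum_train_loss_nats: float, code_size_bytes: float) -> float:
    """Two-part code length for a dataset: bits needed to arithmetic-code the
    data under the model's next-token predictions (the summed training loss,
    converted from nats to bits), plus the description length of the model,
    i.e. the code that instantiates and trains it, not the parameters."""
    data_bits = sum_train_loss_nats / math.log(2)   # nats -> bits
    return data_bits / 8 + code_size_bytes

# Hypothetical single-epoch run: 1e9 tokens at an average loss of 2.0 nats/token,
# plus ~1 MB of training code.
print(lossless_size_bytes(1e9 * 2.0, 1e6) / 1e9, "GB")   # ~0.36 GB
```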

### Q & A

### What is the main topic of the talk?

The main topic of the talk is compression for artificial general intelligence (AGI), and how techniques like lossless compression using large language models can potentially help in solving perception and generalization problems.

### Why is the minimum description length principle important according to the speaker?

The speaker argues that seeking the minimum description length of data may be an important principle in solving perception and generalizing well, as it has deep philosophical roots in thinkers like Aristotle and William of Ockham and a rigorous mathematical foundation in Solomonoff's theory of inductive inference.

### How are large language models related to lossless compression?

The speaker explains that large language models are actually state-of-the-art lossless compressors, as training them involves minimizing the negative log-likelihood over the training data, which is equivalent to lossless compression.
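
In symbols (a restatement of the talk's two-part code, not the speaker's exact notation), with $f$ the generative model and $D$ the dataset:

```latex
\underbrace{L(D)}_{\text{compressed size of } D}
  \;=\; \underbrace{-\log_2 p_f(D)}_{\text{bits to arithmetic-code } D \text{ under } f}
  \;+\; \underbrace{L(f)}_{\text{bits of code that instantiates and trains } f}
```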

### Can you explain the example of Satya and Sundar used to illustrate lossless compression?

The example involves Satya encoding a dataset using a trained language model and arithmetic coding, and sending the encoded transcripts and model code to Sundar. Sundar can then reconstruct the original dataset by running the code and using arithmetic decoding with the predicted token probabilities.
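
Below is a minimal sketch of that round trip, using exact rational arithmetic rather than the fixed-precision windowing a production arithmetic coder would use. The toy `next_probs` function is a hypothetical stand-in for the trained language model that both parties share; any model works as long as encoder and decoder query it identically.

```python
from fractions import Fraction

def narrow(tokens, next_probs):
    """Shrink [low, high) once per token, by the model's predicted probability."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs = next_probs(tokens[:i])       # token -> Fraction, summing to 1
        cum, width = Fraction(0), high - low
        for t in sorted(probs):              # fixed ordering shared by both sides
            if t == tok:
                low, high = low + cum * width, low + (cum + probs[t]) * width
                break
            cum += probs[t]
    return low, high

def encode(tokens, next_probs):
    """Shortest bit string whose dyadic interval fits inside [low, high)."""
    low, high = narrow(tokens, next_probs)
    bits, a, b = [], Fraction(0), Fraction(1)
    while not (low <= a and b <= high):
        mid = (a + b) / 2
        if mid <= low:                       # target entirely in the upper half
            bits.append(1); a = mid
        elif mid >= high:                    # target entirely in the lower half
            bits.append(0); b = mid
        elif high - mid >= mid - low:        # straddle: keep the larger overlap
            bits.append(1); a = mid
        else:
            bits.append(0); b = mid
    return bits

def decode(bits, n_tokens, next_probs):
    """Replay the same narrowing, picking the token whose slice contains the
    encoded point, which is why Sundar needs the same model as Satya."""
    point = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))
    out, low, high = [], Fraction(0), Fraction(1)
    for _ in range(n_tokens):
        probs, cum, width = next_probs(out), Fraction(0), high - low
        for t in sorted(probs):
            lo_t, hi_t = low + cum * width, low + (cum + probs[t]) * width
            if lo_t <= point < hi_t:
                out.append(t); low, high = lo_t, hi_t
                break
            cum += probs[t]
    return out

def next_probs(prefix):
    """Toy stand-in for the shared language model: a fixed distribution."""
    return {"a": Fraction(3, 4), "b": Fraction(1, 4)}

msg = list("aababaa")
bits = encode(msg, next_probs)
assert decode(bits, len(msg), next_probs) == msg
print(f"{len(bits)} bits for {len(msg)} tokens")
```

A token the model assigns probability p narrows the interval by a factor of p, so the whole message costs roughly the sum of -log2(p) per token; in practice the decoder also needs the token count or an end-of-sequence marker, hence the `n_tokens` argument here.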

### What is the potential recipe for solving perception and moving towards AGI according to the speaker?

The recipe is to first collect all useful perceptual information, and then learn to compress it as well as possible with a powerful foundation model, through techniques like scaling data and compute, or algorithmic advances.

### What is the main limitation of the compression approach mentioned by the speaker?

One limitation is that modeling and compressing everything at a low level (e.g., pixels for images) may be computationally expensive and inefficient, so some form of filtering or semantic understanding may be needed first.

### How does the speaker view the role of reinforcement learning in relation to compression?

The speaker notes that while compression is important for observable data, reinforcement learning and on-policy behavior are still crucial for gathering useful information that may not be directly observable.

### What is the speaker's opinion on the Hutter Prize for lossless compression?

The speaker believes that while the Hutter Prize aims to promote compression, it has not been fruitful because it focuses on compressing a small, fixed amount of data, underestimating the benefits of scaling data and compute.

### How does the compression perspective inform the development of new architectures?

The speaker suggests that the compression perspective could inspire research into architectures that can adapt their compute and attention based on the information content of the input, similar to how biological systems allocate resources non-uniformly.

### What is the speaker's overall view on the importance of compression research?

The speaker believes that while the compression objective provides a rigorous foundation for generalization, the primary focus should be on evaluating and tracking the emergence of new capabilities in models, as those are ultimately what people care about.

### Outlines

### 🎉 Introduction to Compression and AGI Seminar

The Stanford MLSys seminar series introduces a talk by Jack Rae from OpenAI, focusing on compression for Artificial General Intelligence (AGI). The seminar highlights the partnership with CS324 on advances in Foundation Models. Participants are encouraged to engage and ask questions via YouTube chat or Discord. The session promises insightful discussion of the training objectives of foundation models, their limitations, and the significance of compression in the context of AGI.

### 📊 Foundation Models and Minimum Description Length

This section delves into the concept of minimum description length (MDL) and its relevance to understanding and improving foundation models. Jack Rae discusses the historical and philosophical underpinnings of seeking the minimum description length for data compression and generalization, referencing Solomonoff's theory of inductive inference. The segment also explores generative models as lossless compressors, highlighting how large language models, despite their size, achieve state-of-the-art lossless data compression.

### 🔍 Exploring Lossless Compression with Large Language Models

Jack Rae explains the mechanics of lossless compression in large language models through a detailed example involving the LLaMA models. He shows that larger models, such as the 65-billion-parameter version, achieve better compression, suggesting superior generalization capabilities. The talk emphasizes the counterintuitive nature of large language models being efficient lossless compressors and explains the mathematical basis for evaluating their compression efficiency.

### 🌐 Arithmetic Encoding and Model Training

The seminar continues with an in-depth discussion of arithmetic encoding as a method for data compression. Through a hypothetical scenario involving two individuals, Satya and Sundar, Rae illustrates how arithmetic encoding and decoding work in tandem with a generative model to achieve lossless compression of a dataset. The process underscores that the model's description length does not depend on the size of the neural network; compression efficiency instead hinges on how accurately the model predicts the next token.

### 📈 Towards AGI: The Importance of Compression

Jack Rae outlines a two-step approach toward achieving AGI: collecting useful perceptual information and compressing it efficiently using powerful foundation models. He argues that any research method that improves compression can advance capabilities toward better perception, supporting the idea with examples of how lossless compression aids understanding and generalization. Rae also addresses common confusions regarding lossy vs. lossless compression and their implications for neural networks.

### 🚀 The Future of Compression in AI Research

In the final part of the seminar, Rae explores potential limitations and future directions for compression in AI research. He touches on practical challenges, such as the computational expense of pixel-level image modeling, and the need for novel architectures that adapt to the informational content of their inputs. The discussion concludes with reflections on the integral role of compression in driving advances in AI and the continued pursuit of algorithmic improvements alongside computational scaling.

### Keywords

- 💡 Compression
- 💡 Minimum Description Length
- 💡 Generative Models
- 💡 Arithmetic Coding
- 💡 Scaling
- 💡 Foundation Models
- 💡 AGI (Artificial General Intelligence)
- 💡 Perception
- 💡 Lossless Compression
- 💡 Retrieval

### Highlights

Compression has been an objective we are implicitly striving toward as we build better and larger models, which may be counterintuitive given that the models themselves can be very large.

Generative models are lossless compressors, and large language models in particular are state-of-the-art lossless compressors, which may be a counterintuitive point to many people.

Ray Solomonoff's theory of inductive inference (1964) states that if you have a universe of data generated by an algorithm, and observations of that universe encoded as a dataset, they are best predicted by the smallest executable archive of that dataset, known as the minimum description length.

The size of the lossless compression of a data set can be characterized as the negative log likelihood from a generative model evaluated over the data set, plus the description length of the generative model.

Generative models like large language models are state-of-the-art lossless compressors, able to compress datasets like the one used to train the 65B parameter LLaMA model by 14x compared to the original data size.
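
The arithmetic behind that 14x figure is easy to reproduce. The token count and the ~1 MB allowance for training code come from the talk; the average training loss below is a hypothetical value back-solved to land on the talk's ~400 GB compressed size:

```python
import math

tokens = 1.4e12                 # LLaMA training set, ~1.4T tokens (from the talk)
raw_bytes = tokens * 4          # ~4 bytes of raw text per token -> ~5.6 TB
mean_loss_nats = 1.58           # hypothetical average next-token training loss
data_bits = tokens * mean_loss_nats / math.log(2)   # nats -> bits
compressed_bytes = data_bits / 8 + 1e6              # + ~1 MB of training code
print(f"{raw_bytes / compressed_bytes:.1f}x")       # ~14.0x
```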

Arithmetic encoding allows mapping a token to a compressed transcript using close to -log2(p) bits, where p is the model's predicted probability for that token. Arithmetic decoding can recover the original token from the transcript if the probability distribution is known.
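
As a quick numeric illustration of that per-token cost (ideal code lengths, ignoring the coder's small constant overhead):

```python
import math

# Ideal code length for a token the model predicts with probability p.
for p in (0.5, 0.9, 0.99, 0.01):
    print(f"p = {p:<4} -> {-math.log2(p):6.3f} bits")
# Confident correct predictions cost a fraction of a bit; surprises cost many.
```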

Larger models trained for more compute steps tend to achieve better compression, explaining their superior generalization performance despite increased model size.

Retrieval-augmented language models that can look ahead at future tokens would be "cheating" from a compression standpoint and may fool performance metrics without true generalization gains.

Model architectures that can dynamically allocate compute based on information content, similar to how human perception works, could address the inefficiency of current models, which spend uniform compute on all inputs.

Pixel-level image and video modeling is very compute-intensive with current architectures but may be viable with architectures that can gracefully process inputs at the appropriate "thinking frequency".

The Hutter Prize's small 100 MB data limit failed to incentivize meaningful compression research, while the transition to large language models provided a bigger boost.

While compression is a rigorous objective, model capabilities that people fundamentally care about should be continually evaluated alongside compression metrics.

Training for multiple epochs may be justified from a compression perspective if treated as a form of replay, where only predictions on held-out data are scored.

S4 and other architectures that enable longer context lengths and adaptive computation could help model different modalities like audio and images more efficiently.

The pace of innovation in foundation models and their applications is incredibly rapid, with amazing developments expected weekly or bi-weekly in 2023.

### Transcripts

hello everyone and welcome to episode 76

of the Stanford MLSys seminar series

um today of course we're or this year

we're very excited to be partnered with

cs324 advances in Foundation models

um today I'm joined by Michael say hi

and Avanika

um and today our guest is Jack Rae from

openai and he's got a very exciting talk

uh prepped for us about compression and AGI

um so so we're very excited to listen to

him as always if if you have questions

you can post them in YouTube chat or if

you're in the class there's that Discord

Channel

um so so to keep the questions coming

and after his talk we will we'll have a

great discussion

um so with that Jack take it away

okay fantastic thanks a lot

and right

okay so

um today I'm going to talk about

compression for AGI and the theme of

this talk is that I want people to kind

of think deeply about uh Foundation

models and their training objective and

think deeply about kind of what are we

doing why does it make sense what are

the limitations

um

this is quite an important topic at

present I think there's a huge amount of

interest in this area in Foundation

models large language models their

applications and a lot of it is driven

very reasonably just from this principle

that it works and it works so it's

interesting but if we just kind of sit

within the kind of it works realm it's

hard to necessarily predict or have a

good intuition of why it might work or

where it might go

so some takeaways that I want so I hope

people like people hopefully to take

from this talk some of them are

quite pragmatic so I'm going to talk

about some background on the minimum

description length and why it's seeking

the minimum description length of our

data may play an important role in solving

perception uh I want to make a

particular point that generative models

are actually lossless compressors and

specifically large language models are

actually state of the art lossless

compressors which may be a

counter-intuitive point to many people

given that they are very large and use a

lot of space and I'm going to unpack

that

in detail and then I'm also going to

kind of end on some notes of limitations

of the approach of compression

so

let's start with this background minimum

description length and why it relates to

perception so

even going right back to the kind of

ultimate goal of learning from data we

may have some set of observations that

we've collected some set of data that we

want to learn about which we consider

this small red circle

and we actually have a kind of a

two-pronged goal we want to learn like

uh how to kind of predict and understand

our observed data with the goal of

understanding and generalizing to a much

larger set of Universe of possible

observations so we can think of this as

if we wanted to learn from dialogue data

for example we may have a collection of

dialogue transcripts but we don't

actually care about only learning about

those particular dialogue transcripts we

want to then be able to generalize to

the superset of all possible valid

conversations that a model may come

across right so

what is an approach what is a very like

rigorous approach to trying to learn to

generalize well I mean this has been a

philosophical question for multiple

thousands of years

um

and even actually kind of fourth century

BC uh there's like some pretty good

um principles that philosophers are

thinking about so Aristotle had this

notion of

um

assuming the superiority of the

demonstration which derives from fewer

postulates or hypotheses so this notion

of uh we have some

um

um simple set of hypotheses

um

then this is probably going to be a

superior description of a demonstration

now this kind of General kind of simpler

is better

um

theme is more recently attributed to

William of Ockham, 14th century: Occam's Razor this

is something many people may have

encountered during a machine learning or

computer science class

he is essentially continuing on this

kind of philosophical theme the simplest

of several competing explanations is

always likely to be the correct

one

um now I think we can go even further

than this within machine learning I

think right now Occam's razor is almost

used to defend almost every possible

angle of research but I think one

actually very rigorous incarnation of

Occam's Razor is from Ray Solomonoff's

theory of inductive inference 1964. so

we're almost at the present day and he

says something quite concrete and

actually mathematically proven which is

that if you have a universe of data

which is generated by an algorithm and

observations of that universe so this is

the small red circle

encoded as a data set are best predicted

by the smallest executable Archive of

that data set so that says the smallest

lossless compression otherwise known

as the minimum description length so I

feel like that final one is actually

putting into mathematical and quite

concrete terms

um these kind of Notions that existed

through time in philosophy

and it kind of we could even relate this

to a pretty I feel like that is a quite

a concrete and actionable retort to this

kind of

um quite

um murky original philosophical question

but if we even apply this to a

well-known philosophical problem, Searle's

Chinese room thought experiment where there's

this notion of a computer program or

even a person kind of with it within a

room that is going to perform

translation from English to

Chinese and they're going to

specifically use a complete rulebook of

all possible

inputs or possible say English phrases

they receive and then and then the

corresponding say Chinese translation

and the original question is does this

person kind of understand how to perform

translation uh and I think actually this

compression argument this Solomonoff

compression argument is going to give us

something quite concrete here so uh this

is kind of going back to the small red

circle large white circle if if we have

all possible translations and then we're

just following the rule book this is

kind of the least possible understanding

we can have of translation if we have

such a giant book of all possible

translations and it's quite intuitive if

we all we have to do is coin a new word

or have a new phrase or anything which

just doesn't actually fit in the

original book this system will

completely fail to translate because it

has the least possible understanding of

translation and it has the least

understandable version of translation

because that's the largest possible

representation of the task the data

set however if we could make this

smaller maybe we kind of distill

sorry we distill this to a smaller set

of rules some grammar some basic

vocabulary and then we can execute this

program maybe such a system has a better

understanding of translation so we can

kind of grade it based on how compressed

this rulebook is and actually if we

could kind of compress it down to the

kind of minimum description like the

most compressed format the task we may

even argue such a system has the best

possible understanding of translation

um now for foundation models we

typically are in the realm where we're

talking about a generative model one that

places probability on natural data and

what is quite nice is we can actually

characterize the lossless compression of

a data set using a generative model in a

very precise mathematical format so Ray

Solomonoff says we should try and find

the minimum description length well we

can actually try and do this practically

with a generator model so the size the

lossless compression of our data set D

can be characterized as the negative log

likelihood from a generative model

evaluated over D plus the description

length of this generative model so for a

neural network we can think of this as

the amount of code to initialize the

neural network

that might actually be quite small

this is not actually something that

would be influenced by the size of the

neural network this would just be the

code to actually instantiate it so it

might be a couple hundred kilobytes to

actually Implement a code base which

trains a transformer for example and

actually this is quite a surprising fact

so what does this equation tell us does

it tell us anything new well I think it

tells us something quite profound the

first thing is we want to minimize this

general property and we can do it by two

ways one is via having a generative

model which has better and better

performance of our data set that is a

lower and lower negative log likelihood

but also we are going to account for the

prior information that we inject into F

which is that we can't stuff F full of

priors such that maybe it gets better

performance but overall it does not get

a better compression

um so

on that note yeah compression is a a

cool way of thinking about

how we should best model our data and

it's actually kind of a non-gameable

objective so contamination is a big

problem within uh machine learning and

trying to evaluate progress is often

hampered by Notions of whether or not

test sets are leaked into training sets

well with compression this is actually

not something we can game so imagine

we pre-trained F on a whole data set D

such that it perfectly memorizes the

data set

AKA such that the probability of D is

one log probability is zero in such a

case if we go back to this formula the

first term will zip to zero

however now essentially by doing that by

injecting and pre-training our model on

this whole data set we have to add that

to the description length of our

generative model so now F not only

contains the code to train it Etc but it

also contains essentially a description

length of d

so in this setting essentially a

pre-contaminating F does not help us

optimize the compression

and this contrasts to regular test set

benchmarking where we may be just

measuring test set performance and

hoping that measures generalization and

is essentially a proxy for compression

and it can be but also we can find lots

and lots of scenarios where we

essentially have variations of the test

set that have slipped through the net in

our training set and actually even right

now within Labs comparing large language

models this notion of contamination

affecting evals resurfaces as a continual

kind of thorn in um in in the side of

kind of clarity

Okay so we've talked about philosophical

backing of the minimum description

length and maybe why it's a sensible

objective

and now I'm going to talk about it

concretely for large language models and

we can kind of map this to any uh

generative model but I'm just going to

kind of ground it specifically in the

large language model so if we think

about what is the log prob of our

data D well it's the sum of our next

token prediction of tokens over our data

set

um

so this is something that's essentially

our training objective if we think of

our data set D

um and we have one Epoch then this is

the sum of all of our training loss so

it's pretty tangible term it's a real

thing we can measure and F is the

description length of our

Transformer language model uh and

actually there are people that have

implemented a Transformer and a training

regime just without any external

libraries in about I think 100 to 200

kilobytes so this is actually something

that's very small

um and and as I said I just want to

enunciate this this is something where

it's not dependent on the size of our

neural network so if a piece of code can

instantiate a 10 layer Transformer the

same piece of code you can just change a

few numbers in the code it can

instantiate a 1000 layer Transformer

actually the description length of our

initial Transformer is unaffected really

by how large the actual neural network

is we're going to go through an example

of actually using a language model to

losslessly compress where we're going to

see why this is the case

okay so let's just give like a specific

example and try and ground this out

further so okay llama it was a very cool

paper that came out from FAIR just like

late last week I was looking at the

paper here's some training curves

um now forgetting the smaller two models

there are the two largest models are

trained on one Epoch of their data set

so actually we could sum their training

losses uh AKA this quantity

and we can also roughly approximate the

size of of the um of the code base that

was used to train them

um and therefore we can see like okay

which of these two models the 33b or the

65b is the better compressor and

therefore which would we expect to be

the better model at generalizing and

having greater set of capabilities so

it's pretty it's going to be pretty

obvious that it's 65b I'll tell you why firstly

just to drum this point home these

models all have the same description

length they have different number of

parameters but the code that's used to

generate them is actually of the

same complexity however they don't have

the same integral of the training loss

65b has a smaller integral of its

training loss

and therefore if we plug if we sum these

two terms we would find that 65b

essentially creates the more concise

description of its training data set

okay so that might seem a little bit

weird I'm going to even plug some actual

numbers in let's say we assume it's

about one megabyte for the code to

instantiate and train the Transformer

and then if we actually just calculate

this roughly it looks to be about say

400 gigabytes

um

you sum all of your log loss

converting into bits and then bytes it's

going to be something like 400 gigabytes

and this is from an original data set

which is about 5.6 terabytes of raw text

so 1.4 trillion tokens times four is

about 5.6 terabytes so that's a

compression rate of 14x

um the best text compressor on the

Hutter Prize is 8.7x so the takeaway of

this point is

um actually as we're scaling up and

we're creating more powerful models and

we're training them on more data we're

actually creating something which

actually is providing a lower and lower

lossless compression of our data even

though the intermediate model itself may

be very large

okay so now I've talked a bit about how

large language models are state of the