Compression for AGI - Jack Rae | Stanford MLSys #76
Summary
TL;DR: In episode 76 of the Stanford MLSys seminar series, the focus is on the intriguing intersection of compression and AGI (Artificial General Intelligence), featuring guest speaker Jack Rae from OpenAI. The talk delves into foundation models and their significant role in shaping the future of machine learning, emphasizing the importance of understanding their training objectives, limitations, and potential. Jack Rae presents a detailed exploration of compression as a key to unlocking AGI, discussing generative models as lossless compressors and highlighting the concept of minimum description length. Through this insightful discussion, the seminar sheds light on the intricate dynamics of foundation models, urging the audience to think deeply about their applications and the broader implications for AI research.
Takeaways
- 😃 Large language models like GPT-3 are state-of-the-art lossless compressors, able to compress data at rates better than traditional algorithms like gzip.
- 🤔 The minimum description length principle, which aims to find the smallest possible representation of data, has deep philosophical roots and may be key to achieving artificial general intelligence (AGI).
- 🧐 Training large language models is essentially a process of lossless compression, where the objective is to minimize the number of bits required to encode the training data.
- 💡 Scaling up model size and training data can lead to better compression and potentially improved generalization, but algorithmic advances beyond just scaling are also important.
- ⚠️ While compression is a rigorous objective, evaluating models solely on compression metrics may be uninformative, and tracking emergent capabilities is crucial.
- 🔍 Arithmetic encoding provides a way to losslessly compress data using a language model's predictions, though the process is computationally expensive.
- ✨ Architectures that can adaptively allocate compute based on input complexity, like the S4 model, may be important for efficiently compressing multi-modal data like images and audio.
- 🚧 Lossy compression, while related, is distinct from the lossless compression objective and may not lead to better generalization.
- 🔑 The description length of a model itself (e.g., the code to instantiate it) is typically small compared to the compressed data size, regardless of model scale.
- 🌱 Future breakthroughs in areas like data efficiency, adaptive compute, and new architectures could lead to further paradigm shifts in compression and generalization capabilities.
Q & A
What is the main topic of the talk?
-The main topic of the talk is compression for artificial general intelligence (AGI), and how techniques like lossless compression using large language models can potentially help in solving perception and generalization problems.
Why is the minimum description length principle important according to the speaker?
-The speaker argues that seeking the minimum description length of data may be an important principle in solving perception and generalizing well, as it has philosophical roots going back to Aristotle and William of Ockham and a rigorous mathematical formulation in Solomonoff's theory of inductive inference.
How are large language models related to lossless compression?
-The speaker explains that large language models are actually state-of-the-art lossless compressors, as training them involves minimizing the negative log-likelihood over the training data, which is equivalent to lossless compression.
Can you explain the example of Satya and Sundar used to illustrate lossless compression?
-The example involves Satya encoding a dataset using a trained language model and arithmetic coding, and sending the encoded transcripts and model code to Sundar. Sundar can then reconstruct the original dataset by running the code and using arithmetic decoding with the predicted token probabilities.
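A minimal sketch of this round trip, assuming a toy deterministic model in place of a trained LLM and exact rational arithmetic in place of a real streaming arithmetic coder (a real coder would emit roughly -log2(p) bits per token rather than a Fraction object):

```python
from fractions import Fraction

VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_model(prefix):
    """Deterministic stand-in for a trained LLM: sender and receiver
    must produce identical next-token distributions for the same prefix."""
    if prefix and prefix[-1] == "the":
        weights = [1, 7, 1, 1]
    elif prefix and prefix[-1] == "sat":
        weights = [1, 1, 1, 7]
    else:
        weights = [4, 2, 2, 2]
    total = sum(weights)
    return {tok: Fraction(w, total) for tok, w in zip(VOCAB, weights)}

def encode(tokens):
    """Narrow [0, 1) to a sub-interval that uniquely identifies the sequence;
    each token costs about -log2(p) bits of the interval's binary expansion."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs, span, cum = toy_model(tokens[:i]), high - low, Fraction(0)
        for v in VOCAB:
            if v == tok:
                low, high = low + span * cum, low + span * (cum + probs[v])
                break
            cum += probs[v]
    return (low + high) / 2   # any number inside the final interval works

def decode(code):
    """Replay the same model, at each step picking the sub-interval
    that contains the transmitted number."""
    tokens = []
    low, high = Fraction(0), Fraction(1)
    while not tokens or tokens[-1] != "<eos>":
        probs, span, cum = toy_model(tokens), high - low, Fraction(0)
        for v in VOCAB:
            sub_low, sub_high = low + span * cum, low + span * (cum + probs[v])
            if sub_low <= code < sub_high:
                tokens.append(v)
                low, high = sub_low, sub_high
                break
            cum += probs[v]
    return tokens

message = ["the", "cat", "sat", "<eos>"]
assert decode(encode(message)) == message   # lossless round trip
```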
What is the potential recipe for solving perception and moving towards AGI according to the speaker?
-The recipe is to first collect all useful perceptual information, and then learn to compress it as best as possible with a powerful foundation model, through techniques like scaling data and compute, or algorithmic advances.
What is the main limitation of the compression approach mentioned by the speaker?
-One limitation is that modeling and compressing everything at a low level (e.g., pixels for images) may be computationally expensive and inefficient, so some form of filtering or semantic understanding may be needed first.
How does the speaker view the role of reinforcement learning in relation to compression?
-The speaker notes that while compression is important for observable data, reinforcement learning and on-policy behavior are still crucial for gathering useful information that may not be directly observable.
What is the speaker's opinion on the Hutter Prize for lossless compression?
-The speaker believes that while the Hutter Prize aims to promote compression, it has not been fruitful because it focuses on compressing a small, fixed amount of data, underestimating the benefits of scaling data and compute.
How does the compression perspective inform the development of new architectures?
-The speaker suggests that the compression perspective could inspire research into architectures that can adapt their compute and attention based on the information content of the input, similar to how biological systems allocate resources non-uniformly.
What is the speaker's overall view on the importance of compression research?
-The speaker believes that while the compression objective provides a rigorous foundation for generalization, the primary focus should be on evaluating and tracking the emergence of new capabilities in models, as those are ultimately what people care about.
Outlines
🎉 Introduction to Compression and AGI Seminar
The Stanford MLSys seminar series introduces a talk by Jack Rae from OpenAI, focusing on compression for Artificial General Intelligence (AGI). The seminar highlights the partnership with CS324 on advances in Foundation Models. Participants are encouraged to engage and ask questions via YouTube chat or Discord. The session promises insightful discussions on the training objectives of foundation models, their limitations, and the significance of compression in the context of AGI.
📊 Foundation Models and Minimum Description Length
This section delves into the concept of minimum description length (MDL) and its relevance to understanding and improving foundation models. Jack Rae discusses the historical and philosophical underpinnings of seeking the minimum description length for data compression and generalization, referencing Solomonoff's theory of inductive inference. The segment also explores generative models as lossless compressors, highlighting how large language models, despite their size, excel at state-of-the-art lossless data compression.
🔍 Exploring Lossless Compression with Large Language Models
Jack Rae elucidates the mechanics of lossless compression in large language models through a detailed example involving LLaMA models. He demonstrates that larger models, such as the 65-billion-parameter version, achieve better compression, suggesting superior generalization capabilities. The talk emphasizes the counterintuitive nature of large language models being efficient lossless compressors and explains the mathematical basis for evaluating the compression efficiency of these models.
🌐 Arithmetic Encoding and Model Training
The seminar continues with an in-depth discussion of arithmetic encoding as a method for data compression. Through a hypothetical scenario involving two individuals, Satya and Sundar, Rae illustrates how arithmetic encoding and decoding work in tandem with a generative model to achieve lossless compression of a dataset. This process underlines that the description length of the model (the code needed to instantiate it) does not depend on the size of the neural network; compression efficiency rests instead on the model's ability to predict the next token accurately.
📈 Towards AGI: The Importance of Compression
Jack Rae outlines a two-step approach towards achieving AGI: collecting useful perceptual information and compressing it efficiently using powerful foundation models. He argues that any research method improving compression can advance capabilities towards better perception, supporting the idea with examples of how lossless compression aids in understanding and generalization. Rae also addresses common confusions regarding lossy vs. lossless compression and their implications for neural networks.
🚀 The Future of Compression in AI Research
In the final part of the seminar, Rae explores potential limitations and future directions for compression in AI research. He touches on practical challenges, such as the computational expense of pixel-level image modeling, and the need for novel architectures that adapt to the informational content of inputs. The discussion concludes with reflections on the integral role of compression in driving advancements in AI and the continuous pursuit of algorithmic improvements alongside computational scaling.
Keywords
💡Compression
💡Minimum Description Length
💡Generative Models
💡Arithmetic Coding
💡Scaling
💡Foundation Models
💡AGI (Artificial General Intelligence)
💡Perception
💡Lossless Compression
💡Retrieval
Highlights
Compression is an objective we are already implicitly striving towards as we build better and larger models, which may be counter-intuitive given that the models themselves can be very large.
Generative models are lossless compressors, and large language models in particular are state-of-the-art lossless compressors, which may be a counter-intuitive point to many people.
Ray Solomonoff's theory of inductive inference states that if you have a universe of data generated by an algorithm, and observations of that universe encoded as a data set, they are best predicted by the smallest executable archive of that data set, known as the minimum description length.
The size of the lossless compression of a data set can be characterized as the negative log likelihood from a generative model evaluated over the data set, plus the description length of the generative model.
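Expressed as code (a sketch; the function and unit conventions here are illustrative, not from the talk):

```python
import math

def compressed_size_bits(total_nll_nats: float, model_code_bytes: int) -> float:
    """Description length of a dataset D under generative model f:
    the bits needed to arithmetic-code D with f's predictions (its negative
    log likelihood, converted from nats to bits) plus the bits needed to
    describe f itself -- the training code, not the learned weights."""
    nll_bits = total_nll_nats / math.log(2)
    return nll_bits + 8 * model_code_bytes
```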
Generative models like large language models are state-of-the-art lossless compressors, able to compress datasets like the one used to train the 65B parameter LLaMA model by 14x compared to the original data size.
Arithmetic encoding allows mapping a token to a compressed transcript using exactly -log2(p) bits, where p is the model's predicted probability for that token. Arithmetic decoding can recover the original token from the transcript if the probability distribution is known.
Larger models trained for more compute steps tend to achieve better compression, explaining their superior generalization performance despite increased model size.
Retrieval-augmented language models that can look ahead at future tokens would be "cheating" from a compression standpoint and may fool performance metrics without true generalization gains.
Model architectures that can dynamically allocate compute based on information content, similar to how human perception works, could improve the inefficiency of current models that spend uniform compute on all inputs.
Pixel-level image and video modeling is very compute-intensive with current architectures but may be viable with architectures that can gracefully process inputs at the appropriate "thinking frequency".
The Hutter prize's small 100MB data limit failed to incentivize meaningful compression research, while the transition to large language models provided a bigger boost.
While compression is a rigorous objective, model capabilities that people fundamentally care about should be continually evaluated alongside compression metrics.
Training for multiple epochs may be justified from a compression perspective if treated as a form of replay, where only predictions on held-out data are scored.
S4 and other architectures that enable longer context lengths and adaptive computation could help model different modalities like audio and images more efficiently.
The pace of innovation in foundation models and their applications is incredibly rapid, with amazing developments expected weekly or bi-weekly in 2023.
Transcripts
hello everyone and welcome to episode 76
of the Stanford MLSys seminar series
um today of course we're or this year
we're very excited to be partnered with
cs324 advances in Foundation models
um today I'm joined by Michael, say hi,
and Avanika
um and today our guest is Jack Rae from
OpenAI and he's got a very exciting talk
uh prep for us about compression and AGI
um so so we're very excited to listen to
him as always if if you have questions
you can post them in YouTube chat or if
you're in the class there's that Discord
Channel
um so so to keep the questions coming
and after his talk we will we'll have a
great discussion
um so with that Jack take it away
okay fantastic thanks a lot
and right
okay so
um today I'm going to talk about
compression for AGI and the theme of
this talk is that I want people to kind
of think deeply about uh Foundation
models and their training objective and
think deeply about kind of what are we
doing why does it make sense what are
the limitations
um
this is quite a important topic at
present I think there's a huge amount of
interest in this area in Foundation
models large language models their
applications and a lot of it is driven
very reasonably just from this principle
that it works and it works so it's
interesting but if we just kind of sit
within the kind of it works realm it's
hard to necessarily predict or have a
good intuition of why it might work or
where it might go
so some takeaways that I hope
people take
from this talk, some of them are
quite pragmatic so I'm going to talk
about some background on the minimum
description length and why seeking
the minimum description length of our
data may play an important role in solving
perception uh I want to make a
particular point that generative models
are actually lossless compressors and
specifically large language models are
actually state of the art lossless
compressors which may be a
counter-intuitive point to many people
given that they are very large and use a
lot of space and I'm going to unpack
that
in detail and then I'm also going to
kind of end on some notes of limitations
of the approach of compression
so
let's start with this background minimum
description length and why it relates to
perception so
even going right back to the kind of
ultimate goal of learning from data we
may have some set of observations that
we've collected some set of data that we
want to learn about which we consider
this small red circle
and we actually have a kind of a
two-pronged goal we want to learn like
uh how to kind of predict and understand
our observed data with the goal of
understanding and generalizing to a much
larger set of Universe of possible
observations so we can think of this as
if we wanted to learn from dialogue data
for example we may have a collection of
dialogue transcripts but we don't
actually care about only learning about
those particular dialogue transcripts we
want to then be able to generalize to
the superset of all possible valid
conversations that a model may come
across right so
what is an approach what is a very like
rigorous approach to trying to learn to
generalize well I mean this has been a
philosophical question for multiple
thousands of years
um
and even actually in the fourth century
BC uh there's like some pretty good
um principles that philosophers are
thinking about so Aristotle had this
notion of
um
assuming the superiority of the
demonstration which derives from fewer
postulates or hypotheses so this notion
of uh we have some
um
um simple set of hypotheses
um
then this is probably going to be a
superior description of a demonstration
now this kind of General kind of simpler
is better
um
theme is more recently attributed to
William of Ockham in the 14th century, Occam's razor; this
is something many people may have
encountered during a machine learning or
computer science class
he is essentially continuing on this
kind of philosophical theme the simplest
of several competing explanations is
always likely likely to be the correct
one
um now I think we can go even further
than this within machine learning I
think right now Occam's razor is almost
used to defend almost every possible
angle of research but I think one
actually very rigorous incarnation of
Occam's razor is Ray Solomonoff's
theory of inductive inference from 1964. so
we're almost at the present day and he
says something quite concrete and
actually mathematically proven which is
that if you have a universe of data
which is generated by an algorithm and
observations of that universe so this is
the small red circle
encoded as a data set are best predicted
by the smallest executable Archive of
that data set so that says the smallest
lossless prediction or otherwise known
as the minimum description length so I
feel like that final one is actually
putting into mathematical and quite
concrete terms
um these kind of notions that existed
throughout time nicely
and it kind of we could even relate this
to a pretty I feel like that is a quite
a concrete and actionable retort to this
kind of
um quite
um murky original philosophical question
but if we even apply this to a
well-known philosophical problem, Searle's
Chinese room thought experiment, where there's
this notion of a computer program or
even a person kind of with it within a
room that is going to perform
translation from English English to
Chinese and they're going to
specifically use a complete rulebook of
all possible
inputs or possible say English phrases
they receive and then and then the
corresponding say Chinese translation
and the original question is does this
person kind of understand how to perform
translation uh and I think actually this
compression argument, this Solomonoff
compression argument, is going to give us
something quite concrete here so uh this
is kind of going back to the small red
circle large white circle if if we have
all possible translations and then we're
just following the rule book this is
kind of the least possible understanding
we can have of translation if we have
such a giant book of all possible
translations and it's quite intuitive if
we all we have to do is coin a new word
or have a new phrase or anything which
just doesn't actually fit in the
original book this system will
completely fail to translate because it
has the least possible understanding of
translation and it has the least
understandable version of translation
because that's the largest possible
representation of the the task the data
set however if we could make this
smaller maybe we kind of distill
sorry we distill this to a smaller set
of rules some grammar some basic
vocabulary and then we can execute this
program maybe such a system has a better
understanding of translation so we can
kind of grade it based on how compressed
this rulebook is and actually if we
could kind of compress it down to the
kind of minimum description like the
most compressed format the task we may
even argue such a system has the best
possible understanding of translation
um now for foundation models we
typically are in the realm where we're
talking about a generative model, one that
places probability on natural data and
what is quite nice is we can actually
characterize the lossless compression of
a data set using a generative model in a
very precise mathematical format so
Solomonoff says we should try and find
the minimum description length well we
can actually try and do this practically
with a generative model so the size of the
lossless compression of our data set D
can be characterized as the negative log
likelihood from a generative model
evaluated over D plus the description
length of this generative model so for a
neural network we can think of this as
the amount of code to initialize the
neural network
that might actually be quite small
this is not actually something that
would be influenced by the size of the
neural network this would just be the
code to actually instantiate it so it
might be a couple hundred kilobytes to
actually Implement a code base which
trains a transformer for example and
actually this is quite a surprising fact
so what does this equation tell us does
it tell us anything new well I think it
tells us something quite profound the
first thing is we want to minimize this
general property and we can do it by two
ways one is via having a generative
model which has better and better
performance of our data set that is a
lower and lower negative log likelihood
but also we are going to account for the
prior information that we inject into F
which is that we can't stuff F full of
priors such that maybe it gets better
performance but overall it does not get
a better compression
um so
on that note yeah compression is a a
cool way of thinking about
how we should best model our data and
it's actually kind of a non-gameable
objective so contamination is a big
problem within uh machine learning and
trying to evaluate progress is often
hampered by Notions of whether or not
test sets are leaked into training sets
well with compression this is actually
not not something we can game so imagine
we pre-trained F on a whole data set D
such that it perfectly memorizes the
data set
AKA such that the probability of D is
one log probability is zero in such a
case if we go back to this formula the
first term will zip to zero
however now essentially by doing that by
injecting and pre-training our model on
this whole data set we have to add that
to the description length of our
generative model so now F not only
contains the code to train it Etc but it
also contains essentially a description
length of d
so in this setting essentially
pre-contaminating f does not help us
optimize the compression
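(An illustrative back-of-the-envelope sketch of this point, with made-up byte counts rather than figures from the talk: memorizing D zeroes the likelihood term but forces D itself into the description of f, so the total cannot shrink.)

```python
# Hypothetical sizes, in bytes, chosen only to illustrate the argument.
dataset_bytes       = 5_000_000_000   # |D|: the raw training data
training_code_bytes = 1_000_000       # code to instantiate and train f

# Honest setup: f is just the training code; D is paid for by the NLL term.
nll_bytes_honest = 400_000_000        # model with non-zero loss on D
total_honest     = nll_bytes_honest + training_code_bytes

# "Contaminated" setup: f has memorized D, so NLL(D) ~ 0, but the
# description of f must now include D itself.
total_contaminated = 0 + training_code_bytes + dataset_bytes

assert total_contaminated > total_honest   # memorization doesn't help
```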
and this contrasts to regular test set
benchmarking where we may be just
measuring test set performance and
hoping that measures generalization and
is essentially a proxy for compression
and it can be but also we can find lots
and lots of scenarios where we
essentially have variations of the test
set that have slipped through the net in
our training set and actually even right
now within Labs comparing large language
models this notion of contamination
affecting evals is a continual
kind of thorn in the side of
clarity
Okay so we've talked about philosophical
backing of the minimum description
length and maybe why it's a sensible
objective
and now I'm going to talk about it
concretely for large language models and
we can kind of map this to any uh
generative model but I'm just going to
kind of ground it specifically in the
large language model so if we think
about what is the log prob of our
data D well it's the sum of our next
token prediction of tokens over our data
set
um
so this is something that's essentially
our training objective if we think of
our data set D
um and we have one Epoch then this is
the sum of all of our training loss so
it's pretty tangible term it's a real
thing we can measure and F is the
description length of our
Transformer language model uh and
actually there are people that have
implemented a Transformer and a training
regime just without any external
libraries in about I think 100 to 200
kilobytes so this is actually something
that's very small
um and and as I said I just want to
enunciate this this is something where
it's not dependent on the size of our
neural network so if a piece of code can
instantiate a 10 layer Transformer the
same piece of code you can just change a
few numbers in the code it can
instantiate a 1000 layer Transformer
actually the description length of our
initial Transformer is unaffected really
by how large the actual neural network
is we're going to go through an example
of actually using a language model to
losslessly compress where we're going to
see why this is the case
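(As a hypothetical illustration of this point, not code from the talk: the program below builds a 10-layer or a 1000-layer Transformer by changing one number, so its description length is essentially independent of the parameter count.)

```python
import torch.nn as nn

def build_transformer(n_layers: int, d_model: int = 64, n_heads: int = 4):
    # The length of this source code does not grow with n_layers.
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                       dim_feedforward=256)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

small = build_transformer(n_layers=10)    # relatively few parameters
large = build_transformer(n_layers=1000)  # ~100x the parameters, same code
```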
okay so let's just give like a specific
example and try and ground this out
further so okay llama it was a very cool
paper that came out from FAIR just like
late last week I was looking at the
paper here's some training curves
um now forgetting the smaller two models
there are the two largest models are
trained on one Epoch of their data set
so actually we could sum their training
losses uh AKA this quantity
and we can also roughly approximate the
size of of the um of the code base that
was used to train them
um and therefore we can see like okay
which of these two models the 33b or the
65b is the better compressor and
therefore which would we expect to be
the better model at generalizing and
having greater set of capabilities so
it's pretty it's going to be pretty
obvious it's 65b I'll tell you why firstly
just to drum this point home these
models all have the same description
length they have different number of
parameters but the code that's used to
generate them is actually of same of the
same complexity however they don't have
the same integral of the training loss
65b has a smaller integral under its
training loss
and therefore if we plug if we sum these
two terms we would find that 65b
essentially creates the more concise
description of its training data set
okay so that might seem a little bit
weird I'm going to even plug some actual
numbers in let's say we assume it's
about one megabyte for the code to
instantiate and train the Transformer
and then if we actually just calculate
this roughly it looks to be about say
400 gigabytes
um
you have the sum of your log loss
converting into bits and then bytes it's
going to be something like 400 gigabytes
and this is from an original data set
which is about 5.6 terabytes of raw text
so 1.4 trillion tokens times four is
about 5.6 terabytes so that's a
compression rate of 14x
um the best text compressor on the
Hutter Prize is 8.7x so the takeaway of
this point is
um actually as we're scaling up and
we're creating more powerful models and
we're training them on more data we're
actually creating something which
actually is providing a lower and lower
lossless compression of our data even
though the intermediate model itself may
be very large
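(Reconstructing this back-of-the-envelope calculation in code, with the average per-token loss inferred from the figures quoted above rather than taken from the LLaMA paper:)

```python
import math

tokens        = 1.4e12   # ~1.4T training tokens, one epoch (65B LLaMA)
avg_loss_nats = 1.6      # assumed average next-token training loss
code_bytes    = 1e6      # ~1 MB of code to instantiate and train the model

compressed_bytes = tokens * avg_loss_nats / math.log(2) / 8 + code_bytes
raw_bytes        = tokens * 4               # ~5.6 TB of raw text

print(f"compressed ≈ {compressed_bytes / 1e9:.0f} GB")             # ≈ 400 GB
print(f"compression ratio ≈ {raw_bytes / compressed_bytes:.1f}x")  # ≈ 14x
```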
okay so now I've talked a bit about how
large language models are state of the