Compression for AGI - Jack Rae | Stanford MLSys #76

Stanford MLSys Seminars
27 Feb 2023 · 59:53

Summary

TL;DR: This is the transcript of a talk from the Stanford Machine Learning Systems seminar series. The speaker, Jack Rae of OpenAI, explains why compression matters on the road to artificial general intelligence (AGI). He describes how the minimum description length principle relates to generalization and perception, argues that generative models are in fact lossless compressors, and analyzes the promise and limitations of large language models as state-of-the-art lossless compressors. Rae offers a novel and insightful perspective on the pursuit of AGI.

Takeaways

  • 😄 The minimum description length (MDL) principle seeks the best lossless compression of a dataset, which may be the key to solving perception.
  • 😃 Generative models are actually lossless compressors, and large language models are currently the state-of-the-art lossless compressors.
  • 🤔 With arithmetic coding, a probability distribution lets us losslessly compress and decompress data.
  • 🧐 Larger models tend to be better lossless compressors, which contradicts the common view that smaller models should generalize better.
  • 😮 The compression objective cannot be gamed; even pre-training on (memorizing) the data does not improve it.
  • 🤓 Retrieving over future tokens breaks the compression objective and should be avoided.
  • 😎 Consider building more prior knowledge into neural network initialization, analogous to biological DNA.
  • 🤯 For data such as images and video, current Transformer architectures have computational-efficiency limitations.
  • 🙃 Focusing on compression alone may not be enough; we also need to evaluate models' actual capabilities.
  • 😁 2023 is likely to be a year of large-scale innovation in foundation models and downstream applications.

Q & A

  • What is the topic of this talk?

    - The talk is about compression and artificial general intelligence (AGI). Jack Rae of OpenAI discusses how compression can be used to solve perception and advance progress toward AGI.

  • Why is the minimum description length (MDL) principle related to learning from data and to generalizing to a wider universe of observations?

    - According to Solomonoff's theory of inductive inference, if a dataset is generated by an algorithm, then its best predictor is the smallest executable archive of that dataset, i.e., its minimum description length. This principle links learning from data to the ability to generalize to a much wider universe of observations.

  • How exactly do generative models achieve lossless compression?

    - A generative model compresses the data losslessly by minimizing the negative log-likelihood of its next-token predictions. The code needed to initialize the generative model, plus the sum of the next-token log losses, is enough to reconstruct the entire dataset.
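
A minimal sketch of that bookkeeping in Python (illustrative only; the function name and toy probabilities are hypothetical, and in practice the per-token probabilities would be the model's next-token predictions made during a single deterministic training pass):

```python
import math

def description_length_bits(code_size_bytes, token_probs):
    """Lossless description length of a dataset: the code that instantiates and
    trains the model, plus the arithmetic-coding cost of every next-token prediction."""
    code_bits = code_size_bytes * 8
    data_bits = sum(-math.log2(p) for p in token_probs)  # each token costs about -log2 p(x_t) bits
    return code_bits + data_bits

# Toy example: five tokens predicted with increasing confidence.
print(description_length_bits(code_size_bytes=200_000, token_probs=[0.01, 0.05, 0.2, 0.5, 0.9]))
```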

  • How are large language models state-of-the-art lossless compressors?

    - As training data and compute scale up, large language models keep driving down the log loss and therefore raise the compression ratio on the raw data. Today's best large language models actually achieve higher compression ratios than traditional compression algorithms.

  • When evaluating models, why is the compression objective better and harder to 'game'?

    - Unlike standard test-set evaluation, a model that tries to 'cheat' by memorizing the entire training set does not actually improve the compression, because the model's description length must then include the original training data. The compression objective therefore cannot be gamed.
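
Informally, in the talk's notation (f is the generative model, D the dataset), the comparison looks like this; memorizing D drives the first term toward zero but forces the model's description to carry a copy of D, so the total does not shrink (a sketch of the argument, not a formal proof):

```latex
\underbrace{-\log_2 p_f(D)}_{\text{log loss on } D} + \underbrace{|f|}_{\text{training code}}
\quad \text{vs.} \quad
\underbrace{-\log_2 p_{f_D}(D)}_{\approx\, 0} + \underbrace{|f_D|}_{\text{training code}\ +\ \text{a description of } D}
```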

  • What are the potential challenges of training models with a compression objective?

    - One potential challenge is inefficiency: the approach tries to compress all observable information, some of which may not matter. Another is that unobservable information cannot be compressed at all, and some of that information is crucial for AGI.

  • What possible future breakthroughs in compression and AGI does the speaker see?

    - Possible breakthroughs include better model architectures, further scaling of data and compute, tool use and retrieval to aid model learning, and more efficient attention mechanisms. Overall, the speaker believes any technique that improves a model's compression will advance progress toward AGI.

  • What challenges does compression face for images and video?

    - Current Transformer architectures cannot flexibly adapt to different modalities and resolutions, so they are very inefficient for high-resolution images and video. New architectures are needed to handle visual and audio signals of different frequencies gracefully.

  • Is it important to do research that targets the compression objective directly?

    - The speaker argues that although the compression objective has a solid theoretical foundation, we should keep evaluating models' actual capabilities; compression is a rigorous underlying metric rather than the end in itself. Compression by itself should not become the primary goal of research.

  • What is the speaker's outlook for AI in 2023?

    - The speaker expects exciting breakthroughs and innovations to appear every week or two throughout 2023, whether new large models or downstream application research; the pace of innovation will be very fast. He urges everyone to be ready for the year.

Outlines

00:00

🎙️ Opening and introductions

This video is episode 76 of the Stanford MLSys seminar series, run this year in partnership with CS324, Advances in Foundation Models. The guest is Jack Rae from OpenAI, who gives a talk on compression and AGI. A discussion follows the talk, and questions are welcome in the YouTube chat or the class Discord channel.

05:01

🗜️ Minimum description length and the compression objective

Rae explains the important role that seeking the minimum description length of the data plays in solving perception, and the view that generative models are actually lossless compressors. He points out that including the model's description length in the objective prevents stuffing the model with priors and helps it generalize better. This yields a non-gameable metric for evaluating model performance.

10:01

🧩 Large language models are state-of-the-art lossless compressors

Rae explains how to think about the large compression ratios today's language models achieve even though the models themselves are enormous. Using the example of a 65B-parameter model versus a 33B one, he shows the 65B model is the better compressor, because the objective combines minimizing the training loss with the (unchanged) description length of the model's code. Overall, large language models provide better compression of their data than the best existing compression algorithms.
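
A back-of-the-envelope check using the approximate figures quoted in the talk (all numbers are the speaker's rough estimates, not measured values):

```python
# Rough figures from the talk for LLaMA-65B trained for one epoch on ~1.4T tokens.
raw_bytes = 1.4e12 * 4        # ~1.4 trillion tokens at roughly 4 bytes each ≈ 5.6 TB of raw text
loss_term_bytes = 400e9       # summed next-token log loss, converted from bits to bytes ≈ 400 GB
code_bytes = 1e6              # ~1 MB of code to instantiate and train the Transformer

ratio = raw_bytes / (loss_term_bytes + code_bytes)
print(round(ratio, 1))        # ≈ 14x, versus ~8.7x for the best Hutter Prize text compressor
```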

15:01

📜 Arithmetic coding demonstrates lossless compression

Rae walks through how arithmetic coding demonstrates the lossless compression performed by a language model. The sender transmits the model's training code plus the arithmetic-encoded transcripts; the receiver reproduces the identical training run to decode the original data. Although expensive in practice, this proves the equivalence between training and losslessly compressing the dataset.

20:02

🎯 A recipe for solving perception

Rae proposes a two-step recipe for solving perception: collect all relevant perceptual information, and compress it as well as possible with a powerful foundation model. Any research advance that improves compression will advance AI's ability to understand perception. He also discusses some current limitations, using pixel-level modeling as an example.

25:03

🔄 Further explanation of arithmetic decoding

In a question segment, Rae explains further how arithmetic decoding works with a language model. By walking through an imagined scenario that step by step reproduces the training run and ultimately recovers the original data, he makes the workings of the encode/decode procedure much clearer.

30:04

🤔 Questions about the compression approach

Answering several questions, Rae clarifies why compression is a good non-gameable objective and how to think about today's single-epoch training setups. Multi-epoch training does not itself violate the compression objective, as long as the loss is counted the right way (see the sketch below). He reiterates that whatever method is used, the core idea is to improve generalization over the dataset, and over perception, by improving compression.
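
One concrete way to "count the loss the right way" is prequential (predict-then-train) scoring, which the speaker's argument suggests: a batch contributes to the description length only the first time it is predicted, before the model trains on it. A hedged sketch with hypothetical method names:

```python
def prequential_bits(batches, model):
    """Score each batch while it is still unseen, then train on it; later replay
    of already-scored data adds nothing further to the description length."""
    total_bits = 0.0
    for batch in batches:
        total_bits += model.negative_log2_likelihood(batch)  # scored before training on it
        model.train_step(batch)                              # afterwards, replay/multi-epoch is free
    return total_bits
```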

35:05

🔭 Future opportunities and challenges

Rae discusses how the compression view can inform future training setups and model architectures. Adaptive computation and attention allocation could handle different modalities better, and more efficient encodings are worth exploring. He also notes that implicit information that cannot be observed may be a limitation of this approach.

40:05

🧐 The future of data compression and AGI

Rae stresses that although the compression objective is rigorous and worth pursuing, what ultimately matters is the capabilities models actually exhibit. Looking ahead, 2023 will be an exciting year of constant innovation across the field. Compression may not be the only path, but its theoretical significance and guiding role should not be overlooked.

45:06

💬 Q&A and deeper understanding

During the Q&A, Rae further clarifies the rationale for the compression objective and its scope of application, shows how better architectures can take inspiration from the arithmetic-coding thought experiment, and explains how today's large models actually compress data and how this is realized. He encourages researchers to understand the theory of compression more thoroughly.

50:06

🌟 Encoding large datasets and other possibilities

On the question of the cost of encoding an entire large dataset, Rae notes that in practice nobody would follow this expensive arithmetic-coding procedure literally; more efficient approximations can be used. He also acknowledges that filling in missing information will still depend on other principles and heuristics.

55:08

🔚 Seminar wrap-up

In the final part, Rae summarizes the key points of the talk and thanks everyone attending in person and online. The hosts also thank the season's speakers and preview the next episode's topic.

Keywords

💡 Minimum description length

Minimum description length (MDL) is the principle of compressing data into the smallest possible description while losing as little information as possible. The theory holds that the best model is one that compresses the data effectively while retaining enough information to predict and understand it accurately. In the video the concept serves as the yardstick for a model's ability to generalize: a good model should compress its training data while generalizing to a much wider domain of data.
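
Written out in the talk's notation (f is the generative model, D the dataset; |f| is the size of the code that instantiates and trains the model, not the weights):

```latex
|\mathrm{compress}(D)| \;=\; \underbrace{-\log_2 p_f(D)}_{\text{negative log-likelihood of } D} \;+\; \underbrace{|f|}_{\text{description length of } f}
```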

💡 Generative model

A generative model assigns a probability distribution to data and is widely used in natural language processing, computer vision, and other fields. The video notes that large generative language models are in fact the most advanced lossless compressors today, because they model text data efficiently and produce the corresponding probability distributions, enabling effective compression. A generative model's objective is to minimize the description length of the training data, which coincides with the goal of compression.

💡 Compression

Compression is the process of representing information with fewer bits. In the video it is framed as key to achieving artificial general intelligence (AGI), because good compression implies better understanding of, and generalization from, the data. Concretely, the video argues that we should collect all useful perceptual information and compress it as well as possible with a powerful foundation model. The compression objective provides a way of learning that avoids overfitting and deceptive optimization against test sets.

💡 Arithmetic coding

Arithmetic coding is an entropy-coding technique that losslessly converts a sequence of symbols into a codeword. Each symbol is encoded using a probability distribution, and the code length is inversely related to the symbol's probability (roughly -log2 p bits). The video uses arithmetic coding to show how a generative language model can losslessly compress and decompress data. By simulating a sender and a receiver transmitting data with arithmetic coding, it illustrates why a generative model is equivalent to a lossless compressor.
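
A conceptual sketch of the sender/receiver protocol described above. FreshModel and coder are stand-ins for a deterministic training setup and an arithmetic coder, not a real library; the point is only that encoding and decoding replay the identical training run:

```python
def encode(dataset_tokens, FreshModel, coder):
    # Sender: predict each token before seeing it, arithmetic-encode it under that
    # prediction (~ -log2 p bits), then take a deterministic training step on it.
    model, transcripts = FreshModel(seed=0), []
    for token in dataset_tokens:
        p = model.next_token_distribution()
        transcripts.append(coder.encode(token, p))
        model.train_step(token)
    return transcripts  # shipped together with the (small) training code

def decode(transcripts, FreshModel, coder):
    # Receiver: rerun the identical training; the same distributions unlock the same tokens.
    model, tokens = FreshModel(seed=0), []
    for z in transcripts:
        p = model.next_token_distribution()
        token = coder.decode(z, p)
        tokens.append(token)
        model.train_step(token)
    return tokens  # the original dataset, recovered losslessly
```

The transmitted payload is therefore the encoded transcripts plus the training code; the trained weights themselves are never sent.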

💡 Artificial general intelligence

Artificial general intelligence (AGI) refers to AI systems with human-like general learning and reasoning abilities. The video argues that compression is a key path to AGI, because an excellent compression system can extract the key information from sensory data and model and generalize it efficiently. By minimizing the description length of their training data, generative models are moving toward stronger perception and understanding, an important step toward AGI.

💡 Generalization

Generalization is a model's ability to extend to, and perform well on, new data it has never seen. The compression theory in the video holds that a smaller description length implies better generalization and understanding of the training data. For example, a generative language model shrinks the description length by producing more accurate next-token probability distributions, and thereby generalizes better to all possible conversations. Generalization is one of the core goals of machine learning.

💡 Loss function

A loss function measures the gap between a model's outputs and the true labels and is the quantity minimized during training. In the video, the loss supplies the log-likelihood term of the minimum description length formula. Concretely, for a language model the loss is the sum of next-token prediction losses over all time steps, which gives the data term of the description length. Minimizing the loss therefore drives the model toward the best compression, i.e., the smallest description length.

💡 Overfitting

Overfitting occurs when a model is so complex that it learns the noise in the training data, leading to poor generalization. The video notes that the compression approach avoids overfitting, because the minimum description length principle does not reward compressing noise. Moreover, if a model lowers its loss by memorizing the training data, its description length increases instead, which also counts as overfitting. Appropriate compression therefore guards against overfitting and improves generalization.

💡 Quantization

Quantization is a technique for reducing model size and inference time by lowering the numerical precision of weights and activations, saving storage and compute. In the video it is mentioned as a possible research direction that could improve compression. The video notes that the code to initialize a neural network is already very small, but if the description length could be reduced further, compression would improve, driving gains in perception and understanding.

💡 Multimodality

Multimodality means handling multiple types of input at once, such as text, images, video, and audio. The video points out that current Transformer architectures are inefficient across modalities because they spend the same amount of compute on every token regardless of how much information it carries. Compressing multimodal data well will require new architectures that can allocate compute adaptively. New architectures such as S4 are seen as an important step in this direction.

Highlights

Compression is a principled objective that generative models, including large language models, are inherently optimizing for, even though the models themselves can be very large.

Solving perception and moving towards AGI involves two steps: 1) Collect all useful perceptual information, and 2) Learn to compress it as best as possible with a powerful foundation model.

The minimum description length principle, rooted in philosophy and formalized by Solomonoff's theory of inductive inference, provides a rigorous mathematical foundation linking compression to generalization and intelligence.

Generative models like large language models are actually state-of-the-art lossless compressors, a counterintuitive fact given their large size.

Compression is a non-gameable objective, unlike test set benchmarking which can be susceptible to contamination.

The description length of a neural network is relatively small, determined by the code needed to instantiate it, rather than scaling with the model size.

Scaling data and compute has been a major driver for improved compression and capabilities in language models, but algorithmic advances are also crucial open research problems.

Retrieval over future tokens during training would be "cheating" from a compression perspective, although it may improve test metrics.

The compression objective does not equate to training for only one epoch; replay and multi-epoch training can be valid if only scoring predictions on held-out data.

While compression provides a rigorous objective, evaluating models based on their capabilities is still crucial, as compression itself is an "alien" metric disconnected from human utility.

Lossy compression is different from the lossless compression objective of generative models, and neural networks are currently inefficient for lossy compression.

Architectures like Sparse Transformer may help adapt compute based on information content, addressing a current limitation where models spend uniform compute regardless of input complexity.

Biological systems exhibit sparse, non-uniform computation, suggesting potential benefits from architectures that can adaptively allocate compute resources.

For modalities like images and video, pixel-level modeling with current architectures is wasteful, but this limitation should not rule out the possibility of effective compression with better future architectures.

While compression research is important for theoretical foundations, the emergence of new capabilities enabled by better compression is ultimately the focus for driving AI progress.

Transcripts

00:02

hello everyone and welcome to episode 76

00:06

of the Stanford MLS seminar series

00:08

um today of course we're or this year

00:10

we're very excited to be partnered with

00:12

cs324 advances in Foundation models

00:15

um today I'm joined by Michael say hi

00:19

and Avanika

00:21

um and today our guest is Jack Rae from

00:24

openai and he's got a very exciting talk

00:26

uh prep for us about compression and AGI

00:30

um so so we're very excited to listen to

00:32

him as always if if you have questions

00:34

you can post them in YouTube chat or if

00:36

you're in the class there's that Discord

00:37

Channel

00:38

um so so to keep the questions coming

00:40

and after his talk we will we'll have a

00:42

great discussion

00:43

um so with that Jack take it away

00:47

okay fantastic thanks a lot

00:52

and right

00:56

okay so

00:58

um today I'm going to talk about

01:00

compression for AGI and the theme of

01:02

this talk is that I want people to kind

01:05

of think deeply about uh Foundation

01:09

models and their training objective and

01:12

think deeply about kind of what are we

01:14

doing why does it make sense what are

01:17

the limitations

01:18

um

01:19

this is quite a important topic at

01:22

present I think there's a huge amount of

01:25

interest in this area in Foundation

01:27

models large language models their

01:28

applications and a lot of it is driven

01:31

very reasonably just from this principle

01:33

that it works and it works so it's

01:34

interesting but if we just kind of sit

01:37

within the kind of it works realm it's

01:40

hard to necessarily predict or have a

01:43

good intuition of why it might work or

01:45

where it might go

01:48

so some takeaways that I want so I hope

01:50

people like people hopefully to take

01:52

from this talk some of them are

01:54

quite pragmatic so I'm going to talk

01:57

about some background on the minimum

01:58

description length and why it's seeking

02:01

the minimum description length of our

02:03

data may be an important role in solving

02:05

perception uh I want to make a

02:08

particular point that generative models

02:10

are actually lossless compressors and

02:12

specifically large language models are

02:15

actually state of the art lossless

02:16

compressors which may be a

02:19

counter-intuitive point to many people

02:20

given that they are very large and use a

02:23

lot of space and I'm going to unpack

02:25

that

02:26

in detail and then I'm also going to

02:29

kind of end on some notes of limitations

02:32

of the approach of compression

02:35

so

02:37

let's start with this background minimum

02:38

description length and why it relates to

02:40

perception so

02:42

even going right back to the kind of

02:44

ultimate goal of learning from data we

02:48

may have some set of observations that

02:50

we've collected some set of data that we

02:52

want to learn about which we consider

02:55

this small red circle

02:57

and we actually have a kind of a

03:00

two-pronged goal we want to learn like

03:02

uh how to kind of predict and understand

03:05

our observed data with the goal of

03:09

understanding and generalizing to a much

03:10

larger set of Universe of possible

03:12

observations so we can think of this as

03:16

if we wanted to learn from dialogue data

03:19

for example we may have a collection of

03:21

dialogue transcripts but we don't

03:23

actually care about only learning about

03:25

those particular dialogue transcripts we

03:27

want to then be able to generalize to

03:29

the superset of all possible valid

03:31

conversations that a model may come

03:33

across right so

03:36

what is an approach what is a very like

03:38

rigorous approach to trying to learn to

03:41

generalize well I mean this has been a

03:43

philosophical question for multiple

03:45

thousands of years

03:47

um

03:48

and even actually kind of fourth century

03:51

BC uh there's like some pretty good

03:53

um principles that philosophers are

03:56

thinking about so Aristotle had this

03:59

notion of

04:00

um

04:02

assuming the super superiority of the

04:04

demonstration which derives from fewer

04:06

postulates or hypotheses so this notion

04:09

of uh we have some

04:11

um

04:12

um simple set of hypotheses

04:15

um

04:16

then this is probably going to be a

04:18

superior description of a demonstration

04:21

now this kind of General kind of simpler

04:23

is better

04:25

um

04:26

theme is more recently attributed to

04:29

William of Ockham in the 14th century as Occam's Razor this

04:33

is something many people may have

04:34

encountered during a machine learning or

04:36

computer science class

04:38

he is essentially continuing on this

04:40

kind of philosophical theme the simplest

04:42

of several competing explanations is

04:44

always likely likely to be the correct

04:46

one

04:47

um now I think we can go even further

04:50

than this within machine learning I

04:52

think right now Occam's razor is almost

04:54

used to defend almost every possible

04:56

angle of research but I think one

04:58

actually very rigorous incarnation of

05:00

Occam's Razor is from Ray Solomonoff's

05:04

theory of inductive inference 1964. so

05:06

we're almost at the present day and he

05:08

says something quite concrete and

05:09

actually mathematically proven which is

05:11

that if you have a universe of data

05:13

which is generated by an algorithm and

05:15

observations of that universe so this is

05:17

the small red circle

05:19

encoded as a data set are best predicted

05:21

by the smallest executable Archive of

05:23

that data set so that says the smallest

05:25

lossless prediction or otherwise known

05:28

as the minimum description length so I

05:30

feel like that final one is actually

05:31

putting into mathematical and quite

05:33

concrete terms

05:34

um these kind of Notions that existed

05:37

through time in philosophy

05:38

and it kind of we could even relate this

05:40

to a pretty I feel like that is a quite

05:43

a concrete and actionable retort to this

05:46

kind of

05:47

um quite

05:48

um murky original philosophical question

05:51

but if we even apply this to a

05:52

well-known philosophical problem Searle's

05:54

Chinese room thought experiment where there's

05:57

this notion of a computer program or

05:58

even a person kind of with it within a

06:01

room that is going to perform

06:02

translation from English English to

06:05

Chinese and they're going to

06:07

specifically use a complete rulebook of

06:10

all possible

06:12

inputs or possible say English phrases

06:15

they receive and then and then the

06:16

corresponding say Chinese translation

06:18

and the original question is does this

06:20

person kind of understand how to perform

06:22

translation uh and I think actually this

06:24

compression argument this race on this

06:26

compression argument is going to give us

06:28

something quite concrete here so uh this

06:31

is kind of going back to the small red

06:32

circle large white circle if if we have

06:35

all possible translations and then we're

06:38

just following the rule book this is

06:39

kind of the least possible understanding

06:41

we can have of translation if we have

06:42

such a giant book of all possible

06:44

translations and it's quite intuitive if

06:46

we all we have to do is coin a new word

06:49

or have a new phrase or anything which

06:50

just doesn't actually fit in the

06:52

original book this system will

06:54

completely fail to translate because it

06:56

has the least possible understanding of

06:58

translation and it has the least

06:59

understandable version of translation

07:02

because that's the largest possible

07:03

representation of the the task the data

07:06

set however if we could make this

07:08

smaller maybe we kind of distill

07:12

sorry we distill this to a smaller set

07:13

of rules some grammar some basic

07:15

vocabulary and then we can execute this

07:17

program maybe such a system has a better

07:19

understanding of translation so we can

07:21

kind of grade it based on how compressed

07:23

this rulebook is and actually if we

07:24

could kind of compress it down to the

07:27

kind of minimum description like the

07:28

most compressed format the task we may

07:30

even argue such a system has the best

07:32

possible understanding of translation

07:35

um now for foundation models we

07:38

typically are in the realm where we're

07:39

talking about generator model one that

07:40

places probability on natural data and

07:43

what is quite nice is we can actually

07:44

characterize the lossless compression of

07:46

a data set using a generator model in a

07:48

very precise mathematical format so

07:51

Solomonoff says we should try and find

07:53

the minimum description length well we

07:55

can actually try and do this practically

07:57

with a generator model so the size the

08:00

lossless compression of our data set D

08:02

can be characterized as the negative log

08:05

likelihood from a generative model

08:06

evaluated over D plus the description

08:09

length of this generator model so for a

08:14

neural network we can think of this as

08:15

the amount of code to initialize the

08:17

neural network

08:18

that might actually be quite small

08:21

this is not actually something that

08:23

would be influenced by the size of the

08:24

neural network this would just be the

08:26

code to actually instantiate it so it

08:29

might be a couple hundred kilobytes to

08:31

actually Implement a code base which

08:32

trains a transformer for example and

08:35

actually this is quite a surprising fact

08:37

so what does this equation tell us does

08:40

it tell us anything new well I think it

08:42

tells us something quite profound the

08:44

first thing is we want to minimize this

08:46

general property and we can do it by two

08:48

ways one is via having a generative

08:51

model which has better and better

08:52

performance of our data set that is a

08:54

lower and lower negative log likelihood

08:55

but also we are going to account for the

08:58

prior information that we inject into F

09:01

which is that we can't stuff F full of

09:04

priors such that maybe it gets better

09:06

performance but overall it does not get

09:08

a better compression

09:10

um so

09:12

on that note yeah compression is a a

09:15

cool way of thinking about

09:17

how we should best model our data and

09:19

it's actually kind of a non-gameable

09:21

objective so contamination is a big

09:24

problem within uh machine learning and

09:27

trying to evaluate progress is often

09:29

hampered by Notions of whether or not

09:31

test sets are leaked into training sets

09:33

well with compression this is actually

09:36

not not something we can game so imagine

09:39

we pre-trained F on a whole data set D

09:42

such that it perfectly memorizes the

09:44

data set

09:45

AKA such that the probability of D is

09:48

one log probability is zero in such a

09:51

case if we go back to this formula the

09:53

first term will zip to zero

09:56

however now essentially by doing that by

09:58

injecting and pre-training our model on

10:01

this whole data set we have to add that

10:03

to the description length of our

10:04

generative model so now F not only

10:06

contains the code to train it Etc but it

10:08

also contains essentially a description

10:10

length of d

10:11

so in this setting essentially a

10:12

pre-contaminating f it does not help us

10:15

optimize the compression

10:18

and this contrasts to regular test set

10:20

benchmarking where we may be just

10:22

measuring test set performance and

10:24

hoping that measures generalization and

10:26

is essentially a proxy for compression

10:27

and it can be but also we can find lots

10:30

and lots of scenarios where we

10:31

essentially have variations of the test

10:33

set that have slipped through the net in

10:35

our training set and actually even right

10:37

now within Labs comparing large language

10:40

models this notion of contamination

10:42

affecting evals arises as a continual

10:45

kind of thorn um in in the side of

10:48

kind of clarity

10:49

Okay so we've talked about philosophical

10:52

backing of the minimum description

10:54

length and maybe why it's a sensible

10:56

objective

10:58

and now I'm going to talk about it

10:59

concretely for large language models and

11:01

we can kind of map this to any uh

11:04

generative model but I'm just going to

11:06

kind of ground it specifically in the

11:07

large language model so if we think

11:10

about what is the log prob of our

11:11

data D well it's the sum of our next

11:14

token prediction of tokens over our data

11:18

set

11:19

um

11:20

so this is something that's essentially

11:22

our training objective if we think of

11:24

our data set D

11:25

um and we have one Epoch then this is

11:28

the sum of all of our training loss so

11:30

it's pretty tangible term it's a real

11:31

thing we can measure and F is the

11:33

description length of our

11:35

Transformer language model uh and

11:38

actually there are people that have

11:39

implemented a Transformer and a training

11:41

regime just without any external

11:43

libraries in about I think 100 to 200

11:45

kilobytes so this is actually something

11:47

that's very small

11:49

um and and as I said I just want to

11:51

enunciate this this is something where

11:53

it's not dependent on the size of our

11:55

neural network so if a piece of code can

11:57

instantiate a 10 layer Transformer the

12:00

same piece of code you can just change a

12:02

few numbers in the code it can

12:03

instantiate a 1000 layer Transformer

12:05

actually the description length of our

12:07

initial Transformer is unaffected really

12:10

by how large the actual neural network

12:13

is we're going to go through an example

12:15

of actually using a language model to

12:16

losslessly compress where we're going to

12:18

see why this is the case

12:21

okay so let's just give like a specific

12:23

example and try and ground this out

12:25

further so okay llama it was a very cool

12:28

paper that came out from fair just like

12:29

late last week I was looking at the

12:32

paper here's some training curves

12:34

um now forgetting the smaller two models

12:37

there are the two largest models are

12:39

trained on one Epoch of their data set

12:41

so actually we could sum their training

12:43

losses uh AKA this quantity

12:47

and we can also roughly approximate the

12:50

size of of the um of the code base that

12:53

was used to train them

12:56

um and therefore we can see like okay

12:58

which of these two models the 33b or the

13:00

65b is the better compressor and

13:01

therefore which would we expect to be

13:03

the better model at generalizing and

13:05

having greater set of capabilities so

13:09

it's pretty it's going to be pretty

13:11

obvious at 65b I'll tell you why firstly

13:13

just to drum this point home these

13:16

models all have the same description

13:17

length they have different number of

13:18

parameters but the code that's used to

13:20

generate them is actually of same of the

13:23

same complexity however they don't have

13:25

the same integral of the training loss

13:28

65b has a smaller integral under its

13:31

training loss

13:32

and therefore if we plug if we sum these

13:35

two terms we would find that 65b

13:36

essentially creates the more concise

13:39

description of its training data set

13:42

okay so that might seem a little bit

13:43

weird I'm going to even plug some actual

13:44

numbers in let's say we assume it's

13:46

about one megabyte for the code to

13:48

instantiate and train the Transformer

13:50

and then if we actually just calculate

13:53

this roughly it looks to be about say

13:55

400 gigabytes

13:57

um

13:58

you have some of your log loss

13:59

converting into bits and then bytes it's

14:02

going to be something like 400 gigabytes

14:03

and this is from an original data set

14:06

which is about 5.6 terabytes of raw text

14:08

so 1.4 trillion tokens times four is

14:11

about 5.6 terabytes so that's a

14:13

compression rate of 14x

14:15

um the best text compressor on the

14:17

Hutter Prize is 8.7x so the takeaway of

14:20

this point is

14:21

um actually as we're scaling up and

14:24

we're creating more powerful models and

14:25

we're training them on more data we're

14:27

actually creating something which

14:29

actually is providing a lower and lower

14:31

lossless compression of our data even

14:34

though the intermediate model itself may

14:36

be very large

14:40

okay so now I've talked a bit about how

14:43

large language models are state of the

14:45

art lossless compressors but I just want

14:47

to maybe go through the mechanics of how

14:49

do we actually get a something like a

14:51

generative model literally losslessly

14:53

compress this may be something that's

14:55

quite mysterious like what is happening

14:57

like

14:57

when you actually losslessly compress

14:59

this thing is it the weights or is it

15:01

something else

15:02

so I'm going to give us a hypothetical

15:04

kind of scenario we have two people sat

15:07

here in Sundar Satya wants to send a

15:09

data set of the world's knowledge

15:10

encoded in D to Sundar they both have

15:13

access to very powerful supercomputers

15:15

but there's a low bandwidth connection

15:17

we are going to use a trick called

15:19

arithmetic encoding as a way of

15:22

communicating the data set so say we

15:24

have a token x a timestep t from of some

15:27

vocab and a probability distribution p

15:29

over tokens

15:31

arithmetic encoding without going into

15:33

the nuts and bolts is a way of allowing

15:35

us to map our token x given our

15:38

probability distribution over tokens to

15:41

some Z

15:43

where Z is essentially our compressed

15:46

transcripts of data and Z is going to

15:49

use exactly minus log 2 p_t(x_t) bits so

15:54

the point of this step is like

15:58

arithmetic encoding actually Maps it to

16:00

some kind of like floating Point number

16:01

as it turns out and it's a real

16:04

algorithm this is like something that

16:05

exists in the real world it does require

16:08

technically infinite Precision to to use

16:10

exactly these number of bits and

16:12

otherwise you maybe you're going to pay

16:14

a small cost for implementation but it's

16:16

roughly approximately optimal in terms

16:19

of the encoding and we can use

16:20

arithmetic decoding

16:22

um to take this encoded transcript and

16:25

as long as we have our probability

16:26

distribution of tokens we can then

16:28

recover the original token so we can

16:30

think about probability probability

16:32

distribution as kind of like a key it

16:34

can allow us to kind of lock in a

16:36

compressed copy of our token and then

16:38

unlock it

16:39

so if p is uniform so there's no

16:42

information about our tokens then this

16:45

would be this one over v p is just one

16:47

over the size of V so we can use log 2 V

16:49

bits of space uh that is just

16:52

essentially the same as naively storing

16:53

in binary uh our our XT token if p is an

16:58

oracle so it knows like exactly what the

17:00

token was going to be

17:01

so P of x equals one then log 2p equals

17:05

zero and this uses zero space so these

17:08

are the two extremes and obviously what

17:10

we want is a generative model which

17:11

better and better models our data and

17:13

therefore it uses less space

17:15

so what would actually happen in

17:17

practice if Satya can take his data set

17:20

of tokens train a Transformer and get a

17:23

subsequent set of probabilities uh over

17:27

the tokens like so next token prediction

17:29

and then use arithmetic encoding to map

17:32

it to this list of transcripts and this

17:34

is going to be of size sum of negative

17:37

log likelihood of your Transformer over

17:39

the data set

17:40

and he's also going to send he's going

17:42

to send that list of transcripts and

17:44

some code that can deterministically

17:46

train a larger Transformer

17:48

and so

17:49

he sends those two things what does that

17:52

equal in practice the size of f the size

17:54

of your generator model description plus

17:57

the size of your some of your negative

17:59

log likelihood of your data set so as

18:02

you can see it doesn't matter whether

18:04

the Transformer was one billion

18:06

parameters one trillion parameters

18:09

plus plus he's not actually sending the

18:12

neural network he's sending the

18:13

transcript of encoded logits plus the

18:17

code

18:18

and then on the other side Sundar can

18:20

run this code which is deterministic and

18:22

the mod is going to run the neural

18:24

network it gives a probability

18:25

distribution to the first token he's

18:27

going to use arithmetic decoding with

18:29

that to get his first token you can

18:31

either train on that or whatever the

18:32

code does so then continue on

18:35

predict the next token etc etc and

18:37

essentially

18:39

iteratively go through and recover the

18:41

whole data set

18:42

um so this is kind of like almost a

18:44

thought experiment because in practice to

18:46

send this data at 14x

18:48

compression say if we're talking about

18:50

the Llama model uh that's it's a bit

18:52

more compressed than gzip but this is

18:54

requiring a huge amount of intermediate

18:56

compute which is to train a large

18:58

language model which feels prohibitive

19:00

but this thought experiment is really

19:02

derived not because we actually might

19:04

want to send data on a smaller and

19:07

smaller bandwidth it's also just derived

19:09

to kind of explain and prove why we can

19:12

actually losslessly compress with

19:14

language models and why that is their

19:16

actual objective

19:18

um and if this kind of setup feels a

19:21

little bit contrived well the fun fact

19:23

is this is the exact setup that Claude

19:25

Shannon was thinking about

19:26

um when he kind of proposed language

19:28

models in the 40s he was thinking about

19:30

having a discrete set of data and how

19:33

can we better communicate it

19:35

over a low bandwidth Channel and

19:37

language models and entropy coding

19:39

essentially was the topic that he was

19:41

thinking about at Bell Labs

19:46

Okay so we've talked mechanically about

19:48

well we've talked about the philosophy

19:50

of kind of why do why why be interested

19:53

in description length relating it to

19:55

generalization talks about why

19:57

generative models are lossless

19:59

compressors talked about why our current

20:02

large language models are actually

20:03

state-of-the-art lossless compressors

20:05

and are providing some of the most

20:07

compressed representations of our source

20:09

data so let's just think about solving

20:12

perception and moving towards AGI what's

20:14

the recipe well it's kind of a two-step

20:16

process one is collect all useful

20:19

perceptual information that we want to

20:21

understand and the second is learn to

20:23

compress it as best as possible with a

20:25

powerful Foundation model

20:26

so the nice thing about this is it's not

20:29

constrained to a particular angle for

20:32

example you can use any research method

20:34

that improves compression and I would

20:36

posit that this will further Advance our

20:38

capabilities towards perception based on

20:41

this rigorous foundation so that might

20:43

be a better architecture it may be scale

20:45

further scaling of data and computes

20:48

this is in fact something that's almost

20:49

become a meme people say scale is all

20:52

you need but truly I think scale is only

20:56

going to benefit as long as it is

20:57

continuing to significantly improve

21:00

compression but you could any use any

21:02

other technique and this doesn't have to

21:04

be just a regular generative model it

21:06

could even we could even maybe spend a

21:08

few more bits on the description length

21:10

of F and add in some tools add in things

21:12

like a calculator allow it to make use

21:15

of tools to better predict its data

21:16

allow it to retrieve over the past use

21:19

its own synthetic data to generate and

21:21

then learn better there's many many

21:22

angles we could think about that are

21:25

within the scope of a model

21:27

better better compressing it Source data

21:29

to generalize over the universe of

21:30

possible observations

21:33

I just want to remark at this point on a

21:36

very common point of confusion on this

21:38

topic which is about lossy compression

21:40

so I think it's a very reasonable

21:43

um

21:44

thought to maybe confuse what a neural

21:47

network is doing with lossy compression

21:49

especially because

21:51

information naturally seeps in from the

21:54

source training data into the weights of

21:56

a neural network and neural network can

21:58

often memorize it often does memorize

21:59

and can repeat many things that it's

22:01

seen but it doesn't repeat everything

22:03

perfectly so it's lossy and it's also

22:05

kind of a terrible lossy compression

22:07

algorithm so if in the velocity

22:09

compression case you would actually be

22:12

transmitting the weights of the

22:14

parameters of a neural network and they

22:16

can often actually be larger than your

22:17

Source data so I think there's a very

22:19

interesting New Yorker article about

22:21

about this kind of Topic in general kind

22:23

of thinking about you know what are what

22:25

are language models doing what are

22:26

Foundation models doing and I think

22:28

there's a lot of confusion in this

22:30

article specifically on this topic where

22:32

from the perspective of lossy

22:35

compression

22:36

and neural network feels very kind of

22:38

sub-optimal it's losing information

22:40

right so it doesn't even do reconstruction

22:42

very well and it's potentially bloated

22:44

and larger and has all these other

22:46

properties

22:47

I just wanted to take this kind of

22:49

point to reflect

22:51

on the original goal which is we really

22:53

care about understanding and

22:55

generalizing to the space of the

22:57

universe of possible observations so we

22:59

don't care and we don't train towards

23:01

reconstructing our original data

23:04

um I think if we did then this article

23:08

basically concludes like if we did just

23:10

care about reconstructing this original

23:11

data like why do we even train over it

23:13

why not just keep the original data as

23:15

it is and I think that's a very valid

23:16

point uh but if we care instead about

23:19

loss like a lossless compression of this

23:22

then essentially this talk is about

23:25

linking that to this wider problem of

23:27

generalizing to many many different

23:29

types of unseen data

23:34

great so I've talked about

23:37

the mechanics of compression with

23:40

language models and linking it to this

23:42

confusion of lossy compression what

23:45

are some limitations that I think are

23:46

pretty valid

23:48

um so I think

23:50

there's one concern with this approach

23:52

which is that it may be just the right

23:55

thing to do or like an unbiased kind of

23:58

attempt at solving perception but maybe

24:00

it's just not very pragmatic and

24:03

actually trying to kind of model

24:04

everything and compress everything it

24:06

may be kind of correct but very

24:07

inefficient so I think Image level

24:09

modeling is a good example of this where

24:12

modeling a whole image at the pixel

24:14

level has often kind of been

24:16

prohibitively expensive to like work

24:18

incredibly well and therefore people

24:21

have changed the objective or or ended

24:23

up modeling a slightly

24:25

more semantic level

24:28

um and I think even if it maybe seems

24:31

plausible now we can go back to pixel

24:32

level image modeling and maybe we just

24:34

need to tweak the architecture if we

24:35

turn this to video modeling every pixel

24:37

of every frame it really feels

24:39

preemptively crazy and expensive so one

24:42

limitation is you know maybe we do need

24:44

to kind of first filter like what are

24:46

what are all the pieces of information

24:47

that we know we definitely are still

24:49

keeping and we want to model but then

24:51

try and have some way like filtering out

24:53

the extraneous communicate computation

24:55

the the kind of bits of information we

24:57

just don't need and then maybe we can

24:59

then filter out to a much smaller subset

25:01

and then and then we losslessly compress

25:03

that

25:04

um

25:05

another very valid point is I think this

25:08

is often framed uh to people that maybe

25:11

are thinking that this is like the only

25:13

ingredient for AGI is that crucially

25:15

there's lots of just very useful

25:17

information in the world that is not

25:18

observable and therefore we can't just

25:21

expect to compress all observable

25:24

observations achieve AGI because

25:26

there'll just be lots of things we're

25:27

missing out

25:28

um so I think a good example of this

25:30

would be something like Alpha zero so

25:33

playing the game of Go

25:35

um

25:36

I think if you just observe the limited

25:38

number of human games that have ever

25:40

existed one thing that you're missing is

25:42

all of the intermediate search trees of

25:44

all of these expert players and one nice

25:46

thing about something like Alpha zero

25:47

with its kind of self-play mechanism is

25:49

you essentially get to collect lots of

25:51

data of intermediate search trees of

25:53

many many different types of games

25:55

um so that kind of on policy behavior of

25:57

like actually having an agent that can

25:59

act and then Source out the kind of data

26:00

that it needs I think is still very

26:02

important so and in no way kind of

26:04

diminishing uh the importance of RL or

26:06

on policy kind of behavior

26:09

um but I think yeah for for everything

26:11

that we can observe

26:13

um that this is kind of like the

26:15

compression story ideally applies

26:19

great so going to conclusions

26:22

um

26:24

so compression is a has been a objective

26:28

that actually we are generally striving

26:30

towards as we build better and larger

26:32

models which may be counter-intuitive

26:34

given the models themselves can be very

26:36

large

26:37

um

26:38

the most known entity right now the one

26:41

on a lot of people's minds to better

26:43

compression is actually scale scaling

26:45

compute

26:46

um and and maybe even scaling memory but

26:49

scale isn't all you need there are many

26:51

algorithmic advances out there that I

26:54

think very interesting research problems

26:55

and

26:57

and if we look back uh basically all of

27:00

the major language modeling advances

27:02

have been synonymous with far greater

27:04

text compression so even going back from

27:07

uh the creation of n-gram models on pen

27:10

and paper and then kind of bringing them

27:12

into computers and then having like kind

27:14

of computerized huge tables of n-gram

27:16

statistics of language this kind of

27:18

opened up the ability for us to do

27:21

um things like speech to text with a

27:23

reasonable accuracy

27:25

um bringing that system to uh deep

27:29

learning via rnns has allowed us to have

27:32

much more fluent text that can span

27:34

paragraphs and then actually be

27:35

applicable to tasks like translation and

27:39

then in the recent era of large-scale

27:41

Transformers we're able to further

27:43

extend the context and extend the model

27:46

capabilities via compute such that we

27:50

are now in this place where we're able

27:52

to use

27:53

language models and Foundation models in

27:55

general

27:57

um to understand very very long spans of

27:59

text and to be able to create incredibly

28:01

useful or incredibly tailored incredibly

28:03

interesting

28:04

um Generations so I think this is going

28:07

to extend but it's a big and interesting

28:10

open problem uh what are going to be the

28:12

advances to kind of give us further

28:15

Paradigm shifts in this kind of

28:16

compression uh improved compression

28:21

right so

28:22

um yeah this talk is generally just a

28:24

rehash for the message of

28:26

former and current colleagues of mine

28:27

especially Marcus Hutter Alex Graves Joel

28:30

Veness so I just want to acknowledge

28:32

them and uh thanks a lot for listening

28:34

I'm looking forward to uh chatting about

28:36

some questions

28:38

great thanks so much Jack

28:41

um I'm actually going to ask you to keep

28:42

your slides on the screen because I

28:44

think we had some uh questions about uh

28:48

just kind of uh understanding the

28:51

um some some of the mathematical

28:53

statements in the talk so I think it

28:55

would be helpful to to kind of go go

28:56

back over some of the slides yeah I

29:00

think uh some people were confused a bit

29:02

by the arithmetic decoding

29:05

um so in particular uh maybe it'll be

29:07

useful to to go back to discussion of

29:09

the arithmetic decoding and uh I think

29:11

people are a bit confused about

29:13

um how is it possible for the receiver

29:16

to decode the message and get the

29:19

original data set back without having

29:21

access to the trained model

29:23

yeah

29:25

um well okay

29:27

um I'll do in two steps so one let's

29:30

just imagine they don't have the fully

29:32

trained model that they have a partially

29:33

trained model

29:35

and so they are able to get a next token

29:37

prediction

29:38

and then

29:40

um

29:40

they have the the receiver also has some

29:44

of the encoded transcripts at T this

29:46

allows them I guess maybe here in the

29:49

case of language modeling this would

29:51

look like XT plus one say if it was like

29:52

PT Plus one but anyway

29:54

um this may allow them to recover the

29:57

next token and then they're going to

29:59

build it up in this way so maybe I'll

30:01

just delay on this particular Slide the

30:04

idea it would look like is we we the

30:06

receiver does not receive the neural

30:08

network it just receives the code to

30:09

instantiate kind of the fresh neural

30:11

network and run the identical training

30:14

setup that it saw before and obviously

30:16

the training setup as it saw before

30:18

we're going to imagine like batch size

30:19

of one one token at a time just for

30:21

Simplicity so uh and let's just imagine

30:24

maybe there's like a beginning of text

30:27

token here first so

30:29

so the receiver so now he just has to

30:31

run the code at first there's nothing to

30:33

decode yet there's no tokens and there's

30:35

a fresh neural network uh that's going

30:37

to give us like a probability

30:39

distribution for the first token and so

30:41

he's got this probability distribution

30:43

for the first token and he's got the

30:44

transcript

30:46

um of what that token should be and you

30:48

can use arithmetic decoding to actually

30:49

recover that first token

30:51

and then let's imagine for Simplicity we

30:54

actually like train like one SGD step on

30:56

one token at a time so we take our SGD

30:58

step and then we have the model that's

31:01

like was used to predict the next token

31:03

so we can get that P2 we have Z2 and

31:06

then we can recover X2 so now we've

31:09

recovered two tokens and we can

31:10

essentially do this iteratively

31:12

essentially reproduce this whole

31:15

training procedure on the receiving side

31:17

and dur as we reproduce the whole

31:19

training procedure we actually recover

31:21

the whole data set

31:23

yeah so it's a crazy expensive way of

31:27

actually encrypt like uh compressing

31:30

data and it might feel once again like

31:32

oh but since we're not going to

31:34

literally do that it's too expensive why

31:36

do I need to learn about it and this

31:38

really is just a way of it's like a

31:41

proof by Construction in case

31:44

um you were like you know is this

31:46

actually true like is the lossless

31:48

compressed D actually equal to this and

31:50

it's like yeah like here's how we

31:51

literally can do it and it's just the

31:53

reason we don't do it in practice is

31:54

because it would be very expensive but

31:56

there's nothing actually stopping us

31:57

it's not like completely theoretical

31:59

idea yeah

32:02

okay so all right so to kind of maybe

32:06

I'll try to explain it back to you and

32:08

then um if people on the chat and the uh

32:12

Discord shell of questions

32:14

um they they can ask and then we can we

32:16

can get some clarifications so basically

32:18

you're saying you initialize a model

32:21

um you have it do like some beginning of

32:24

token thing and it'll predict what what

32:26

it thinks the first uh what the first

32:29

token should be

32:30

um and then you use arithmetic encoding

32:33

to somehow say okay here's the here's

32:35

the prediction and then we're going to

32:37

correct it to the the actual what the

32:39

actual token is so that Z1 has enough

32:42

information to figure out what that

32:44

actual first token is yeah and then you

32:46

use that first token run one step of SGD

32:49

predict you know get the probability

32:51

distribution for the second one now you

32:54

have enough information to decode uh the

32:57

the second thing like maybe

32:59

you know uh yeah uh it's like take the

33:03

ARG Max but you know take the the third

33:05

arg max or something like that

33:08

um and then so you're saying that that

33:10

is enough information to reconstruct the

33:13

the data set D exactly yeah

33:17

okay great great so uh yeah so I I

33:21

personally you know I understand a bit

33:23

better now and that that also makes

33:24

sense why the model

33:26

um you know the the model weights and

33:28

the the size of the model are not uh

33:31

actually part of that that compression

33:34

um one question that that I also had

33:36

while

33:38

um you know uh talking through that

33:40

explanation so how does that you know

33:43

compression now go back and uh how's

33:46

that related to the loss curve that you

33:48

get

33:49

um at the end of training is it that the

33:52

better your model is by the end of

33:53

training then you need to communicate

33:54

less information just like I don't know

33:56

take art Max or something like that so I

33:58

just want to say yeah like this is a

34:00

Formula if we look at this this is

34:02

basically pretty much the size of your

34:04

arithmetic encoded transcript

34:07

and this is you like your the log

34:09

negative log likelihood of your next

34:10

token prediction at every step so let's

34:13

just imagine this was batch size one

34:15

this is literally the sum

34:18

of every single training loss point

34:20

because it and the summing under a curve

34:23

this is like the integral into the Curve

34:26

so this

34:27

this value equals this and I did I did

34:30

it just by summing under this curve so

34:32

it's like a completely real quantity you

34:34

get you actually even are getting from

34:37

your training curve

34:38

so it's a little bit different to just

34:40

the final training loss it's the

34:42

integral during the whole training

34:43

procedure

34:46

great so okay and then yeah

34:49

we can think of during training we're

34:51

going along and let's imagine we're in

34:53

the one Epoch scenario we're going along

34:55

and then every single step we're

34:56

essentially get a new kind of out of uh

34:59

out of sample like a new

35:02

sequence to try and predict and then all

35:04

we care about is trying to predict that

35:06

as best as possible and then continuing

35:08

that process and actually what we care

35:10

about is essentially all predictions

35:12

equally and trying to get the whole

35:14

thing to learn like either faster

35:15

initially and then to a lower value or

35:18

however we want we just want to minimize

35:20

this integral and basically what this

35:22

formula says it can minimize this

35:23

integral we should get something that's

35:24

essentially better and better

35:26

understands uh the data or at least

35:28

generalizes better and better

35:31

gotcha okay cool

35:34

um all right so uh let me see I think

35:36

now is a good time to end the screen

35:38

share

35:39

great okay cool

35:41

um and now uh we can go to to some more

35:44

questions uh in the in the class so

35:47

there there were a couple questions

35:48

around

35:50

um kind of uh what does this compression

35:53

uh Viewpoint allow you to do so there's

35:56

a couple questions on so has this mdl

35:59

perspective kind of

36:01

um informed the ways that you would that

36:03

we train models now or any of the

36:05

architectures that we've done now yeah

36:07

can I I think the most like immediate

36:09

one is that it clarifies a long-standing

36:12

point of confusion even within the

36:14

academic Community which is

36:16

um people don't really understand why a

36:19

larger model that seems to even

36:22

um

36:22

like why should it not be the case

36:25

that's smaller neural network less

36:26

parameters generalizes better I think

36:28

people have taken

36:30

um

36:31

like principles from like when they

36:33

study linear models and they were

36:34

regularized to have like less parameters

36:36

and there was some bounds like VC bounds

36:39

on

36:40

um

36:41

generalization and there was this

36:43

General notion of like less parameters

36:44

is what Occam's razor refers to

36:47

um one perspective this helps is a like

36:50

I think it frees up our mind of like

36:51

what is the actual objective that we

36:53

should expect to optimize towards that

36:56

will actually get us the thing we want

36:57

which is better generalization so for me

37:00

that's the most important one even on

37:02

Twitter I see it like professors in

37:05

machine learning occasionally you'll see

37:07

like they'll say some like smaller

37:08

models are more intelligent than larger

37:10

models kind of it's kind of almost like

37:12

a weird

37:13

um

37:14

um Motif that is not very rigorous so I

37:17

think one thing that's useful about this

37:19

argument is there's a pretty like

37:22

like strong like mathematical link all

37:24

the way down it goes like it starts at

37:26

Solomonoff's theory of induction which is

37:28

proven and then we have like a actual

37:31

mathematical link to an objective and

37:35

then

37:36

yeah it kind of like to lossless

37:38

compression and then it all kind of

37:39

links up so

37:41

um yeah I think another example would

37:43

even be like this this very I think it's

37:45

a great article but like the Ted Chang

37:46

article on uh lossless compression which

37:49

people haven't read I still recommend

37:50

reading I think

37:52

once you're not quite in a world where

37:54

like you have like a well-justified uh

37:57

motivation for doing something then

37:59

there's like lots of kind of confusion

38:01

about whether or not this whole approach

38:03

is even reasonable

38:04

um yeah so I think for me a lot of it's

38:07

about guidance but then on a more

38:09

practical level

38:10

um there are things that you can do that

38:11

would essentially kind of break uh you

38:13

would stop doing compression and you

38:15

might not notice it and then I think

38:17

this also guides you to like not do that

38:19

and I'll give you one example which is

38:21

something I've worked on personally

38:22

which is retrieval so for retrieval

38:24

augmented language models you can maybe

38:26

retrieve your whole training set and

38:28

then use that to try and improve your

38:30

predictions as you're going through now

38:32

if we think about compression one thing

38:34

that you can't do one thing that would

38:35

essentially cheating would be allow

38:37

yourself to retrieve over like future

38:39

tokens that you have not seen yet

38:41

um if you do that it's obvious like um

38:44

it might not be obvious immediately

38:45

because it was a tricky setup but in my

38:47

kind of like Satya Sundar encoding

38:50

decoding setup if you had some system

38:52

which can look to the Future that just

38:53

like won't work with that encoding

38:55

decoding setup and it also essentially

38:58

is cheating and

39:00

um

39:01

yeah so I think

39:02

essentially it's something which would

39:04

it could help your like test set

39:06

performance it might even make your

39:07

training loss look smaller but it

39:09

actually didn't improve your compression

39:11

and potentially you could fool yourself

39:12

into

39:14

um into like expecting a much larger

39:16

performance Improvement than you end up

39:17

getting in practice so I think sometimes

39:20

like you can help yourself

39:22

try and like set yourself up for

39:23

something that should actually

39:24

generalize better and do better on

39:25

Downstream evals than

39:28

um by kind of like thinking about this

39:31

kind of training objective

39:33

I see it also probably informs the type

39:37

of architectures you want to try because

39:39

if you're uh I think that that comments

39:42

about like the size of the code being

39:43

important is was really interesting

39:45

because if you need you know 17

39:47

different layers and every other uh and

39:50

every other a different module in every

39:52

layer or something that that kind of

39:54

increases the amount of information that

39:56

you need to communicate over

39:59

um yeah yeah

40:02

um it can be I could imagine on that

40:04

note like right now our setup is

40:07

essentially the prior information we put

40:09

into neural networks it's actually kind

40:10

of minuscule really and obviously

40:13

um with biological beings we have like

40:16

DNA we have like prior as like kind of

40:18

stored information which is is at least

40:20

larger than really what um the kind of

40:23

prize that we put into um

40:25

and neural networks I mean one thing

40:27

when I was first going through this I

40:29

was thinking maybe there should be more

40:31

kind of learned information that we

40:32

transfer between neural networks more of

40:35

a kind of like DNA

40:37

um and maybe like I mean we initialize

40:39

neural networks right now essentially

40:40

like gaussian noise with some a few

40:42

properties but like maybe if there was

40:44

some kind of like learned initialization

40:45

that we distill over many many different

40:46

types of ways of training neural

40:47

networks that wouldn't add to our size

40:50

of f too much but it might like mean

40:51

learning is just much faster so yeah

40:53

hopefully also the perspective might

40:55

like kind of spring out kind of

40:56

different and unique and creative like

40:58

themes of research

41:01

okay

41:02

um there there's another interesting

41:04

question from the class about the uses

41:06

of this kind of compression angle

41:09

um and the question is uh could could

41:12

the compression be good in some way by

41:14

allowing us to gain like what sorts of

41:16

higher level understanding or Focus

41:18

um on the important signal in the data

41:20

might we be able to get from the

41:23

um uh from from the lossy compression so

41:26

if we could like for example better

41:28

control the information being lost would

41:30

that allow us to gain any sort of higher

41:32

level understanding

41:34

um about kind of what what's important

41:35

in the data

41:38

um

41:40

so I think

41:44

that there is like a theme of research

41:46

trying to

41:48

um use essentially just like

41:51

the compressibility of data as at least

41:54

as a proxy for like quality

41:56

so that's one like very concrete theme

41:58

uh like

42:00

I mean this is pretty standard

42:02

pre-processing trick but

42:04

if your like data is just uncompressible

42:06

with a very simple text compressor like

42:08

gzip as a data preprocessing tool then

42:11

maybe it's just like kind of random

42:12

noise and maybe you don't want to spend

42:13

any compute training a large

42:15

Foundation model over it
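A minimal, runnable sketch of that pre-processing trick, using a cheap compressor's ratio as a rough noise filter (the 0.5 cutoff and the sample documents are arbitrary illustrations, not values from the talk):

```python
import random
import string
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size / raw size under a cheap general-purpose compressor."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 6)) / max(len(raw), 1)

natural = "the cat sat on the mat and looked out of the window " * 40
noise = "".join(random.choices(string.printable, k=2000))

# Higher ratio means less compressible; near-incompressible documents look like
# random noise and may not be worth spending pretraining compute on.
keep = [d for d in (natural, noise) if compression_ratio(d) < 0.5]
```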

42:18

similarly I think there's been

42:19

pieces of work there's a paper from 2010

42:22

that was like intelligent selection of

42:23

language model training data or

42:25

something by Moore and Lewis and in that

42:27

one they look at

42:29

um they're trying to like select

42:30

training data that will be maximally

42:32

useful

42:33

um

42:33

for some Downstream tasks and

42:35

essentially what they do is they look at

42:37

like what data is best compressed

42:41

um when going from just like a regular

42:44

pre-trained language model to one that's

42:46

been specialized on that Downstream task

42:48

and they use that as a metric for data

42:49

selection they found that's like a very

42:50

good way of like selecting your data if

42:53

you just care about

42:56

training on a subset of your

42:57

pre-training data for a given Downstream

42:59

task so I think there's been some yeah

43:02

some kind of

43:03

sign of life in that area
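For reference, the selection criterion in that line of work is a cross-entropy difference; a rough sketch of the idea is below (the `logprob` method on the two language models is an assumed interface for illustration, not a real library call):

```python
def cross_entropy_difference(tokens, in_domain_lm, general_lm):
    """Moore & Lewis (2010)-style score: per-token cross-entropy under an
    in-domain model minus that under a general model. Lower means the text is
    compressed much better by the specialized model, i.e. likely more useful
    for the downstream task."""
    n = max(len(tokens), 1)
    h_in = -sum(in_domain_lm.logprob(t, tokens[:i]) for i, t in enumerate(tokens)) / n
    h_gen = -sum(general_lm.logprob(t, tokens[:i]) for i, t in enumerate(tokens)) / n
    return h_in - h_gen

# Hypothetical usage: keep the k documents with the lowest score.
# selected = sorted(pool, key=lambda s: cross_entropy_difference(s, domain_lm, base_lm))[:k]
```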

43:08

um so uh one interesting question from

43:12

uh from the class

43:14

um so uh kind of related to uh I guess

43:18

how we code the models versus how

43:21

they're actually executed yeah

43:23

um so uh so obviously when we write our

43:26

python code especially you know in

43:27

PyTorch it all gets compiled down to

43:29

like CUDA kernels and whatnot

43:32

um so how does that kind of like affect

43:34

uh your your understanding of how like

43:39

how much information is actually like in

43:41

the in these code like do you have to

43:43

take into account like the 17 different

43:45

CUDA kernels that you're running through

43:46

throughout the year yeah

43:48

this is a great question uh so um I

43:50

actually oh yeah I should have mentioned

43:52

that in the talk but basically I do have

43:54

a link in the slides if the slides

43:55

eventually get shared there is a link

43:56

but I am basing

43:59

um what was quite convenient was there

44:01

is a Transformer codebase called nncp

44:04

which is like no dependencies on

44:05

anything it's just like a I think a

44:08

single C++

44:10

self-contained Library which builds a

44:12

Transformer and trains it and has a few

44:14

tricks in it like it has dropout has

44:16

like data shuffling things and that is

44:18

like 200 kilobytes like whole

44:20

self-contained so that is a good like

44:23

I'm using that as a bit of a proxy

44:25

obviously the size of f is kind of

44:28

hard to

44:29

know for sure

44:31

um it's easy to overestimate like if you

44:34

um packaged up your like python code

44:37

like and you're using pi torch or

44:39

tensorflow it's going to import all

44:40

these libraries which aren't actually

44:41

relevant you'll you might have like

44:43

something really big you might have like

44:44

hundreds of megabytes or a gigabyte of

44:46

all this like packaged stuff together

44:48

and you might think oh therefore the

44:51

description of my Transformer is actually

44:52

like you know hundreds of megabytes so

44:54

I'm just it was convenient that someone

44:56

specifically tried to

44:59

um find out how small we can make this

45:00

and they did it by building it

45:03

um from scratch eventually
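As a back-of-the-envelope version of that accounting (all numbers here are illustrative assumptions, not figures from the talk): the transmitted description is the self-contained training code plus the summed log loss over the data, and at modern dataset scales a ~200 KB code term is essentially a rounding error.

```python
code_size_bits = 200 * 1024 * 8     # ~200 KB self-contained trainer (nncp-style)
dataset_tokens = 100e9              # assumed corpus size
mean_log_loss = 0.8                 # assumed average -log2 p(token) during training

total_bits = code_size_bits + dataset_tokens * mean_log_loss
print(total_bits / 8 / 1e9, "GB transmitted")  # code contributes only ~0.0002 GB
```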

45:05

cool

45:07

um we also had a question about the

45:09

hutter prize

45:10

um which I believe you you had something

45:12

in your side so the question is uh so it

45:15

appears that our largest language models

45:17

can now compress things better than

45:19

um than the best Hutter

45:21

Prize so your question is is this

45:23

challenge still relevant

45:25

um yeah could you actually use the

45:27

algorithm that you suggest

45:29

um for the Hutter Prize yeah I'll

45:31

tell you exactly

45:32

um I mean this is something I've talked

45:33

with Marcus Hutter about the Hutter

45:35

Prize is like actually asking people

45:37

to do exactly the right thing but the

45:38

main issue was it was focused on

45:41

compressing quite a small amount of data

45:43

and that amount of data was

45:45

fixed 100 megabytes now a lot of this

45:48

kind of perceptual roadmap is like

45:49

there's been a huge amount of benefit in

45:51

increasing

45:53

the amount of data and compute

45:55

simultaneously

45:57

um and by doing that we're

46:00

able to like continue like this training

46:02

loss curve is like getting lower you're

46:03

like

46:04

um your compression rates improving so

46:08

I would say the prize itself has not

46:11

um has just not been fruitful in like

46:14

actually promoting compression and

46:17

instead what ended up being the

46:18

breakthrough was kind of like BERT slash

46:21

GPT-2 which I think

46:24

um it's steered people to the benefit of

46:26

simultaneously essentially adopting this

46:28

workflow without necessarily naming it

46:30

compression

46:32

um I think yeah I think the Benchmark

46:33

just due to the compute limitations it

46:36

also requires it's very like outdated

46:37

something like a maximum of

46:40

100 CPUs or something for like 48 hours

46:43

so I think essentially it didn't end up

46:45

creating an amazing like AI algorithm

46:47

but it was just because it really

46:49

underestimated the benefit of compute

46:51

like compute memory all that stuff it

46:54

turns out that's a big part of the story

46:56

of building powerful models so does that

47:00

reveal something about our current large

47:02

data sets that you kind of need to see

47:04

all this data before you can start

47:06

compressing the rest of it well yeah I

47:09

think well the cool thing is like

47:12

because the compression is the integral of the training loss

47:14

in theory if you could have some

47:15

algorithm which you could learn faster

47:16

like initially that would actually have

47:19

better compression and it would be

47:21

something that you would expect it as a

47:24

result therefore that would suggest it

47:25

would kind of be a more intelligent

47:26

system and yeah I think like having

47:28

better data efficiency

47:30

is something we should really think

47:33

about strongly and I think there's

47:34

actually quite a lot of potential core

47:36

research to try and learn more from less

47:39

data uh and right now we're in

47:42

especially a lot of the big Labs I mean

47:44

there's a lot of data out there to

47:46

kind of collect so I think maybe people

47:48

have just prioritized for now like oh it

47:50

feels like it's almost kind of

47:51

like an endless real data so we just

47:53

keep adding more data but then I think

47:55

there's without a doubt going to be a

47:56

lot more research focused on making more

47:58

of the data that we have

48:00

right

48:01

I wonder if you can speculate a little

48:03

bit about what this starts to look like

48:05

in I guess images and video I think you

48:09

had a slider or two at the end where

48:12

um well like as you mentioned that uh if

48:15

your data is not super g-zippable

48:18

um then that maybe there's a lot of

48:20

noise and uh I believe

48:22

um and my intuition may be wrong but

48:24

I believe that images and or certainly

48:28

images they appear to be a lot

48:30

larger a lot bigger than

48:33

um than text so they don't have

48:36

these properties I've got a few useful

48:38

thoughts on this okay so one is we

48:41

currently have a huge limitation in our

48:42

architecture which is a Transformer or

48:45

even just like a deep convnet and that

48:47

is that the architecture does not adapt

48:50

in any way to the information content of

48:53

its inputs so what I mean by that is if

48:56

you have

48:57


48:58

um

48:59

even if we have a byte-level sequence of

49:03

Text data but we just represent it as

49:04

the bytes of UTF-8 and then instead we

49:07

have a BPE-tokenized sequence and it

49:09

contains the exact same information but

49:11

it's just 4X shorter sequence length uh

49:14

the Transformer will just spend four

49:17

times more compute on the byte level

49:19

sequence if it was fed it and it'll

49:21

spend four times less on the BPE

49:23

sequence if it was fed that even though

49:24

they have the same information content

49:26

so we don't have some kind of algorithm

49:28

which could like kind of fan out and

49:31

then just like process the byte level

49:33

sequence with the same amount of

49:35

approximate compute
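A small runnable illustration of that limitation (the 4x BPE compression factor is an assumption that is roughly typical for English text, and the cost model is deliberately crude):

```python
text = "the same information content represented two different ways " * 100

byte_len = len(text.encode("utf-8"))   # byte-level sequence length
bpe_len = byte_len // 4                # assume BPE shortens the sequence ~4x

# A vanilla Transformer's cost is tied to sequence length, not information content:
# roughly linear in length for the MLP blocks and quadratic for attention.
mlp_cost_ratio = byte_len / bpe_len            # ~4x more compute at byte level
attn_cost_ratio = (byte_len / bpe_len) ** 2    # ~16x more for the attention term
print(mlp_cost_ratio, attn_cost_ratio)
```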

49:37

and I think that really hurts images

49:39

like if we had some kind of architecture

49:40

that could quite gracefully try and like

49:43

think at the frequency that's like useful

49:46

for uh no matter whether it's looking at

49:49

high definition image or quite a low

49:50

definition image or it's looking at 24

49:52

kilohertz audio or 16 kilohertz audio

49:55

just like we do I think we're very

49:57

graceful with things like that we have

50:00

kind of

50:01

like very like selective attention-based

50:03

Vision we are able to like process audio

50:06

and kind of we're able to like have a

50:09

kind of our own internal kind of

50:10

thinking frequency that works for us and

50:13

this is just something that's like a

50:14

clear limitation in our architecture so

50:17

yeah right now if you just model pixel

50:19

level with a Transformer it's very wasteful

50:20

and it's not something

50:22

um that's like the optimal thing to do

50:24

right now but given there's a clear

50:26

limitation on our architecture it's

50:28

possible it's still the right thing to

50:29

do it's just we need to figure out how

50:30

to do it efficiently

50:33

so does that suggest that a model that

50:35

could

50:36

um you know switch between different

50:37

resolutions uh like at the one-token-at-a-

50:41

time resolution that's important for

50:42

text versus the

50:44

um I don't know I think you mentioned

50:45

you know the 24 kilohertz of audio does

50:48

that suggest that a model

50:50

like that would uh be able to compress

50:53

like different modalities better

50:56

um and have you know higher sensory yeah

51:00

that's I think it would be crazy to

51:02

write it off at this stage anyway I

51:04

think a lot of people assume like oh

51:06

pixel level modeling it just doesn't

51:08

make sense on some fundamental level but

51:10

it's hard to know that whilst we still

51:12

have a big uh kind of fundamental

51:15

blocker with our best architecture so

51:18

yeah I think it's I wouldn't write it

51:20

off anyway

51:22

so Michael is slacking me he wants me to

51:24

ask if you follow the S4 line of work

51:26

yeah

51:27

yeah I think that's a really important

51:29

architecture

51:30

sorry go on

51:32

yeah I I was just uh so S4 uh so okay so

51:36

I guess for for those for those

51:37

listening S4 has a property where

51:40

um it was designed explicitly for

51:41

long sequences

51:43

um and one of the uh early uh set of uh

51:48

you know driving applications was this

51:50

pixel-by-pixel image

51:52

classification

51:54

um sequential CIFAR uh that they called

51:56

it

51:57

um and uh one of the interesting things

52:00

that S4 can do is actually switch from

52:03

um these different uh resolutions by

52:07

um uh by changing essentially

52:11

the parameterization a little bit
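For context, a minimal sketch of that mechanism: S4 keeps continuous-time state-space parameters and discretizes them with a step size, so re-discretizing with a different step lets the same parameters operate at a different sampling rate. This uses the standard bilinear discretization described in the S4 paper; the tiny (A, B) below is a toy stand-in, not trained S4 parameters.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization of x'(t) = A x(t) + B u(t) with step dt."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (dt / 2) * A)
    return inv @ (I + (dt / 2) * A), inv @ (dt * B)

A, B = -np.eye(2), np.ones((2, 1))                # toy continuous-time parameters
A16, B16 = discretize_bilinear(A, B, 1 / 16000)   # view the signal at 16 kHz
A24, B24 = discretize_bilinear(A, B, 1 / 24000)   # same parameters at 24 kHz
```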

52:13

um

52:14

so does that suggest to you that like

52:17

something like S4 or something with a

52:20

different

52:21

um you know encoding would uh would have

52:25

these like implications for I don't know

52:27

being more intelligent or being a

52:29

better compressor of these other

52:31

modalities or something like that yeah

52:33

so like on a broad brushstroke like S4

52:36

allows you to maybe have a much longer

52:38

context uh than attention without paying

52:41

the quadratic compute cost uh there are

52:43

still other I don't think it solves

52:45

everything but I think it seems like a

52:47

very very promising like piece of

52:50

architecture development

52:52

um I think other parts are like even

52:54

within your MLP like linears in your

52:56

MLPs which are actually for a large

52:58

language model most of your compute

53:00

um you really want to be spending well

53:03

I'm saying I don't know this for sure

53:05

but it feels like there should be a very

53:07

non-uniform allocation of compute uh

53:09

depending on what is easy to think about

53:11

and what's hard to think about

53:12

um and so yeah if there's a more natural

53:15

way of

53:16

there was a cool paper called CALM which

53:19

uh it was about early exiting like

53:22

essentially when a neural network or some

53:24

intermediate layer feels like it's

53:26

it's done enough compute and it can now

53:28

just like skip all the way to the end

53:30

that was kind of an idea in that regime
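A rough sketch of that early-exiting pattern (the general idea, not CALM's exact exit criterion; the layer sizes and the 0.9 threshold are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

d_model, vocab, depth = 16, 100, 6
layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(depth)])  # stand-in blocks
heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(depth)])     # per-layer exit heads

def forward_with_early_exit(x, threshold=0.9):
    h = x
    for layer, head in zip(layers, heads):
        h = torch.relu(layer(h))
        probs = torch.softmax(head(h), dim=-1)
        if probs.max().item() >= threshold:   # confident enough: skip remaining layers
            return probs
    return probs                              # hard token: use the full depth

out = forward_with_early_exit(torch.randn(d_model))
```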

53:32

but like this kind of adaptive compute

53:33

theme I think it could be a really

53:35

really big

53:36


53:37

um

53:38

like

53:39

breakthrough towards this if we think of

53:41

our own thoughts it's like very

53:43

very sparse very non-uniform

53:45

and uh you know maybe some of that stuff

53:47

is written in from evolution but

53:49

yeah having like this incredibly

53:51

homogenous uniform compute for every

53:53

token uh it doesn't quite feel right so

53:56

yeah I think S4 is very cool I think it

53:57

could be could help in this direction

53:59

for sure

54:00

interesting uh we did get one more

54:03

question from the class that I wanted to

54:05

get your opinion on so the question is

54:07

do you think compression research for

54:09

the sake of compression uh is important

54:11

for these I guess for these like

54:13

intelligence implications

54:16

um reacting a little bit to the comments

54:17

on the Hutter Prize

54:19

um and it sounds like the compression

54:21

capabilities of the foundation models

54:22

are kind of byproducts instead of the

54:25

primary goal when training them

54:27

yeah so this is what I think I think um

54:30

the compression objective is the only

54:32

training objective that I know right now

54:34

uh which is completely non-gameable and

54:37

has a very rigorous Foundation of why it

54:40

should help us create better and better

54:42

generalizing agents and a better

54:43

perceptual system

54:45

however we should be continually

54:48

evaluating models based on their

54:51

capabilities which is fundamentally what

54:52

we care about and so the compression

54:55

like metric itself is one of the most

54:58

like harsh alien metrics you can look at

54:59

it's just a number that means almost

55:01

nothing to us and actually just as that

55:04

number goes down like say or should I

55:07

say the compression rate goes up or the

55:09

kind of bits per character say go down

55:11

it's very non-obvious what's going to

55:13

happen

55:14

um so you have to have other evals where

55:16

we can try and like predict the

55:17

emergence of new capabilities or track

55:19

them because those are the things that

55:20

fundamentally people care about uh but I

55:23

think people that either do research in

55:26

this area or study at a

55:28

university as prestigious as Stanford

55:30

should have a good understanding of why

55:33

all of this makes sense

55:35

um but I still but I do think yeah that

55:37

doesn't necessarily mean it needs to

55:39

completely go into everything about this

55:41

every piece of research and doing research for

55:42

the compression itself I don't think

55:44

it's necessarily the right way to think

55:46

about it

55:47

um yeah hopefully that answers that

55:49

question

55:50

I wonder if

55:52

um that has implications for things like

55:54

training for more than one Epoch uh I

55:57

think somehow the field recently has

56:01

um uh arrived at the idea that you

56:03

should only you know see all your

56:04

training data once

56:06

um yeah I've got response to that so

56:08

actually training for more than one

56:09

Epoch is not um it's not like if you do

56:13

it literally yeah then it doesn't really

56:15

make sense from a compression

56:16

perspective because once you've finished

56:18

your epoch you can't count the log loss

56:21

of your second Epoch towards your

56:22

compression objective because a very

56:24

powerful model by that point if you did

56:26

it could just like say use

56:28

retrieval to memorize everything

56:29

you've seen and then it's just going to

56:31

get perfect performance from then on

56:32

that obviously is not creating a more

56:34

intelligent system but it might like

56:36

it'll minimize your training loss and make

56:38

you feel good about yourself

56:40

um so at the same time yeah training is

56:42

more than one Epoch can give you better

56:43

generalization what's happening

56:45

um

56:46

I think the way to think about it is the

56:49

ideal setup would be like in RL you have

56:51

this experience replay so you're going

56:52

through your epoch

56:54

in theory like all you can count towards

56:56

your like compression score is your

56:58

prediction for the next held out piece

56:59

of training data but there's no reason

57:01

why you couldn't then actually chew up

57:03

and like spend more SGD steps on like

57:06

past data so I think in the compression

57:08

setup multi-epoch just looks like replay

57:10

essentially now in practice I think just

57:13

pragmatically it's easier to just train

57:14

with multiple epochs

57:16

um you know so yeah I think I just want

57:19

to clear up like compression does not

57:21

it's not actually synonymous with only

57:22

training for one Epoch because you can

57:24

still do replay and essentially see your

57:25

data multiple times but it basically

57:27

says you can only like

57:29

score yourself for all of the

57:31

predictions which will let your next

57:32

batch of data have held out data that's

57:35

the only thing that's the fair thing

57:36

just came out of school

57:38

hopefully
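A minimal sketch of the scoring rule being described, i.e. prequential coding with replay (the `model` methods are an assumed interface for illustration): each batch counts toward the compression score exactly once, before the model has trained on it, and any number of replay steps over past data are allowed afterwards.

```python
import random

def prequential_bits(model, batches, replay_steps=2):
    """Hypothetical model interface: neg_log2_likelihood(batch) and sgd_step(batch)."""
    total_bits, seen = 0.0, []
    for batch in batches:
        total_bits += model.neg_log2_likelihood(batch)  # scored while still held out
        model.sgd_step(batch)                           # now it may be trained on
        seen.append(batch)
        for _ in range(replay_steps):                   # multi-epoch == replay here;
            model.sgd_step(random.choice(seen))         # it adds nothing to the score
    return total_bits
```

Memorizing past batches can drive the replayed loss to zero, but none of that enters the score; only predictions on not-yet-seen data do.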

57:40

so we're nearing the end of the hour so

57:43

I wanted to just give you a chance uh if

57:45

there's anything

57:47

um you know that you're excited about

57:48

coming out uh anything in the pipeline

57:50

that that you wanted to talk about and

57:52

just wanted to give you a chance to kind

57:54

of give a preview of what may be next in

57:56

this area uh and kind of uh what's

57:59

coming up and exciting for you

58:04

um

58:05

um okay

58:06

well

58:09

I think 2023

58:12

it doesn't need me to really sell it

58:14

very much I think it's going to be

58:15

pretty much like every week something

58:16

amazing is going to happen so

58:19

um if not every week then every two

58:21

weeks the pace of innovation right now

58:23

I'm sure as you're very aware is pretty

58:25

incredible I think there's going to be

58:27

lots of stuff

58:28

amazing stuff coming out from companies

58:31

in the Bay Area such as open AI uh and

58:34

around the world in foundation models

58:37

both in the development of stronger ones

58:40

but also this incredible amount of

58:42

Downstream research that there's just

58:44

such a huge community of people using

58:46

these things now tinkering with them

58:47

exploring capabilities so yeah I feel

58:50

like we're kind of in a in a cycle of

58:53

mass

58:55

um Innovation so I think yeah it's just

58:58

strap in and try not to get too

59:00

overwhelmed

59:02

yeah

59:03

it's looking to be a very

59:06

exciting year

59:07

absolutely

59:09

right

59:10

um yeah so that brings us to the end of

59:13

the hour so I wanted to thank you Jack

59:14

again for coming on it was a very

59:16

interesting talk uh thanks of course

59:18

everybody who's listening online and in

59:20

the class for for your great questions

59:22

um if uh this Wednesday we're gonna have

59:24

Susan Zhang from Meta she's going to be

59:25

talking a little bit about the trials of

59:27

training uh OPT-175B so that

59:30

would be very interesting for us to uh

59:33

to talk to her and hear about

59:35

um if you want to you can go to our

59:36

website

59:38

mlsys.stanford.edu to see the rest of

59:40

our schedule I believe we only have one

59:42

more week left

59:43

um so it's been it's been an exciting

59:45

quarter thank you of course everyone for

59:47

participating

59:49

um when with that we will uh wave

59:51

goodbye and say goodbye to YouTube