In conversation | Geoffrey Hinton and Joel Hellermark

Sana
20 May 2024 · 45:46

Summary

TL;DR: This conversation ranges across deep learning and artificial intelligence. It covers how to select talent and the role intuition plays in that choice. It looks back on research years at Carnegie Mellon and the University of Edinburgh, and explores the connections between neural networks, deep learning, and how the brain works. It recalls early explorations of AI, including collaborations with Terry Sejnowski and Peter Brown, and an early interest in adjusting the weights of neural networks. It also discusses the potential of large language models and how they encode information by finding common structure, which enables creative analogies and reasoning. It closes with the role of GPUs in neural network training and thoughts on the future of computing.

Takeaways

  • 🧠 The conversation offers deep insights into how the brain learns and how AI has developed, stressing the link between the brain's learning mechanisms and AI algorithms.
  • 🤖 It covers early explorations of AI, including the interest in neural networks and machine learning, and the challenges and disappointments of early research.
  • 🔍 It highlights the importance of intuition in choosing talent and research directions, and how the collaboration with Ilya pushed the field of AI forward.
  • 🤝 It describes collaborations with different researchers, such as Terry Sejnowski and Peter Brown, and how those collaborations shaped progress in AI.
  • 📚 It recounts early disappointments with philosophy and physiology, and the turn toward AI and neural network research.
  • 💡 It notes the influence of Donald Hebb's and John von Neumann's work on AI research, and their interest in neural networks and how the brain computes.
  • 🧐 It underscores early intuitions about large neural networks and the recognition that these models could go beyond simple symbol processing.
  • 🔗 It discusses how models are trained by predicting the next symbol or word, and how this approach forces them to understand.
  • 🔢 It describes how gradients and optimizers are used to improve neural networks, and how Ilya's intuition in this area helped drive the research.
  • 🌐 It discusses the importance of multimodal learning and how it helps models better understand space and objects.
  • 🚀 It emphasizes the role of GPUs in training large neural networks and how this technology propelled the whole field of AI.

Q & A

  • What was the lab environment like at Carnegie Mellon?

    - At Carnegie Mellon, students were still programming in the lab on a Saturday night, because they believed they were working on the future of computer science. That was a sharp contrast with the culture in England, where researchers would go to the pub to relax after six in the evening.

  • What was it like studying brain science at Cambridge?

    - Studying brain science at Cambridge was disappointing, because all they taught was how neurons conduct action potentials, without really explaining how the brain works. I then turned to philosophy, which gave no satisfying answers either, and finally studied artificial intelligence at Edinburgh, where I became interested in simulating how the brain operates.

  • What led you to become interested in artificial intelligence?

    - I was influenced by a book by Donald Hebb, who was very interested in how learning adjusts the connection strengths of a neural network. A book by von Neumann also influenced me; he explored how the brain computes and how that differs from ordinary computers.

  • What was your collaboration with Terry Sejnowski like?

    - Terry Sejnowski and I worked very closely on Boltzmann machines, meeting every month to do research and talk things through. Although many of the technical results were interesting, we ultimately concluded that it is not how the brain works.

  • What happened when Ilya Sutskever first came to you?

    - Ilya first came to me on a Sunday. He knocked on the door and told me he had been frying chips over the summer but would rather work in my lab. I told him to make an appointment, but he wanted to talk right away. He went on to show that both his intuitions and his mathematical ability were outstanding.

  • What is your view of large language models?

    - I think that by predicting the next symbol, large language models are forced to understand what has already been said. Some people claim these models merely predict the next symbol, but in fact the prediction requires some reasoning, and they are gradually becoming more creative and intelligent.

  • What is your view of multimodal models?

    - Multimodal models can combine vision, sound, and other data sources to improve their understanding of space and objects. Such models learn not only from language but also from video and image data, which can significantly improve their reasoning and understanding.

  • What are your main concerns about the direction of AI?

    - My main concerns are its applications in healthcare and its possible social effects. AI has the potential to make medicine much more efficient, but it could also be used maliciously, for example for mass surveillance or manipulating public opinion. We need to develop the technology while treating its potential harms with caution.

  • What are your insights on training large neural networks?

    - I think backpropagation is the right thing to do in large-scale neural network training: obtaining gradients to adjust the parameters has been very successful in practice. Alternatives may exist, but backpropagation has proven effective in both theory and practice.

  • How do you think talent should be selected and developed?

    - Sometimes intuition matters a great deal. For example, my first meeting with Ilya gave me a sense of his talent. I also think a lab needs a diverse mix of students: some are technically very strong, others are very creative, and different kinds of students working together produce better research.

Outlines

00:00

🤖 A journey into AI and neural networks

This segment recounts the researcher's time at Carnegie Mellon and his early explorations of artificial intelligence and neural networks. He recalls his disappointment studying physiology and philosophy at Cambridge, since neither answered his questions about how the brain works. He eventually moved to the University of Edinburgh to study AI, and was drawn in by books by Donald Hebb and John von Neumann on neural networks and how the brain computes. He believes the brain learns not through logical rules but by changing the strengths of connections in a neural network.

05:04

👨‍💼 Research collaborations and the role of intuition in selecting talent

This segment describes the researcher's collaboration with Terry Sejnowski, who was not at Carnegie Mellon, and how they studied neural networks together. The researcher stresses the importance of intuition when selecting talent, shares how he chose a student like Ilya on instinct, and notes Ilya's early interest in and intuition for mathematics and AI. He also mentions the collaboration with Peter Brown, a statistician whose work on hidden Markov models had an important influence.

10:06

🚀 Where neural networks meet intuition

This exchange shows the interaction between the researcher and Ilya, a student with deep insight into neural networks and optimizers. They discussed gradient descent and the use of function optimizers, and how quickly Ilya understood, and questioned, the existing methods for training neural networks. The researcher shares the pleasure of working with Ilya and how solving problems together pushed the field of AI forward.

15:07

🧠 How neural networks learn and understand

The researcher discusses how neural networks learn language by predicting the next symbol, arguing that this forces the model to understand, yielding reasoning similar to a human's. He stresses that large neural networks can reason, and may become more creative as they grow. He also cites AlphaGo as an example of how, within a specific domain, reinforcement learning can produce innovation beyond existing knowledge.

20:08

🔍 Reasoning and multimodal learning

This exchange explores how extending neural networks to multimodal data such as images, video, and sound strengthens their understanding and reasoning. The researcher believes multimodal learning will make models much stronger at spatial understanding and will help uncover deep connections between different domains. He also discusses whether the human brain evolved for language and how language interacts with cognition.

25:10

💡 Innovation and the future of neural networks

The researcher shares his views on where neural networks are headed. He believes they will become more efficient by finding common structure across different things, and may surpass humans in creativity. He also discusses training models to self-correct in order to improve their reasoning, and predicts how multimodal models will change the AI field.

30:11

🔧 Computation and hardware for neural networks

This exchange looks back at how the researcher foresaw the potential of GPUs for training neural networks, and his early work in that area. He also discusses the future direction of computing, including the trade-offs between analog and digital computation and how to make AI systems more efficient and energy-frugal.

35:11

🌟 Time scales in neural networks and the brain

The researcher explores the difference in time scales between the brain and neural networks, pointing out that the brain changes its weights on many time scales, while current neural network models typically have only two. He argues that future neural networks will need to incorporate more time scales to come closer to how the brain works.

40:11

🤔 Consciousness and feelings in neural networks

This exchange discusses whether neural networks could simulate human consciousness and feelings. The researcher suggests that if neural networks can reflect on themselves and hold persistent memories, they may develop something like human emotional experience. He also shares his own view of feelings and consciousness, and how they relate to actions and constraints.

45:12

🎯 Future directions for neural network research

The researcher shares his views on future directions for neural network research, including his curiosity about whether the brain uses backpropagation and his interest in learning at multiple time scales. He also discusses how to choose good research problems, and stresses the importance of curiosity in driving research.

🏆 Achievements and reflections

In this exchange, the researcher reflects on his achievements in the field of neural networks, in particular his work on the Boltzmann machine learning algorithm. He expresses pride in that work even if it may not be practical. He also discusses what he is focused on now and his thoughts about the future.

Keywords

💡Neural networks

A neural network is a mathematical model inspired by the structure of the brain, used to imitate how networks of neurons process information. Neural networks are at the core of the discussion, especially around the development of AI and machine learning: for example, how they can be used to model the workings of the brain, and how they learn by changing the weights of connections.

💡Intuition

Intuition is a judgment or understanding reached quickly, without explicit logical reasoning. In the video it is cited as an important factor in selecting talent and doing research; for example, when Ilya appeared at the door, intuition said he was a clever person worth working with.

💡Gradient

In mathematics and machine learning, the gradient of a multivariate function is its vector of derivatives, pointing in the direction of fastest increase. The video notes the importance of gradients for neural network learning, especially in the discussion of backpropagation, which uses gradients to update the network's weights.

💡Backpropagation

Backpropagation is the algorithm for computing the gradient of the loss function in a neural network, and it is the key step in training. The video discusses its importance and how it helps a network learn complex tasks by adjusting its weights.
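To make the last two entries concrete, here is a minimal sketch of a gradient computed by backpropagation and used to update weights. The tiny XOR network and every name in it are illustrative only, not anything from the interview:

```python
import numpy as np

# Tiny 2-4-1 network trained on XOR: an illustrative sketch.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for step in range(5000):
    # Forward pass: weighted inputs passed through a nonlinearity.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: the chain rule gives the gradient of the squared
    # error with respect to every weight (this is backpropagation).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent: move each weight against its gradient.
    lr = 0.5
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(out.round(2).ravel())  # approaches [0, 1, 1, 0]
```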

💡Multimodal data

Multimodal data combines several different kinds of data, such as text, images, and sound. The video mentions the potential of training large models on multimodal data, which may improve their understanding and reasoning.

💡Hidden Markov models

A hidden Markov model is a statistical model describing a Markov process with hidden, unobserved parameters. In the video it is mentioned as an important algorithm with wide application in areas such as speech recognition.

💡Creativity

Creativity is the ability to produce novel and valuable ideas. The video discusses how large language models encode information by finding common structure between different things, which may give rise to creativity.

💡Self-play

Self-play refers to a system improving itself through interaction with its environment. In the video, AlphaGo is used to illustrate how self-play can take a system beyond the current state of the art by developing innovative strategies through playing against itself.

💡Simulation

Simulation means using computer programs or models to imitate real-world processes or systems. The video raises simulation particularly in discussing how neural networks can model the workings of the brain and be used to test theories.

💡Analogy

An analogy conveys new meaning or understanding by comparing things from different domains. The video mentions how large language models understand and compress information through analogies, for example comparing a compost heap with an atom bomb.

💡Consciousness

Consciousness usually refers to an individual's awareness of their inner states and external environment. The video discusses whether AI could simulate human consciousness, and whether consciousness can be seen as dispositions to act rather than an inner theater of experience.

Highlights

At Carnegie Mellon, students and researchers were full of conviction about the future of computer science, believing their work would change its course.

Disappointment with how the brain was taught while studying physiology at Cambridge: the course covered only how neurons conduct action potentials and did not explain how the brain works.

Turning to philosophy in search of an understanding of how the mind works, and being equally disappointed.

Studying artificial intelligence (AI) at Edinburgh felt more interesting, because theories could be tested by simulation.

Donald Hebb's book had an important influence on understanding how connection strengths are learned in neural networks.

John von Neumann's book sparked interest in how the brain's way of computing differs from conventional computers.

During the Edinburgh years, a firm belief that the brain learns by modifying connections in a neural network.

Collaboration with Terry Sejnowski of Johns Hopkins on neural networks and how the brain works.

Collaboration with the statistician Peter Brown, and learning about hidden Markov models from him.

Ilya's arrival and his intuition about backpropagation, proposing that the gradient be handed to an optimizer.

The influence of Ilya's independent thinking and early interest in AI on the development of his intuitions.

In AI research, increases in the scale of data and computation have mattered more than new algorithms.

The character-level prediction paper demonstrated deep learning models' ability to understand text.

Deep learning models come to understand a problem by predicting the next symbol; this is not mere symbol prediction.

Large language models encode information by finding common structure, which makes them more efficient.

The development of multimodal models will improve models' spatial understanding, reasoning, and creativity.

On the relationship between language and cognition there are three different views; the newest holds that language symbols are converted into rich embedding vectors, and understanding arises from the interactions of these vectors.

The early intuition about using GPUs for neural network training, and its impact on the field of computing.

Discussion of whether to use fast weights, and their potential role in the brain.

Discussion of simulating consciousness, and the possibility that AI assistants develop human-like feelings and self-reflection.

On choosing research problems: curiosity-driven research and questioning views that everyone accepts.

The long-standing question of whether neural networks and the brain both use backpropagation, and what it implies for future research.

Concerns about possible harms of AI, including misuse for killer robots, manipulation of public opinion, or mass surveillance.

The impact AI assistants may have on the research process, including improving research efficiency and helping with thinking.

On developing intuition: accepting facts critically and trusting one's own intuitions.

On current research directions in AI: large models and training on multimodal data are a promising direction.

The achievement he is proudest of: developing the learning algorithm for Boltzmann machines, even though it may be impractical.

Transcripts

00:00

Have you reflected a lot on how to select talent, or has that mostly been intuitive to you? Ilya just shows up and you're like, this is a clever guy, let's work together. Or have you thought a lot about that? Are we recording? Should we roll this? Yeah, let's roll this. Okay, we're good.

So I remember when I first got to Carnegie Mellon from England. In England, at a research unit, it would get to be 6:00 and you'd all go for a drink in the pub. At Carnegie Mellon, I remember after I'd been there a few weeks, it was Saturday night, I didn't have any friends yet and I didn't know what to do, so I decided I'd go into the lab and do some programming, because I had a Lisp machine and you couldn't program it from home. So I went into the lab at about 9:00 on a Saturday night, and it was swarming. All the students were there, and they were all there because what they were working on was the future. They all believed that what they did next was going to change the course of computer science. It was just so different from England, and so that was very refreshing.

Take me back to the very beginning, Geoff, at Cambridge, trying to understand the brain. What was that like?

It was very disappointing. I did physiology, and in the summer term they were going to teach us how the brain worked, and all they taught us was how neurons conduct action potentials, which is very interesting but doesn't tell you how the brain works. So that was extremely disappointing. I switched to philosophy; I thought maybe they'd tell us how the mind worked. That was very disappointing too. I eventually ended up going to Edinburgh to do AI, and that was more interesting: at least you could simulate things, so you could test out theories.

Do you remember what intrigued you about AI? Was it a paper? Was it any particular person that exposed you to those ideas?

I guess it was a book I read by Donald Hebb that influenced me a lot. He was very interested in how you learn the connection strengths in neural nets. I also read a book by John von Neumann early on, who was very interested in how the brain computes and how it's different from normal computers.

And did you get the conviction that these ideas would work out at that point? What was your intuition back in the Edinburgh days?

It seemed to me there has to be a way that the brain learns, and it's clearly not by having all sorts of things programmed into it and then using logical rules of inference; that just seemed to me crazy from the outset. So we had to figure out how the brain learned to modify connections in a neural net so that it could do complicated things. Von Neumann believed that; Turing believed that. Von Neumann and Turing were both pretty good at logic, but they didn't believe in this logical approach.

And what was your split between studying the ideas from neuroscience and just doing what seemed to be good algorithms for AI? How much inspiration did you take early on?

I never did that much study of neuroscience. I was always inspired by what I'd learned about how the brain works: there's a bunch of neurons, they perform relatively simple operations, they're nonlinear, but they collect inputs, they weight them, and then they give an output that depends on that weighted input. And the question is, how do you change those weights to make the whole thing do something good? It seems like a fairly simple question.
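The picture Hinton sketches here, inputs collected, weighted, and passed through a nonlinearity, fits in a few lines of code. A toy sketch assuming a sigmoid nonlinearity; none of this is code from the conversation:

```python
import numpy as np

def neuron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """A model neuron as described: collect inputs, weight them,
    and emit a nonlinear output of the weighted sum."""
    weighted_input = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-weighted_input))  # sigmoid nonlinearity

# Learning means changing `weights` so a whole network of such
# units does something useful: the "fairly simple question".
print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.2]), bias=0.0))
```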

03:38

What collaborations do you remember from that time?

The main collaboration I had at Carnegie Mellon was with someone who wasn't at Carnegie Mellon. I was interacting a lot with Terry Sejnowski, who was in Baltimore at Johns Hopkins, and about once a month either he would drive to Pittsburgh or I'd drive to Baltimore; it's 250 miles away. We would spend a weekend together working on Boltzmann machines. That was a wonderful collaboration. We were both convinced it was how the brain worked. That was the most exciting research I've ever done, and a lot of technical results came out that were very interesting, but I think it's not how the brain works.

I also had a very good collaboration with Peter Brown, who was a very good statistician. He worked on speech recognition at IBM and then came as a more mature student to Carnegie Mellon, just to get a PhD, but he already knew a lot. He taught me a lot about speech, and he in fact taught me about hidden Markov models. I think I learned more from him than he learned from me; that's the kind of student you want. When he taught me about hidden Markov models, I was doing backprop with hidden layers, only they weren't called hidden layers then, and I decided that the name they use in hidden Markov models is a great name for variables that you don't know what they're up to. So that's where the name "hidden" in neural nets came from: Peter and I decided that was a great name for the hidden layers in neural nets. But I learned a lot from Peter about speech.

05:05

Take us back to when Ilya showed up at your office.

I was in my office, probably on a Sunday, and I was programming, I think, and there was a knock on the door. Not just any knock: it was sort of an urgent knock. So I went and answered the door, and there was this young student there, and he said he was cooking fries over the summer but he'd rather be working in my lab. So I said, well, why don't you make an appointment and we'll talk? And Ilya said, how about now? And that sort of was Ilya's character. So we talked for a bit, and I gave him a paper to read, which was the Nature paper on backpropagation, and we made another meeting for a week later. He came back and said, I didn't understand it, and I was very disappointed. I thought he seemed like a bright guy, but it's only the chain rule, it's not that hard to understand. And he said, oh no, no, I understood that. I just don't understand why you don't give the gradient to a sensible function optimizer. And that took us quite a few years to think about. It kept on like that with him: his raw intuitions about things were always very good.
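Ilya's question, why hand-derived gradients were fed to plain gradient descent rather than to an off-the-shelf optimizer, can be illustrated on any differentiable function. A hedged sketch: the Rosenbrock function and SciPy's L-BFGS-B below merely stand in for whatever loss and optimizer one might actually use.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """A classic test function standing in for a network's loss."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """The gradient: in a neural net, this is what backprop computes."""
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

# Option 1: plain gradient descent with a hand-tuned step size.
x = np.array([-1.0, 1.0])
for _ in range(5000):
    x -= 1e-3 * rosenbrock_grad(x)

# Option 2: hand the same gradient to a "sensible function optimizer".
result = minimize(rosenbrock, np.array([-1.0, 1.0]),
                  jac=rosenbrock_grad, method="L-BFGS-B")

print("gradient descent:", x)          # still crawling toward (1, 1)
print("L-BFGS-B:       ", result.x)   # essentially at (1, 1)
```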

06:14

What do you think had enabled those intuitions for Ilya?

I don't know. I think he always thought for himself. He was always interested in AI from a young age, and he's obviously good at math. But it's very hard to know.

And what was that collaboration between the two of you like? What part would you play, and what part would Ilya play?

It was a lot of fun. I remember one occasion when we were trying to do a complicated thing producing maps of data, where I had a kind of mixture model, so you could take the same bunch of similarities and make two maps, so that in one map "bank" could be close to "greed" and in another map "bank" could be close to "river". Because in one map you can't have it close to both, right? "River" and "greed" are a long way apart. So we'd have a mixture of maps. We were doing it in MATLAB, and this involved a lot of reorganization of the code to do the right matrix multiplies, and Ilya got fed up with that. So he came in one day and said, I'm going to write an interface for MATLAB, so I program in this different language, and then I have something that just converts it into MATLAB. And I said, no, Ilya, that'll take you a month to do. We've got to get on with this project; don't get diverted by that. And he said, it's okay, I did it this morning.

07:34

That's quite incredible. And throughout those years, the biggest shift wasn't necessarily just the algorithms but also the scale. How did you view that scale over the years?

Ilya got that intuition very early. Ilya was always preaching that you just make it bigger and it'll work better, and I always thought that was a bit of a cop-out, that you're going to have to have new ideas too. It turns out Ilya was basically right. New ideas help; things like Transformers helped a lot. But it was really the scale of the data and the scale of the computation. Back then we had no idea computers would get, like, a billion times faster; we thought maybe they'd get a hundred times faster. We were trying to do things by coming up with clever ideas that would have just solved themselves if we had had bigger scale of data and computation.

In about 2011, Ilya and another graduate student called James Martens had a paper using character-level prediction. We took Wikipedia and tried to predict the next HTML character, and that worked remarkably well. We were always amazed at how well it worked. That was using a fancy optimizer on GPUs, and we could never quite believe that it understood anything, but it looked as though it understood, and that just seemed incredible.

08:58

Can you take us through how models are trained to predict the next word, and why that is the wrong way of thinking about them?

Okay, I don't actually believe it is the wrong way. In fact, I think I made the first neural net language model that used embeddings and backpropagation. It was very simple data, just triples, and it was turning each symbol into an embedding, then having the embeddings interact to predict the embedding of the next symbol, and from that predicting the next symbol, and then backpropagating through that whole process to learn these triples. And I showed it could generalize. About ten years later, Yoshua Bengio used a very similar network and showed it worked with real text, and about ten years after that, linguists started believing in embeddings. It was a slow process.
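That description maps almost line for line onto a few lines of modern code. A rough sketch of the idea (PyTorch used purely for convenience; the dimensions, names, and random data are made up, and this is not the original model):

```python
import torch
import torch.nn as nn

class TinySymbolPredictor(nn.Module):
    """Illustrative only: embed two context symbols, let the embeddings
    interact through a hidden layer, and predict the next symbol."""
    def __init__(self, vocab_size: int, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # symbol -> embedding
        self.interact = nn.Linear(2 * dim, dim)      # embeddings interact
        self.predict = nn.Linear(dim, vocab_size)    # -> next-symbol logits

    def forward(self, context):                      # context: (batch, 2)
        e = self.embed(context).flatten(1)           # (batch, 2 * dim)
        h = torch.tanh(self.interact(e))
        return self.predict(h)

# Training on triples (symbol1, symbol2 -> symbol3), backpropagating
# through the whole process: the same spirit, if not the same detail.
model = TinySymbolPredictor(vocab_size=100)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
context = torch.randint(0, 100, (32, 2))   # fake data for the sketch
target = torch.randint(0, 100, (32,))
loss = nn.functional.cross_entropy(model(context), target)
loss.backward()
opt.step()
```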

09:49

The reason I think it's not just predicting the next symbol is, if you ask, well, what does it take to predict the next symbol? Particularly if you ask me a question and the first word of the answer is the next symbol, you have to understand the question. So I think by predicting the next symbol, it's very unlike old-fashioned autocomplete. In old-fashioned autocomplete, you'd store sort of triples of words, and then if you saw a pair of words, you'd see how often different words came third, and that way you can predict the next symbol. That's what most people think autocomplete is like. It's no longer at all like that. To predict the next symbol, you have to understand what's been said. So I think you're forcing it to understand by making it predict the next symbol, and I think it's understanding in much the same way we are. A lot of people will tell you these things aren't like us; they're just predicting the next symbol; they're not reasoning like us. But actually, in order to predict the next symbol, it's going to have to do some reasoning. And we've seen now that if you make big ones, without putting in any special stuff to do reasoning, they can already do some reasoning, and I think as you make them bigger, they're going to be able to do more and more reasoning.
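The "old-fashioned autocomplete" being contrasted here is easy to write down: store triples of words, and given a pair, emit whichever word most often came third. A minimal sketch with a toy corpus:

```python
from collections import Counter, defaultdict

def train_trigram_autocomplete(corpus: str):
    """Store triples of words; given a pair, count what came third."""
    words = corpus.split()
    counts = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        counts[(a, b)][c] += 1
    return counts

def predict_next(counts, a: str, b: str) -> str:
    """Predict the most frequent third word: no understanding involved."""
    following = counts.get((a, b))
    return following.most_common(1)[0][0] if following else "<unknown>"

counts = train_trigram_autocomplete(
    "the cat sat on the mat the cat sat on the chair"
)
print(predict_next(counts, "cat", "sat"))   # -> "on", by raw frequency
```

A large language model shares only the interface with this; the point of the passage is that its mechanism is entirely different.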

11:00

Do you think I'm doing anything else than predicting the next symbol right now?

I think that's how you're learning. I think you're predicting the next video frame; you're predicting the next sound. I think that's a pretty plausible theory of how the brain's learning.

11:16

What enables these models to learn such a wide variety of fields?

What these big language models are doing is looking for common structure, and by finding common structure they can encode things using the common structure, and that's more efficient. Let me give you an example. If you ask GPT-4 why a compost heap is like an atom bomb, most people can't answer that. Most people haven't thought about it; they think atom bombs and compost heaps are very different things. But GPT-4 will tell you, well, the energy scales are very different and the time scales are very different, but the thing that's the same is that when the compost heap gets hotter, it generates heat faster, and when the atom bomb produces more neutrons, it produces more neutrons faster. So it gets the idea of a chain reaction, and I believe it's understood they're both forms of chain reaction. It's using that understanding to compress all that information into its weights. And if it's doing that, then it's going to be doing that for hundreds of things where we haven't seen the analogies yet, but it has. That's where you get creativity from: from seeing these analogies between apparently very different things. So I think GPT-4 is going to end up, when it gets bigger, being very creative. This idea that it's just regurgitating what it's learned, just pasting together text it's learned already, that's completely wrong. It's going to be even more creative than people.

12:40

I think you'd argue that it won't just repeat the human knowledge we've developed so far but could also progress beyond that. I think that's something we haven't quite seen yet. We've started seeing some examples of it, but to a large extent we're still at the current level of science. What do you think will enable it to go beyond that?

Well, we've seen that in more limited contexts. If you take AlphaGo, in that famous competition with Lee Sedol, there was move 37, where AlphaGo made a move that all the experts said must have been a mistake, but actually later they realized it was a brilliant move. So that was creative within that limited domain. I think we'll see a lot more of that as these things get bigger.

The difference with AlphaGo as well was that it was using reinforcement learning, which subsequently enabled it to go beyond the current state. It started with imitation learning, watching how humans play the game, and then through self-play it developed way beyond that. Do you think that's the missing component?

I think that may well be a missing component, yes. The self-play in AlphaGo and AlphaZero is a large part of why it could make these creative moves. But I don't think it's entirely necessary.
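The recipe sketched here, imitation learning first and then reinforcement learning through self-play, reduces to a simple loop. Everything below is schematic: `play_game` and `update` are placeholder callables, not the AlphaGo algorithm.

```python
def train_by_self_play(policy, play_game, update, num_iterations: int):
    """Schematic self-play loop: the policy improves against itself,
    so it is not capped by the human games it started from."""
    for _ in range(num_iterations):
        # Both sides of the game are played by the current policy.
        trajectory, winner = play_game(policy, policy)
        # Reinforcement learning: push the policy toward the moves
        # that led to a win, away from those that led to a loss.
        policy = update(policy, trajectory, winner)
    return policy

# `policy` would start from imitation learning on human games; self-play
# then carries it beyond them (producing moves like move 37).
```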

13:59

There's a little experiment I did a long time ago where you're training a neural net to recognize handwritten digits. I love that example, the MNIST example. And you give it training data where half the answers are wrong, and the question is how well it will learn. You make half the answers wrong once and keep them like that, so it can't average away the wrongness by seeing the same example sometimes with the right answer and sometimes with the wrong answer: for half of the examples, whenever it sees the example, the answer is always wrong. So the training data has 50% error, but if you train up backpropagation, it gets down to 5% error or less. In other words, from badly labeled data it can get much better results. It can see that the training data is wrong. And that's how smart students can be smarter than their advisor: their advisor tells them all this stuff, and for half of what their advisor tells them they think, no, rubbish, and they listen to the other half, and then they end up smarter than the advisor. So these big neural nets can actually do much better than their training data, and most people don't realize that.
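The experiment described here is easy to reproduce in outline. A sketch, with scikit-learn's small digits dataset and an off-the-shelf MLP standing in for the MNIST setup described (exact error rates will differ):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Corrupt half the training labels ONCE and keep them that way,
# so the wrongness cannot be averaged away across epochs.
rng = np.random.default_rng(0)
corrupt = rng.random(len(y_train)) < 0.5
y_noisy = y_train.copy()
y_noisy[corrupt] = rng.integers(0, 10, corrupt.sum())  # some land on the true class

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
model.fit(X_train, y_noisy)

print("training-label error:", (y_noisy != y_train).mean())  # roughly 0.45
print("test error:", 1 - model.score(X_test, y_test))        # far below the label noise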

15:13

So how do you expect these models to add...