In conversation | Geoffrey Hinton and Joel Hellermark

Sana
20 May 2024 · 45:46

Summary

TLDR: This conversation covers a range of topics in deep learning and artificial intelligence. It discusses how to select talent and the role intuition plays in that process. It looks back on research at Carnegie Mellon University and the University of Edinburgh, and explores the connections between neural networks, deep learning, and how the brain works. It touches on early explorations of AI, including collaborations with Terry Sejnowski and Peter Brown, and an early interest in how the weights of a neural network are adjusted. It also discusses the potential of large language models and how they encode information by finding common structure, which enables creative analogies and reasoning. Finally, it discusses the role of GPUs in training neural networks and thoughts on the future of computing.

Takeaways

  • 🧠 The conversation offers deep insights into how the brain learns and how AI has developed, emphasizing the connection between the brain's learning mechanisms and AI algorithms.
  • 🤖 It covers early explorations of AI, including an early interest in neural networks and machine learning, and the challenges and disappointments of that early research.
  • 🔍 It highlights the importance of intuition in choosing talent and research directions, and how the collaboration with Ilya pushed the field of AI forward.
  • 🤝 It describes collaborations with different researchers, such as Terry Sejnowski and Peter Brown, and how those collaborations shaped the progress of AI.
  • 📚 It discusses an early disappointment with philosophy and physiology, and the move toward AI and neural network research.
  • 💡 It mentions the influence of Donald Hebb's and John von Neumann's work on AI research, and their interest in neural networks and how the brain computes.
  • 🧐 It emphasizes an early intuition about large neural networks and their potential, and how they can go beyond simple symbol processing.
  • 🔗 It discusses how models are trained by predicting the next symbol or word, and how this approach forces the model to understand.
  • 🔢 It describes how gradients and optimizers are used to improve neural networks, and how Ilya's intuition helped drive that research.
  • 🌐 It discusses the importance of multimodal learning and how it helps models better understand space and objects.
  • 🚀 It highlights the role of GPUs in training large neural networks and how this technology propelled the whole field of AI.

Q & A

  • What was the working environment like in the lab at Carnegie Mellon University?

    - At Carnegie Mellon, students were still in the lab programming on a Saturday night, because they believed they were working on the future of computer science. That was a sharp contrast with the culture in England, where researchers would go to the pub to relax after six in the evening.

  • What was it like studying brain science at Cambridge?

    - Studying brain science at Cambridge was disappointing, because all they taught was how neurons conduct action potentials, without really explaining how the brain works. I then switched to philosophy, but didn't find satisfying answers there either, and eventually studied artificial intelligence at Edinburgh, where I became interested in simulating how the brain works.

  • What led you to become interested in artificial intelligence?

    - I was influenced by a book by Donald Hebb, who was very interested in how learning adjusts the connection strengths in a neural network. A book by von Neumann also influenced me; he explored how the brain computes and how that differs from ordinary computers.

  • What was your collaboration with Terry Sejnowski like?

    - My collaboration with Terry Sejnowski on Boltzmann machines was very close; we met every month to work on it and talk it through together. Many of the technical results were interesting, but in the end we concluded that it is not how the brain works.

  • What happened the first time Ilya Sutskever came to find you?

    - The first time Ilya came to find me was on a Sunday. He knocked on the door and told me he was cooking fries over the summer but would rather be working in my lab. I told him to make an appointment, but he wanted to talk right then. He later proved to have excellent intuitions as well as excellent mathematical ability.

  • What is your view of large language models?

    - I think that by predicting the next symbol, large language models are forced to understand what has already been said. Some people argue these models are just predicting the next symbol, but in fact that prediction requires a degree of reasoning, and as a result they gradually become more creative and intelligent.
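
The next-symbol training objective described here can be illustrated with a minimal sketch (an illustrative toy, not the model discussed in the conversation): a tiny vocabulary, one embedding per symbol, and a single weight matrix trained by gradient descent to predict the next symbol. The corpus, dimensions, and learning rate below are arbitrary assumptions.

    import numpy as np

    # Toy corpus and vocabulary (illustrative assumption, not from the interview).
    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vocab = sorted(set(corpus))
    stoi = {w: i for i, w in enumerate(vocab)}
    V, D = len(vocab), 8            # vocabulary size, embedding dimension

    rng = np.random.default_rng(0)
    E = rng.normal(0, 0.1, (V, D))  # one embedding vector per symbol
    W = rng.normal(0, 0.1, (D, V))  # maps an embedding to next-symbol scores

    pairs = [(stoi[a], stoi[b]) for a, b in zip(corpus, corpus[1:])]

    for epoch in range(500):
        loss = 0.0
        dE, dW = np.zeros_like(E), np.zeros_like(W)
        for x, y in pairs:
            h = E[x]                          # embed the current symbol
            logits = h @ W                    # predict scores for the next symbol
            p = np.exp(logits - logits.max())
            p /= p.sum()
            loss -= np.log(p[y])
            # Backpropagate the cross-entropy loss into W and the embedding.
            dlogits = p.copy(); dlogits[y] -= 1.0
            dW += np.outer(h, dlogits)
            dE[x] += W @ dlogits
        E -= 0.1 * dE / len(pairs)
        W -= 0.1 * dW / len(pairs)

    print("final average loss:", loss / len(pairs))

The only thing the model is scored on is how well it predicts the next symbol; anything it "understands" has to be packed into the embeddings and weights that serve that prediction.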

  • What is your view of multimodal models?

    - Multimodal models can combine vision, sound, and other sources of data to improve their understanding of space and objects. Such models can learn not only from language but also from video and image data, which significantly improves their reasoning and understanding.

  • What are your main concerns about the direction of AI?

    - My main concerns about AI include its applications in healthcare and its possible effects on society. AI has the potential to make healthcare much more efficient, but it could also be used maliciously, for example for mass surveillance or for manipulating public opinion. We need to develop the technology while treating its potential downsides with caution.

  • What are your views on training large neural networks?

    - I think that for training large neural networks, backpropagation is the right thing to do: you get the gradient and use it to adjust the parameters, and that approach has been extremely successful in practice. There may well be alternatives, but backpropagation has proved effective in both theory and practice.
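
A minimal sketch of "get the gradient, then adjust the parameter" (illustrative only; the data, network size, and learning rate are assumptions): a one-hidden-layer network fitted to a toy regression problem, with the gradients written out via the chain rule rather than taken from a library.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 2))
    y = X[:, :1] * X[:, 1:]                 # toy target: product of the two inputs

    W1 = rng.normal(0, 0.5, (2, 16)); b1 = np.zeros(16)
    W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

    for step in range(2000):
        h = np.tanh(X @ W1 + b1)            # forward pass
        pred = h @ W2 + b2
        err = pred - y
        loss = (err ** 2).mean()
        # Backward pass: the chain rule gives the gradient of the loss
        # with respect to every weight and bias.
        dpred = 2 * err / len(X)
        dW2 = h.T @ dpred;  db2 = dpred.sum(0)
        dh = dpred @ W2.T * (1 - h ** 2)
        dW1 = X.T @ dh;     db1 = dh.sum(0)
        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= 0.1 * g                    # move each parameter against its gradient

    print("final training loss:", loss)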

  • How do you select and develop talent effectively?

    - When selecting and developing talent, intuition is sometimes very important. For example, my first meeting with Ilya gave me a sense of his talent. I also think a lab needs a diverse mix of students: some are technically very strong, while others are very creative. Different kinds of students working together produce better research.

Outlines

00:00

🤖 A journey into artificial intelligence and neural networks

This section recounts the researcher's time at Carnegie Mellon University and his early exploration of artificial intelligence and neural networks. He recalls his disappointment studying physiology and philosophy at Cambridge in England, because those subjects did not answer his questions about how the brain works. He eventually moved to the University of Edinburgh to work on artificial intelligence, drawn in by Donald Hebb's and John von Neumann's books on neural networks and on how the brain computes. He believes the brain learns not through logical rules but by changing the strengths of the connections in a neural network.

05:04

👨‍💼 Research collaborations and the role of intuition in selecting talent

This section describes the researcher's collaboration with Terry Sejnowski, who was not at Carnegie Mellon, and how they studied neural networks together. The researcher stresses the importance of intuition when selecting talent, shares how he chose a student like Ilya on intuition, and highlights Ilya's early interest in and intuitions about mathematics and AI. He also mentions his collaboration with Peter Brown, a statistician whose work on hidden Markov models had an important influence.

10:06

🚀 Where neural networks meet intuition

This exchange shows the interplay between the researcher and Ilya, a student with deep insight into neural networks and optimizers. They discussed gradient descent and the use of function optimizers, and how Ilya quickly understood and questioned the existing methods for training neural networks. The researcher shares how much fun it was to work with Ilya and how they solved problems together, pushing the field of AI forward.

15:07

🧠 How neural networks learn and understand

The researcher discusses how neural networks learn language by predicting the next symbol, arguing that this forces the model to understand and so produces reasoning similar to a human's. He stresses that large neural networks can reason and may become more creative as they scale. He also cites AlphaGo as an example of how, within a specific domain, reinforcement learning can produce innovations that go beyond existing knowledge.

20:08

🔍 Reasoning and multimodal learning in neural networks

This exchange explores how extending neural networks to multimodal data (such as images, video, and sound) strengthens their understanding and reasoning. The researcher argues that multimodal learning will make models much stronger at spatial understanding and will help uncover deep connections between different domains. He also discusses whether the human brain evolved for language, and how language and cognition interact.

25:10

💡 Innovation in neural networks and where they go next

The researcher shares his views on the future development of neural networks. He believes they will become more efficient by discovering common structure across different things, and may surpass humans in creativity. He also discusses how models can be trained to correct themselves in order to improve their reasoning, and predicts how multimodal models will change the field of AI.

30:11

🔧 Computation and hardware for neural networks

This exchange looks back at how the researcher foresaw the potential of GPUs for training neural networks and shares his early work in that area. He also discusses where computing goes next, including the trade-offs between analog and digital computation and how to make AI systems more efficient and energy-frugal.

35:11

🌟 Time scales in neural networks and the brain

The researcher explores the difference in time scales between the brain and neural networks, pointing out that the brain changes its weights on many time scales, whereas current neural network models typically have only two. He argues that future neural networks will need to incorporate more time scales to come closer to how the brain works.
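
A rough sketch of what a third, intermediate time scale could look like (an illustration of the general idea, not a method described in the conversation): alongside slowly learned weights, keep a "fast" weight matrix that is updated by a Hebbian outer product on every input and decays back toward zero, so a recent input leaves a temporary trace. The decay rate, learning rate, and sizes are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 16
    W_slow = rng.normal(0, 0.1, (D, D))   # long-term knowledge (changes slowly)
    W_fast = np.zeros((D, D))             # temporary memory (changes every step)
    decay, eta = 0.95, 0.5                # illustrative assumptions

    def step(x, W_fast):
        h = np.tanh((W_slow + W_fast) @ x)              # both time scales shape the activity
        W_fast = decay * W_fast + eta * np.outer(h, h)  # temporary Hebbian trace
        return h, W_fast

    x_cucumber = rng.normal(size=D)
    h, W_fast = step(x_cucumber, W_fast)   # an unexpected input leaves a trace
    for _ in range(5):                     # unrelated inputs follow; the trace decays
        h, W_fast = step(rng.normal(size=D), W_fast)
    print("residual trace norm:", np.linalg.norm(W_fast))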

40:11

🤔 Consciousness and feelings in neural networks

This exchange discusses whether neural networks can simulate human consciousness and feelings. The researcher argues that if a neural network can reflect on itself and has a persistent memory, it may develop something like human emotional experience. He also shares his own view of feelings and consciousness and how they relate to actions and constraints.

45:12

🎯 Future directions for neural network research

The researcher shares his views on future directions for neural network research, including his curiosity about whether the brain does backpropagation and his interest in learning over multiple time scales. He also discusses how to choose the right research problems and stresses the importance of curiosity in driving research.

🏆 Achievements and reflections in neural network research

In this exchange, the researcher reflects on his achievements in the field of neural networks, in particular his work on the Boltzmann machine learning algorithm. He expresses pride in that work even though it may not be practical. He also discusses what he is focused on now and his thoughts about the future.

Keywords

💡Neural network

A neural network is a mathematical model inspired by the structure of the human brain, used to simulate the way networks of neurons process information. In the video, neural networks are the core of the discussion, particularly around the development of AI and machine learning. For example, they come up as a way to model how the brain works and in how learning changes the weights of their connections.

💡Intuition

Intuition is a judgment or understanding reached quickly without explicit logical reasoning. In the video, intuition is mentioned as an important factor in choosing talent and in doing scientific research. For example, when Ilya appeared at the door, intuition said he was a clever person, which led to the decision to work with him.

💡Gradient

In mathematics and machine learning, the gradient is the vector of derivatives of a multivariable function, pointing in the direction in which the function increases fastest. The video mentions the importance of gradients for neural network learning, especially in the discussion of backpropagation, the algorithm that uses gradients to update the weights in a network.

💡Backpropagation

Backpropagation is the algorithm for computing the gradient of the loss function in a neural network, and it is the key step in training one. The video discusses the importance of backpropagation and how it lets a network learn complex tasks by adjusting its weights.

💡Multimodal data

Multimodal data combines several different kinds of data, such as text, images, and sound. The video mentions the potential of training large models on multimodal data, which may improve their understanding and reasoning.

💡Hidden Markov model

A hidden Markov model is a statistical model that describes a Markov process with hidden, unknown parameters. In the video, hidden Markov models are mentioned as an important class of algorithms, widely used in areas such as speech recognition.

💡Creativity

Creativity is the ability to produce novel and valuable ideas. The video discusses how large language models encode information by discovering common structure across different things, which may give rise to creativity.

💡Self-learning

Self-learning is the process by which a system improves itself through interaction with its environment. In the video, AlphaGo is used to illustrate how self-learning through self-play can develop innovative strategies and go beyond the current state of the art.

💡Simulation

Simulation means using a computer program or model to imitate a real-world process or system. The video mentions simulation particularly in the discussion of using neural networks to model how the brain works and to test theories.

💡Analogy

An analogy conveys new meaning or understanding by comparing things from two different domains. The video mentions how large language models use analogies to understand and compress information, for example by comparing a compost heap with an atom bomb.

💡Consciousness

Consciousness usually refers to an individual's awareness of their own inner states and of the external environment. The video discusses whether artificial intelligence can simulate human consciousness, and whether consciousness can be seen as a disposition to act rather than as the experience of an inner theater.

Highlights

At Carnegie Mellon, students and researchers were full of conviction about the future of computer science, believing that their work would change its course.

Studying physiology at Cambridge was disappointing as far as how the brain works was concerned, because the course only covered how neurons conduct action potentials and never explained how the brain works.

Turning to philosophy in search of an understanding of how the mind works was just as disappointing.

Studying artificial intelligence (AI) at Edinburgh was more interesting, because theories could be tested by simulation.

Donald Hebb's book was an important influence on thinking about how connection strengths in neural networks are learned.

John von Neumann's book sparked an interest in how the brain's way of computing differs from that of conventional computers.

During the Edinburgh years there was already a firm belief that the brain learns by modifying the connections in a neural network.

Collaboration with Terry Sejnowski of Johns Hopkins, working together on neural networks and how the brain works.

Collaboration with the statistician Peter Brown, who taught him about hidden Markov models.

Ilya's arrival and his intuition about the backpropagation algorithm, including the suggestion of giving the gradient to a sensible function optimizer.

Ilya's independent thinking and early interest in AI shaped the development of his intuitions.

In AI research, increases in the scale of data and computation have mattered more than new algorithms.

The character-level prediction paper demonstrated the ability of deep learning models to understand text.

Deep learning models understand a question by predicting the next symbol; this is not mere symbol prediction.

Large language models encode information by finding common structure, which makes them more efficient.

The development of multimodal models will improve models' spatial understanding, reasoning, and creativity.

On the relationship between language and cognition there are three different views; the most recent holds that language symbols are converted into rich embedding vectors, and that language is understood through the interactions of these vectors.

The early intuition to use GPUs for training neural networks, and its impact on the field of computing.
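
The observation behind that intuition, that neural-net training is dominated by matrix multiplies and that GPUs are built for exactly that operation, can be checked with a small timing sketch (assumes PyTorch is installed; matrix sizes are arbitrary and the measured speed-up will vary by hardware):

    import time
    import torch  # assumption: PyTorch is available; the GPU path needs a CUDA device

    # Neural-net training is mostly large matrix multiplies,
    # which is exactly the operation GPUs are built to do fast.
    A = torch.randn(4096, 4096)
    B = torch.randn(4096, 4096)

    t0 = time.time()
    _ = A @ B
    cpu_s = time.time() - t0

    if torch.cuda.is_available():
        Ag, Bg = A.cuda(), B.cuda()
        _ = Ag @ Bg                      # warm-up so the timing excludes startup cost
        torch.cuda.synchronize()
        t0 = time.time()
        _ = Ag @ Bg
        torch.cuda.synchronize()
        print(f"CPU: {cpu_s:.3f}s   GPU: {time.time() - t0:.3f}s")
    else:
        print(f"CPU: {cpu_s:.3f}s   (no CUDA device found)")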

The discussion of fast weights and their potential role in the brain.

The discussion of simulating consciousness, and the possibility of AI assistants developing human-like feelings and self-reflection.

On choosing the right research problems: the emphasis is on curiosity-driven research and on questioning what everyone agrees about.

The long-standing question of whether neural networks and the brain use backpropagation, and what it implies for future research.

Concerns about the possible harms of AI, including its use for bad purposes such as killer robots, manipulation of public opinion, or mass surveillance.

On how AI assistants may affect the process of AI research, including making research more efficient and helping with thinking.

On developing intuition: accept facts critically and trust your own intuitions.

On current research directions in AI: large models trained on multimodal data are a promising direction.

On the achievement he is most proud of: developing the learning algorithm for Boltzmann machines, even though it may be impractical in practice.

Transcripts

00:00

have

00:00

you reflected a lot on how to select

00:04

Talent or has that mostly been like

00:07

intuitive to you Ilia just shows up and

00:09

you're like this is a clever guy let's

00:11

let's work together or have you thought

00:13

a lot about that can we are we recording

00:15

should we should we roll This yeah let's

00:18

roll this okay we're good yeah

00:20

yeah

00:24

okay s is working

00:30

so I remember when I first got to Carnegie

00:32

Mellon from England in England at a

00:34

Research Unit it would get to be 6:00

00:36

and you'd all go for a drink in the pub

00:39

um at Carnegie Mellon I remember after I'd

00:41

been there a few weeks it was Saturday

00:43

night I didn't have any friends yet and

00:46

I didn't know what to do so I decided

00:47

I'd go into the lab and do some

00:48

programming because I had a Lisp machine

00:50

and you couldn't program it from home so

00:52

I went into the lab at about 9:00 on a

00:53

Saturday night and it was swarming all

00:57

the students were there and they were

00:59

all there because what they were working

01:01

on was the future they all believed that

01:03

what they did next was going to change

01:05

the course of computer science and it

01:07

was just so different from England and

01:09

so that was very refreshing take me back

01:12

to the very beginning Geoff at Cambridge

01:16

uh trying to understand the brain uh

01:18

what was that like it was very

01:21

disappointing so I did physiology and in

01:24

the summer term they were going to teach

01:25

us how the brain worked and it all they

01:27

taught us was how neurons conduct action

01:30

potentials which is very interesting but

01:32

it doesn't tell you how the brain works

01:34

so that was extremely disappointing I

01:36

switched to philosophy then I thought

01:38

maybe they'd tell us how the mind worked

01:40

um that was very disappointing I

01:42

eventually ended up going to Edinburgh

01:43

to do Ai and that was more interesting

01:46

at least you could simulate things so

01:48

you could test out theories and did you

01:50

remember what intrigued you about AI was

01:53

it a paper was it any particular person

01:56

that exposed you to those ideas I guess

01:59

it was a book I read by Donald Hebb that

02:01

influenced me a lot um he was very

02:05

interested in how you learn the

02:07

connection strengths in neural Nets I

02:09

also read a book by John von Neumann

02:11

early on um who was very interested in

02:15

how the brain computes and how it's

02:16

different from normal computers and did

02:19

you get that conviction that this ideas

02:22

would work out at at that point or what

02:25

would was your intuition back at the

02:27

Edinburgh days it seemed to me there has

02:31

to be a way that the brain

02:33

learns and it's clearly not by having

02:36

all sorts of things programmed into it

02:39

and then using logical rules of

02:40

inference that just seemed to me crazy

02:42

from the outset um so we had to figure

02:46

out how the brain learned to modify

02:49

Connections in a neural net so that it

02:50

could do complicated things and von

02:53

Neumann believed that Turing believed

02:55

that so von Neumann and Turing were both

02:57

pretty good at logic but they didn't

02:58

believe in this logical approach and

03:01

what was your split between studying the

03:03

ideas from from

03:05

neuroscience and just doing what seemed

03:08

to be good algorithms for for AI how

03:11

much inspiration did you take early on

03:13

so I never did that much study of

03:15

Neuroscience I was always inspired by

03:17

what I'd learned about how the brain

03:19

works that there's a bunch of neurons

03:21

they perform relatively simple

03:23

operations they're nonlinear um but they

03:26

collect inputs they weight them and then

03:29

they give an output that depends on that

03:31

weighted input and the question is how

03:33

do you change those weights to make the

03:34

whole thing do something good it seems

03:36

like a fairly simple question what

03:38

collaborations do you remember from from

03:41

that time the main collaboration I had

03:43

at Carnegie Mellon was with someone who

03:45

wasn't at Carnegie Mellon I was

03:47

interacting a lot with Terry Sejnowski

03:48

who was in Baltimore at Johns Hopkins

03:51

and about once a month either he would

03:53

drive to Pittsburgh or I'd drive to

03:54

Baltimore it's 250 miles away and we

03:57

would spend a weekend together working

03:58

on Boltzmann machines that was a

04:00

wonderful collaboration we were both

04:01

convinced it was how the brain worked

04:03

that was the most exciting research I've

04:05

ever done and a lot of technical results

04:07

came out that were very interesting but

04:09

I think it's not how the brain works um

04:11

I also had a very good collaboration

04:13

with um Peter Brown who was a very good

04:17

statistician and he worked on speech

04:19

recognition at IBM and then he came as a

04:22

more mature student to Carnegie Mellon just

04:24

to get a PhD um but he already knew a

04:27

lot he taught me a lot about speech

04:30

and he in fact taught me about hidden

04:31

Markov models I think I learn more from

04:33

him than he learned from me that's the

04:35

kind of student you want and when he taught

04:38

me about hidden Markov models I was

04:41

doing back propop with hidden layers

04:43

only they weren't called hidden layers

04:44

then and I decided that name they use in

04:47

Hidden Markov models is a great name for

04:49

variables that you don't know what

04:50

they're up to um and so that's where the

04:54

name hidden in neural nets came from me and

04:57

Peter decided that was a great name for the

04:59

hidden layers in neural nets um but

05:03

I learned a lot from Peter about speech

05:05

take us back to when Ilia showed up at

05:08

your at your office I was in my office I

05:11

probably on a Sunday um and I was

05:14

programming I think and there was a

05:16

knock on the door not just any knock but

05:17

it was

05:19

sort of an urgent knock so I

05:21

went and answered the door and this was

05:23

this young student there and he said he

05:25

was cooking Fries over the summer but

05:27

he'd rather be working in my lab and so

05:29

I said well why don't you make an

05:30

appointment and we'll talk and so Ilia

05:32

said how about now and that sort of was

05:35

Ila's character so we talked for a bit

05:38

and I gave him a paper to read which was

05:40

the nature paper on back

05:42

propagation and we made another meeting

05:45

for a week later and he came back and he

05:47

said I didn't understand it and I was

05:49

very disappointed I thought he seemed

05:50

like a bright guy but it's only the

05:52

chain rule it's not that hard to

05:54

understand and he said oh no no I

05:56

understood that I just don't understand

05:58

why you don't give the gradient to a

06:00

sensible function

06:02

Optimizer which took us quite a few

06:04

years to think about um and it kept on

06:07

like that with Ilia he had very good his

06:09

raw intuitions about things were always

06:11

very good what do you think had enabled

06:14

those uh those intuitions for for Ilia I

06:17

don't know I think he always thought for

06:19

himself he was always interested in AI

06:21

from a young age um he's obviously good

06:24

at math so but it's very hard to know

06:27

and what was that collaboration between

06:29

between the two of you like what part

06:32

would you play and what part would Ilia

06:34

play it was a lot of fun um I remember

06:37

one occasion when we were trying to do a

06:41

complicated thing with producing maps of

06:43

data where I had a kind of mixture model

06:46

so you could take the same bunch of

06:47

similarities and make two maps so that

06:50

in one map Bank could be close to Greed

06:52

and in another map Bank could be close

06:54

to River um cuz in one map you can't

06:57

have it close to both right cuz River

06:59

and greed are a long way apart so we'd have a

07:01

mixture of maps and we were doing it in

07:05

MATLAB and this involved a lot of

07:06

reorganization of the code to do the

07:08

right Matrix multiplies and only got fed

07:10

up with that so he came one day and said

07:12

um I'm going to write a an interface for

07:15

Matlab so I program in this different

07:17

language and then I have something that

07:19

just converts it into Matlab and I said

07:21

no Ilia um that'll take you a month to

07:24

do we've got to get on with this project

07:26

don't get diverted by that and I said

07:28

it's okay I did it this

07:32

morning and that's that's quite quite

07:34

incredible and throughout those those

07:37

years the biggest shift wasn't

07:40

necessarily just the the algorithms but

07:42

but also the the skill how did you sort

07:45

of view that skill uh over over the

07:49

years Ilia got that intuition very early

07:51

so Ilia was always preaching that um you

07:55

just make it bigger and it'll work

07:56

better and I always thought that was a

07:58

bit of a copout do you going to have to

07:59

have new ideas too it turns out I was

08:02

basically right new ideas help things

08:04

like Transformers helped a lot but it

08:06

was really the scale of the data and the

08:09

scale of the computation and back then

08:11

we had no idea computers would get like

08:13

a billion times faster we thought maybe

08:15

they' get a 100 times faster we were

08:17

trying to do things by coming up with

08:19

clever ideas that would have just solved

08:21

themselves if we had had bigger scale of

08:22

the data and computation in about

08:25

2011 Ilia and another graduate student

08:28

called James Martens and

08:30

had a paper using character level

08:32

prediction so we took Wikipedia and we

08:35

tried to predict the next HTML character

08:39

and that worked remarkably well and we

08:41

were always amazed at how well it worked

08:43

and that was using a fancy Optimizer on

08:47

gpus and we could never quite believe

08:50

that it understood anything but it

08:52

looked as though it

08:53

understood and that just seemed

08:55

incredible can you take us through how

08:58

are do models trained to predict the

09:01

next word and why is it the wrong way of

09:06

of thinking about them okay I don't

09:08

actually believe it is the wrong way so

09:12

in fact I think I made the first

09:13

neural net language model that used

09:15

embeddings and back propagation so it's

09:18

very simple data just

09:19

triples and it was turning each symbol

09:23

into an embedding then having the

09:25

embeddings interact to predict the

09:27

embedding of the next symbol and from

09:29

that predic the next symbol and then it

09:31

was back propagating through that whole

09:32

process to learn these triples and I

09:35

showed it could generalize um about 10

09:38

years later Yoshua Bengio used a very

09:40

similar Network and showed it work with

09:41

real text and about 10 years after that

09:44

linguist started believing in embeddings

09:46

it was a slow process the reason I think

09:49

it's not just predicting the next symbol

09:52

is if you ask well what does it take to

09:54

predict the next symbol particularly if

09:56

you ask me a question and then the first

09:59

word of the answer is the next symbol um

10:03

you have to understand the question so I

10:06

think by predicting the next

10:08

symbol it's very unlike oldfashioned

10:11

autocomplete oldfashioned autocomplete

10:13

you'd store sort of triples of words and

10:16

then if you saw a pair of words you see

10:18

how often different words came third and

10:20

that way you can predict the next symbol

10:22

and that's what most people think auto

10:23

complete is like it's no longer at all

10:26

like that um to predict the next symbol

10:28

you have to understand what's been said

10:30

so I think you're forcing it to

10:31

understand by making it predict the next

10:33

symbol and I think it's understanding in

10:36

much the same way we are so a lot of

10:38

people will tell you these things aren't

10:40

like us um they're just predicting the

10:42

next symbol they're not reasoning like

10:44

us but actually in order to predict the

10:47

next symbol it's have going to have to

10:48

do some reasoning and we've seen now

10:50

that if you make big ones without

10:52

putting in any special stuff to do

10:53

reasoning they can already do some

10:55

reasoning and I think as you make them

10:57

bigger they're going to be able to do

10:58

more and more reasoning do you think I'm

11:00

doing anything else than predicting the

11:01

next symbol right now I think that's how

11:04

you're learning I think you're

11:06

predicting the next video frame um

11:08

you're predicting the next sound um but

11:11

I think that's a pretty plausible theory

11:13

of how the brain's learning what enables

11:16

these models to learn such a wide

11:19

variety of of fields what these big

11:21

language models are doing is they

11:23

looking for common structure and by

11:25

finding common structure they can encode

11:27

things using the common structure and

11:29

that more efficient so let me give you

11:31

an example if you ask

11:33

GPT-4 why is a compost heap like an atom

11:36

bomb most people can't answer that most

11:39

people haven't thought they think atom

11:41

bombs and compost heaps are very

11:42

different things but GPT-4 will tell you

11:44

well the energy scales are very

11:46

different and the time scales are very

11:48

different but the thing that's the same

11:51

is that when the compost heap gets

11:52

hotter it generates heat faster and when

11:55

the atom bomb produces more neutrons

11:57

it produces more neutrons faster

12:00

and so it gets the idea of a chain

12:02

reaction and I believe it's understood

12:04

they're both forms of chain reaction

12:06

it's using that understanding to

12:08

compress all that information into its

12:09

weights and if it's doing that then it's

12:13

going to be doing that for hundreds of

12:15

things where we haven't seen the

12:16

analogies yet but it has and that's

12:18

where you get creativity from from

12:20

seeing these analogies between

12:21

apparently very different things and so

12:23

I think GPT-4 is going to end up when it

12:25

gets bigger being very creative I think

12:27

this idea that it's just just

12:29

regurgitating what it's learned just

12:31

pasing together text it's learned

12:33

already that's completely wrong it's

12:35

going to be even more creative than

12:37

people I think you'd argue that it won't

12:40

just repeat the human knowledge we've

12:43

developed so far but could also progress

12:46

beyond that I think that's something we

12:48

haven't quite seen yet we've started

12:51

seeing some examples of it but to a to a

12:53

large extent we're sort of still at the

12:56

current level of of of science what do

12:58

you think will enable it to go beyond

13:00

that well we've seen that in more

13:01

limited context like if you take Alpha

13:04

Go in that famous competition with Lee Sedol

13:08

um there was move 37 where Alpha go made

13:11

a move that all the experts said must

13:13

have been a mistake but actually later

13:15

they realized it was a brilliant move um

13:18

so that was created within that limited

13:20

domain um I think we'll see a lot more

13:22

of that as these things get bigger the

13:25

difference with AlphaGo as well was that

13:28

it was using reinforcement learning that

13:31

that subsequently sort of enabled it to

13:33

to go beyond the current state so it

13:35

started with imitation learning watching

13:37

how humans play the game and then it

13:39

would through selfplay develop Way

13:42

Beyond that do you think that's the

13:43

missing component of the I think that

13:46

may well be a missing component yes that

13:48

the the self-play in Alpha in Alpha go

13:51

and Alpha zero are are a large part of

13:54

why it could make these creative moves

13:56

but I don't think it's entirely

13:58

necessary

13:59

so there's a little experiment I did a

14:01

long time ago where you're training a

14:03

neural net to recognize handwritten digits

14:06

I love that example the MNIST example and

14:09

you give it training data where half the

14:11

answers are

14:12

wrong um and the question is how well

14:15

will it

14:17

learn and you make half the answers

14:20

wrong once and keep them like that so it

14:23

can't average away the wrongness by just

14:25

seeing the same example but with the

14:27

right answer sometimes and the wrong

14:28

answer sometimes when it sees that

14:29

example half half of the examples when

14:32

it sees the example the answer is always

14:33

wrong and so the training data has 50%

14:37

error but if you train up back

14:40

propagation it gets down to 5% error or

14:44

less other words from badly labeled data

14:49

it can get much better results it can

14:51

see that the training data is wrong and

14:54

that's how smart students can be smarter

14:55

than their advisor and their advisor

14:57

tells them all this stuff

14:59

and for half of what their advisor tells

15:01

them they think no rubbish and they

15:03

listen to the other half and then they

15:05

end up smarter than the advisor so these

15:06

big neural Nets can actually do they can

15:09

do much better than their training data

15:11

and most people don't realize that so
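
A sketch of the experiment described above, using stand-in ingredients (scikit-learn's small digits dataset instead of MNIST, and a small MLP; the exact error rates will differ from the 50%-to-5% result mentioned): corrupt roughly half the training labels once and for all, train with backpropagation, and measure error on clean test data.

    import numpy as np
    from sklearn.datasets import load_digits              # small stand-in for MNIST
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Corrupt about half the training labels once (each corrupted example keeps
    # its wrong label for the whole of training), fit with backprop, then score
    # on clean test data.  Dataset and network size are stand-ins, not the original setup.
    X, y = load_digits(return_X_y=True)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

    rng = np.random.default_rng(0)
    noisy = ytr.copy()
    flip = rng.random(len(noisy)) < 0.5                    # pick roughly half the examples
    noisy[flip] = (noisy[flip] + rng.integers(1, 10, flip.sum())) % 10  # always a wrong class

    net = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
    net.fit(Xtr, noisy)
    print(f"corrupted training labels: {flip.mean():.0%}")
    print(f"error on clean test set:   {1 - net.score(Xte, yte):.1%}")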

15:13

how how do you expect this models to add

15:16

reasoning in into them so I mean one

15:19

approach is you add sort of the

15:20

heuristics on on top of them which a lot

15:23

of the research is doing now where you

15:25

have sort of chain of thought you just

15:26

feed back its reasoning um in into

15:29

itself and another way would be in the

15:32

model itself as you scale scale scale it

15:34

up what's your intuition around that so

15:38

my intuition is that as we scale up

15:40

these models they get better at reasoning

15:42

and if you ask how people work roughly

15:44

speaking we have these

15:47

intuitions and we can do reasoning and

15:50

we use the reasoning to correct our

15:52

intuitions of course we use the

15:54

intuitions during the reasoning to do

15:55

the reasoning but if the conclusion of

15:57

the reasoning conflicts with our

15:58

intuitions we realize the intuitions need to

16:00

be changed that's much like in Alpha go

16:03

or Alpha zero where you have an

16:06

evaluation function um that just looks

16:09

at a board and says how good is that for

16:10

me but then you do the Monte Carlo roll

16:13

out and now you get a more accurate idea

16:17

and you can revise your evaluation

16:18

function so you can train it by getting

16:20

it to agree with the results of

16:22

reasoning and I think these large

16:23

language models have to start doing that

16:26

they have to start training their Raw

16:28

intuitions about what should come next

16:30

by doing reasoning and realizing that's

16:32

not right and so that way they can get

16:35

more training data than just mimicking

16:37

what people did and that's exactly why

16:40

AlphaGo could do this creative move 37 it

16:43

had much more training data because it

16:44

was using reasoning to check out what

16:47

the right next move should have been and

16:49

what do you think about multimodality so

16:52

we spoke about these analogies and often

16:54

the analogies are Way Beyond what we

16:56

could see it's discovering analogy that

16:59

are far beyond humans and at maybe

17:01

abstraction levels that we'll never be

17:03

able to to to understand now when we

17:06

introduce images to that and and video

17:09

and sound how do you think that will

17:11

change the models and uh how do you

17:14

think it will change the analogies that

17:16

it will be able to make um I think it'll

17:19

change it a lot I think it'll make it

17:21

much better at understanding spatial

17:23

things for example from language alone

17:26

it's quite hard to understand some

17:27

spatial things although remarkably GPT-4

17:30

can do that even before it was

17:32

multimodal um but when you make it

17:35

multimodal if you have it both doing

17:38

vision and reaching out and grabbing

17:40

things it'll understand object much

17:42

better if it can pick them up and turn

17:44

them over and so on so although you can

17:47

learn an awful lot from language it's

17:50

easier to learn if you multimodal and in

17:53

fact you then need less language and

17:55

there's an awful lot of YouTube video

17:57

for predicting the next frame so or

17:59

something like that so I think these

18:01

multimodal models are clearly going to

18:03

take over um you can get more data that

18:06

way they need less language so there's

18:08

really a philosophical point that you

18:10

could learn a very good model from

18:12

language alone but it's much easier to

18:14

learn it from a multimodal system and

18:16

how do you think it will impact the

18:18

model's reasoning I think it'll make it

18:21

much better at reasoning about space for

18:22

example reasoning about what happens if

18:24

you pick objects up if you actually try

18:26

picking objects up you're going to get

18:27

all sorts of training data that's going

18:29

to help do you think the human brain

18:32

evolved to work well with with language

18:35

or do you think language evolved to work

18:37

well with the human brain I think the

18:40

question of whether language evolved to

18:41

work with the brain or the brain evolved

18:43

to work with language I think that's a

18:44

very good question I think both happened

18:48

I used to think we would do a lot of

18:50

cognition without needing language at

18:52

all um now I've changed my mind a bit so

18:57

let me give you three different views of

18:59

language um and how it relates to

19:01

cognition there's the oldfashioned

19:03

symbolic view which is cognition

19:05

consists of having strings of symbols in

19:10

some kind of cleaned up logical language

19:12

where there's no ambiguity and applying

19:14

rules of inference and that's what

19:15

cognition is it's just these symbolic

19:17

manipulations on things that are like

19:19

strings of language symbols um so that's

19:22

one extreme view an opposite extreme

19:24

view is no no once you get inside the

19:27

head it's all vectors so symbols come in

19:30

you convert those symbols into big

19:32

vectors and all the stuff inside's done

19:34

with big vectors and then if you want to

19:36

produce output you produce symbols again

19:38

so there was a point in machine

19:40

translation in about

19:42

2014 when people were using neural

19:44

recurrent neural Nets and words will

19:46

keep coming in and that have a hidden

19:48

State and they keep accumulating

19:50

information in this hidden state so when

19:52

they got to the end of a sentence that

19:55

have a big hidden Vector that captures

19:56

the meaning of that sentence that could

19:59

then be used for producing the sentence

20:00

in another language that was called a

20:02

thought vector and that's a sort of

20:04

second view of language you convert the

20:05

language into a big Vector that's

20:08

nothing like language and that's what

20:10

cognition is all about but then there's

20:12

a third view which is what I believe now

20:15

which is that you take these

20:20

symbols and you convert the symbols into

20:23

embeddings and you use multiple layers

20:25

of that so you get these very rich

20:26

embeddings but the embeddings are still tied

20:28

to the symbols in the sense that you've

20:30

got a big Vector for this symbol and a

20:31

big Vector for that symbol and these

20:34

vectors interact to produce the vector

20:36

for the symbol for the next word and

20:39

that's what understanding is

20:40

understanding is knowing how to convert

20:42

the symbols into these vectors and

20:44

knowing how the elements of the vector

20:45

should interact to predict the vector

20:47

for the next symbol that's what

20:49

understanding is both in these big

20:50

language models and in our

20:52

brains and that's an example which is

20:55

sort of in between you're staying with

20:57

the symbols but you're interpreting them

21:00

as these big vectors and that's where

21:02

all the work is and all the knowledge is

21:04

in what vectors you use and how the

21:06

elements of those vectors interact not

21:08

in symbolic

21:09

rules um but it's not saying that you

21:13

get away from the symbols all together

21:14

it's saying you turn the symbols into

21:16

big vectors but you stay with that

21:18

surface structure of the symbols and

21:20

that's how these models are working and

21:22

that's I seem to be a more plausible

21:24

model of human thought too you were one

21:26

of the first folks to get idea of using

21:30

GPUs and I know Jensen loves you for

21:34

that uh back in 2009 you mentioned that

21:36

you told Jensen that this could be a

21:38

quite good idea um for for training

21:41

training neural Nets take us back to

21:43

that early intuition of of using gpus

21:46

for for training neural Nets so actually

21:48

I think in about

21:50

2006 I had a former graduate student

21:53

called Rick Szeliski who's a very good

21:55

computer vision guy and I talked to him

21:58

and a meeting and he said you know you

22:00

ought to think about using Graphics

22:02

processing cards because they're very

22:03

good at Matrix multiplies and what

22:05

you're doing is basically all matrix

22:07

multiplies so I thought about that for a

22:09

bit and then we learned about these

22:11

Tesla systems that had um four gpus in

22:16

and initially we just got um gaming gpus

22:21

and discovered they made things go 30

22:22

times faster and then we bought one of

22:24

these Tesla systems with 4 gpus and we

22:27

did speech on that and it worked very

22:30

well then in 2009 I gave a talk at NIPS

22:34

and I told a thousand machine learning

22:36

researches you should all go and buy

22:37

Nvidia gpus they're the future you need

22:39

them for doing machine learning and I

22:42

actually um then sent mail to Nvidia

22:45

saying I told a thousand machine

22:46

learning researchers to buy your boards

22:48

could you give me a free one and they

22:49

said no actually they didn't say no they

22:51

just didn't reply um but when I told

22:54

Jensen this story later on he gave me a

22:55

free

22:57

one that's uh that's very very good I I

23:00

think what's interesting is um as well

23:02

is sort of how gpus has evolved

23:05

alongside the the field so where where

23:07

do you think we we should go go next in

23:10

in the in the compute so my last couple

23:13

of years at Google I was thinking about

23:15

ways of trying to make analog

23:17

computation so that instead of using

23:19

like a megawatt we could use like 30

23:21

Watts like the brain and we could run

23:23

these big language models in analog

23:26

hardware and I never made it

23:29

work and but I started really

23:32

appreciating digital computation so if

23:36

you're going to use that low power

23:38

analog

23:39

computation every piece of Hardware is

23:41

going to be a bit different and the idea

23:43

is the learning is going to make use of

23:45

the specific properties of that hardware

23:47

and that's what happens with people all

23:49

our brains are different um so we can't

23:52

then take the weights in your brain and

23:54

put them in my brain the hardware is

23:56

different the precise properties of the

23:58

individual neurons are different the

23:59

learning has learned to

24:01

make use of all that and so we're mortal

24:04

in the sense that the weights in my

24:05

brain are no good for any other brain

24:07

when I die those weights are useless um

24:10

we can get information from one to

24:12

another rather

24:13

inefficiently by I produce sentences and

24:16

you figure out how to change your weight

24:18

so you would have said the same thing

24:20

that's called distillation but that's a

24:22

very inefficient way of communicating

24:24

knowledge and with digital systems

24:27

they're immortal because once you got

24:29

some weights you can throw away the

24:31

computer just store the weights on a

24:32

tape somewhere and now build another

24:34

computer put those same weights in and

24:36

if it's digital it can compute exactly

24:39

the same thing as the other system did

24:41

so digital systems can share weights and

24:45

that's incredibly much more efficient if

24:48

you've got a whole bunch of digital

24:50

systems and they each go and do a tiny

24:51

bit of

24:52

learning and they start with the same

24:54

weights they do a tiny bit of learning

24:56

and then they share their weights again

24:58

um they all know what all the others

24:59

learned we can't do that and so they're

25:03

far superior to us in being able to

25:04

share knowledge a lot of the ideas that

25:07

have been deployed in the field are very

25:10

old school ideas uh it's the ideas that

25:13

have been around the Neuroscience for

25:15

forever what do you think is sort of

25:17

left to to to apply to the systems that

25:19

we develop so one big thing that we

25:23

still have to catch up with Neuroscience

25:26

on is the time scales for changes so in

25:31

nearly all the neural Nets there's a

25:34

fast time scale for changing activities

25:35

so input comes in the activities the

25:38

embedding vectors all change and then

25:40

there's a slow time scale which is

25:41

changing the weights and that's

25:43

long-term learning and you just have

25:45

those two time scales in the brain

25:48

there's many time scales at which

25:49

weights change so for example if I say

25:53

an unexpected word like cucumber and now

25:56

5 minutes later you put headphones on

25:58

there's a lot of noise and there's very

26:00

faint words you'll be much better at

26:03

recognizing the word cucumber because I

26:05

said it 5 minutes ago so where is that

26:08

knowledge in the brain and that

26:10

knowledge is obviously in temporary

26:12

changes to synapses it's not neurons are

26:14

going cucumber cucumber cucumber you

26:16

don't have enough neurons for that it's

26:18

in temporary changes to the weights and

26:21

you can do a lot of things with

26:22

temporary weight changes fast what I

26:24

call fast weights we don't do that in

26:26

these neural models and the reason we

26:28

don't do it is because if you have

26:31

temporary changes to the weights that

26:33

depend on the input data then you can't

26:37

process a whole bunch of different cases

26:38

at the same time at present we take a

26:41

whole bunch of different strings we

26:43

stack them stack them together and we

26:45

process them all in parallel because

26:47

then we can do Matrix Matrix multiplies

26:48

which is much more efficient and just

26:51

that efficiency is stopping us using

26:53

fast weights but the brain clearly uses

26:56

fast weights for temporary memory and

26:59

there's all sorts of things you can do

27:00

that way that we don't do at present I

27:02

think that's one of the biggest things

27:03

we have to learn I was very hopeful that

27:04

things like Graphcore um if they went

27:08

sequential and did just online learning

27:11

then they could use fast weights

27:13

um but that hasn't worked out yet I

27:16

think it'll work out eventually when

27:18

people are using conductances for

27:19

weights how has knowing how this models

27:23

work and knowing how the brain works

27:26

impacted the way you you think I think

27:29

there's been one big impact which is at

27:33

a fairly abstract level which is that

27:35

for many

27:37

years people were very scornful about

27:40

the idea of having a big random neural

27:42

net and just giving a lot of training

27:44

data and it would learn to do

27:46

complicated things if you talk to

27:47

statisticians or linguists or most

27:50

people in AI they say that's just a pipe

27:53

dream there's no way you're going to

27:54

learn to do really complicated things

27:56

without some kind of innate knowledge

27:57

without a lot of architectural

27:59

restrictions it turns out that's

28:00

completely wrong you can take a big

28:03

random neural network and you can learn

28:04

a whole bunch of stuff just from data um

28:08

so the idea that stochastic gradient

28:10

descent to repeatedly adjust

28:13

the weights using a gradient that will

28:16

learn things and we'll learn big

28:17

complicated things that's been validated

28:21

by these big models and that's a very

28:23

important thing to know about the brain

28:25

it doesn't have to have all this innate

28:27

structure now obviously it's got a lot

28:28

of innate structure but it certainly

28:32

doesn't need innate structure for things

28:33

that are easily

28:35

learned and so the sort of idea coming

28:37

from Chomsky that you won't you won't

28:39

learn anything complicated like language

28:41

unless it's all kind of wired in already

28:43

and just matures that idea is now

28:46

clearly nonsense I'm sure Chomsky would

28:49

appreciate you calling his ideas

28:51

nonsense well I think actually I think a

28:54

lot of Chomsky's political ideas are very

28:56

sensible and I was struck by how how

28:59

come someone with such sensible ideas

29:00

about the Middle East could be so wrong

29:02

about

29:03

Linguistics what do you think would make

29:05

these models simulate consciousness of

29:09

of humans more effectively but imagine

29:12

you had the AI assistant that you've

29:14

spoken to in your entire life and

29:16

instead of that being you know like chat

29:19

today that sort of deletes the memory of

29:21

the conversation and you start fresh all

29:23

of the time okay it had

29:26

self-reflection at some point you you

29:28

pass away and you tell that to to the

29:32

assistant do you think I me not me

29:35

somebody else tells that toist yeah you

29:38

would it would be difficult for you to

29:39

tell that to the assistant um do you

29:42

think that assistant would would feel at

29:44

that point yes I think they can have

29:46

feelings too so I think just as we have

29:49

this inner theater model for perception

29:51

we have an inner theater model for feelings

29:53

they're things that I can experience but

29:55

other people can't um

29:59

I think that model is equally wrong so I

30:02

think suppose I say I feel like punching

30:04

Gary on the nose which I often do let's

30:07

try and Abstract that away from the idea

30:10

of an inner theater what I'm really

30:12

saying to you is um if it weren't for

30:16

the inhibition coming from my frontal

30:17

lobes I would perform an action so when

30:20

we talk about feelings we really talking

30:22

about um actions we would perform if it

30:25

weren't for um constraints and that

30:29

really that's really what feelings are

30:31

the actions we would do if it weren't

30:32

for

30:33

constraints um so I think you can give

30:36

the same kind of explanation for

30:37

feelings and there's no reason why these

30:39

things can't have feelings in fact in

30:42

1973 I saw a robot having an emotion so

30:46

in Edinburgh they had a robot with two

30:49

grippers like this that could assemble a

30:51

toy car if you put the pieces separately

30:54

on a piece of green felt um but if you

30:58

put them in a pile his vision wasn't

31:01

good enough to figure out what was going

31:02

on so it went whack with its gripper and it

31:05

knocked them so they were scattered and

31:06

then it could put them together if you

31:08

saw that in a person you say it was

31:10

cross with the situation because it

31:11

didn't understand it so it destroyed

31:13

it that's

31:16

profound you uh we spoke previously you

31:19

described sort of humans and and and and

31:22

the llms as analogy machines what do you

31:24

think has been the most powerful

31:27

analogies that you found throughout your

31:30

life oh in throughout my life um woo I

31:36

guess probably an a sort of weak analogy

31:40

that's influenced me a lot is um the

31:45

analogy between religious belief and

31:48

between belief in symbol

31:50

processing so when I was very young I

31:52

was confronted I came from an atheist

31:54

family and went to school and was

31:56

confronted with religious belief and it

31:58

just seemed nonsense to me it still

32:00

seems nonsense to me um and when I saw

32:03

symbol processing as an explanation how

32:04

people worked um I thought it was just

32:08

the same

32:10

nonsense I don't think it's quite so

32:12

much nonsense now because I think

32:15

actually we do do symbol processing it's

32:17

just we do it by giving these big

32:19

embedding vectors to the symbols but we

32:21

are actually symbol processing um but

32:24

not at all in the way people thought

32:25

where you match symbols and the only

32:27

thing a symbol has is it's identical to

32:29

another symbol or it's not identical

32:31

that's the only property a symbol has we

32:33

don't do that at all we use the context

32:35

to give embedding vectors to symbols and

32:37

then use the interactions between the

32:39

components of these embedding vectors to

32:41

do thinking but there's a very good

32:44

researcher at Google called Fernando

32:46

Pereira who said yes we do have symbolic

32:50

reasoning and the only symbolic stuff we have

32:52

is natural language natural language is

32:54

a symbolic language and we reason with

32:55

it and I believe that now you've done

32:58

some of the most meaningful uh research

33:00

in the history of of computer science

33:03

can you walk us through like how do you

33:05

select the right problems to to work on

33:08

well first let me correct you me and my

33:11

students have done a lot of the most

33:12

meaningful things and it's mainly been a

33:15

very good collaboration with students

33:17

and my ability to select very good

33:19

students and that came from the fact

33:21

that were very few people doing neural

33:23

Nets in the 70s and 80s and 90s and

33:25

2000s and so the few people doing your

33:28

nets got to pick the very best students

33:30

so that was a piece of luck but my way

33:33

of selecting problems is

33:35

basically well you know when scientists

33:37

talk about how they work they have

33:40

theories about how they work which

33:41

probably don't have much to do with the

33:42

truth but my theory is that

33:45

I look for something where everybody's

33:48

agreed about something and it feels

33:50

wrong just there's a slight intuition

33:52

there's something wrong about it and

33:54

then I work on that and see if I can

33:56

elaborate why it is I think it's wrong

33:58

and maybe I can make a little demo with

34:00

a small computer program that shows that

34:04

it doesn't work the way you might expect

34:06

so let me take one example um most

34:09

people think that if you add noise to a

34:11

neural net is going to work worse um if

34:14

for example each time you put a training

34:16

example through

34:19

you make half of the neurons be silent

34:22

it'll work worse actually we know it'll

34:26

generalize better if you do that
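
A minimal sketch of the idea just described, silencing half the hidden units at random during training (dropout) and using all of them at test time; layer sizes and the keep probability are illustrative assumptions.

    import numpy as np

    # During training each hidden unit is silenced with probability 1 - keep_prob;
    # at test time all units are used.  Dividing by keep_prob keeps the expected
    # activation the same in both modes.
    rng = np.random.default_rng(0)
    keep_prob = 0.5

    def hidden_layer(x, W, train):
        h = np.tanh(x @ W)                              # hidden activities
        if train:
            mask = rng.random(h.shape) < keep_prob      # half the neurons go silent
            h = h * mask / keep_prob
        return h

    W = rng.normal(0, 0.1, (32, 64))
    x = rng.normal(size=(4, 32))
    print("fraction silenced (train):", (hidden_layer(x, W, train=True) == 0).mean())
    print("fraction silenced (test): ", (hidden_layer(x, W, train=False) == 0).mean())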

34:28

and you can demonstrate that um in a

34:32

simple example that's what's nice about

34:34

computer simulation you can show you

34:36

know this idea you had that adding noise

34:38

is going to make it worse and sort of

34:39

dropping out half the neurons will make

34:41

it work worse which you will in the

34:42

short term but if you train it with like

34:45

that in the end it'll work better you

34:47

can demonstrate that with a small

34:48

computer program and then you can think

34:49

hard about why that is and how it stops

34:53

big elaborate co- adaptations um but

34:56

that I think that that's my method of

34:58

working find something that sounds

35:00

suspicious and work on it and see if you

35:03

can give a simple demonstration of why

35:05

it's wrong what sounds suspicious to you

35:07

now well that we don't use fast weight

35:10

sounds suspicious that we only have

35:12

these two time scales that's just wrong

35:14

that's not at all like the brain um and

35:17

in the long run I think we're going to

35:18

have to have many more time scales so

35:20

that's an example there and if you had

35:23

if you had your group of of students

35:25

today and they came to you and they said

35:26

so the Hamming question that we talked

35:27

about previously you know what's the

35:29

most important problem in in in your

35:31

field what would you suggest that they

35:33

take on and work on on next we spoke

35:36

about reasoning time scales what would

35:38

be sort of the highest priority Problem

35:40

that that you'd give them for me right

35:43

now it's the same question I've had for

35:45

the last like 30 years or so which is

35:48

does the brain do back propagation I

35:51

believe the brain is getting gradients

35:52

if you don't get gradients your learning

35:54

is just much worse than if you do get

35:56

gradients but how is the brain getting

35:58

gradients and is it

36:01

somehow implementing some approximate

36:03

version of back propagation or is it

36:04

some completely different technique

36:06

that's a big open question and if I kept

36:09

on doing research that's what I would be

36:11

doing research on and when you look back

36:13

at at your career now you've been right

36:16

about so many things but what were you

36:18

wrong about that you wish you sort of

36:20

spent less time pursuing a certain

36:23

direction okay those are two separate

36:25

questions one is what were you wrong

36:26

about and two do you wish you'd less

36:28

spent less time on it I think I was

36:31

wrong about Boltzmann machines and I'm glad

36:33

I spent a long time on it there are much

36:35

more beautiful theory of how you get

36:37

gradients than back propagation back

36:39

propagation is just ordinary and

36:40

sensible and it's just the chain rule Boltzmann

36:42

machines is clever and it's a very

36:44

interesting way to get gradients and I

36:47

would love for that to be how the brain

36:49

works but I think it isn't did you spend

36:52

much time imagining what would happen

36:54

post the systems developing as as well

36:57

did you have an idea that okay if we

36:59

could make these systems work really

37:00

well we could you know democratize

37:02

education we could make knowledge way

37:04

more accessible um we could solve some

37:07

tough problems in in in medicine or was

37:10

it more to you about understanding the

37:13

brain yes I I sort of feel scientists

37:17

ought to be doing things that are going

37:18

to help Society but actually that's not

37:22

how you do your best research you do

37:23

your best research when it's driven by

37:25

curiosity you just have to understand

37:28

something um much more recently I've

37:32

realized these things could do a lot of

37:33

harm as well as a lot of good and I've

37:35

become much more concerned about the

37:37

effects they're going to have on society

37:39

but that's not what was motivating me I

37:41

just wanted to understand how on Earth

37:42

can the brain learn to do things that's

37:45

what I want to know and I sort of failed

37:47

as a side effect of that failure we got

37:49

some nice engineering

37:51

but yeah it was a good good good failure

37:54

for the world if you take the lens of

37:56

the things that could go really right

37:59

what what do you think are the most

38:01

promising

38:02

applications I think Health Care is

38:05

clearly uh a big one um with Health Care

38:09

there's almost no end to how much Health

38:12

Care Society can absorb if you take

38:14

someone old they could use five doctors

38:18

fulltime um so when AI gets better than

38:21

people at doing things um you'd like it

38:25

to get better in areas where you could

38:27

do with a lot more of that stuff and we

38:30

could do with a lot more doctors if

38:32

everybody had three doctors of their own

38:33

that would be great and we're going to

38:35

get to that point um so that's one

38:38

reason why Healthcare is good there's

38:41

also just a new engineering developing

38:44

new materials for example for better

38:46

solar panels or for superconductivity

38:49

or for just understanding how the Body

38:52

Works um there's going to be huge

38:55

impacts there those are all going to be

38:57

be good things what I worry about is Bad

39:00

actors using them for bad things we've

39:02

facilitated people like Putin or Xi or

39:05

Trump

39:06

using AI for Killer Robots or for

39:10

manipulating public opinion or for Mass

39:12

surveillance and those are all very

39:14

worrying things are you ever concerned

39:17

that slowing down the field could also

39:20

slow down the positives oh absolutely

39:23

and I think there's not much chance that

39:26

the field will slow down partly because

39:29

it's International and if one country

39:31

slows down the other countries aren't

39:32

going to slow down so there's a race

39:35

clearly between China and the US and

39:37

neither is going to slow down so yeah I

39:39

don't I mean there was this petition

39:41

saying we should slow down for six

39:42

months I didn't sign it just because I

39:44

thought it was never going to happen I

39:46

maybe should have signed it because even

39:47

though it was never going to happen it

39:49

made a political point it's often good

39:51

to ask for things you know you can't get

39:53

just to make a point um but I didn't

39:55

think we're going to slow down and how

39:57

do you think that it will impact the AI

39:59

research process uh having uh this

40:03

assistance so I think it'll make it a

40:04

lot more efficient a research will get a

40:06

lot more efficient when you've got these

40:08

assistants that help you program um but

40:11

also help you think through things and

40:12

probably help you a lot with equations

40:14

too have you reflected much on the

40:17

process of selecting Talent has that

40:19

been mostly intuitive to you like when

40:22

Ilia shows up at the door you feel this

40:24

is smart guy let's work together so for

40:27

selecting Talent um sometimes you just

40:30

know so after talking to Ilia for not

40:32

very long he seemed very smart and then

40:35

talking him a bit more he clearly was

40:36

very smart and had very good intuitions

40:38

as well as being good at math so that

40:41

was a no-brainer there's another case

40:43

where I was at a NIPS conference um we

40:47

had a poster and I someone came up and

40:50

he started asking questions about the

40:52

poster and every question he asked was a

40:54

sort of deep insight into what we'd done

40:56

wrong um and after 5 minutes I offered

40:59

him a postdoc position that guy was David

41:01

MacKay who was just brilliant and it's

41:04

very sad he died but he was it was very

41:07

obvious you'd want him um other times

41:10

it's not so obvious and one thing I did

41:12

learn was that people are different

41:15

there's not just one type of good

41:17

student um so there's some students who

41:21

aren't that creative but are technically

41:23

extremely strong and will make anything

41:26

work there's other students who aren't

41:28

technically strong but are very creative

41:31

of course you want the ones who are both

41:32

but you don't always get that but I

41:34

think actually in the lab you need a

41:36

variety of different kinds of graduate

41:38

student but I still go with my gut

41:41

intuition that sometimes you talk to

41:43

somebody and they're just very very they

41:45

just get it and those are the ones you

41:48

want what do you think is the reason for

41:51

some folks having better intuition do

41:54

they just have better training data than

41:56

than others or how can you develop your

42:00

intuition I think it's partly they don't

42:03

stand for nonsense so here's a way to

42:06

get bad intuitions believe everything

42:08

you're told that's fatal you have to be

42:12

able to I think here's what some people

42:14

do they have a whole framework for

42:15

understanding reality and when someone

42:17

tells them something they try and sort

42:20

of figure out how that fits into their

42:22

framework and if it doesn't they just

42:24

reject it and that's a very good

42:28

strategy um people who try and

42:30

incorporate whatever they're told end up

42:33

with a framework that's sort of very

42:35

fuzzy and sort of can believe everything

42:38

and that's useless so I think actually

42:41

having a strong view of the world and

42:44

trying to manipulate incoming facts to

42:46

fit in with your view obviously it can

42:48

lead you into deep religious belief and

42:51

fatal flaws and so on like my belief in

42:53

Boltzmann machines um but I think that's

42:56

the way to go if you got good intuitions

42:58

you can trust you should trust them if

43:00

you got bad intuitions it doesn't matter

43:03

what you do so you might as well trust

43:05

them a very very good very good point

43:09

when when you look at the the types of

43:12

research that's that's that's being done

43:15

today do you think we're putting all of

43:17

our eggs in one basket and we should

43:19

diversify our ideas a bit more in in the

43:22

field or do you think this is the most

43:24

promising Direction so let's go all in

43:26

on it

43:28

I think having big models and training

43:30

them on multimodal data even if it's

43:33

only to predict the next word is such a

43:35

promising approach that we should go

43:37

pretty much all in on it obviously

43:39

there's lots and lots of people doing it

43:40

now and there's lots of people doing

43:42

apparently crazy things and that's good

43:45

um but I think it's fine for like most

43:47

of the people to be following this path

43:48

because it's working very well do you

43:50

think that the learning algorithms

43:54

matter that much or is it just a skill

43:56

are there basically millions of ways

43:59

that we could we could get to human

44:01

level in in intelligence or are there

44:03

sort of a select few that we need to

44:05

discover yes so this issue of whether

44:08

particular learning algorithms are very

44:10

important or whether there's a great

44:12

variety of learning algorithms that'll

44:13

do the job I don't know the answer it

44:16

seems to me though that back propagation

44:19

there's a sense in which it's the

44:20

correct thing to do getting the gradient

44:23

so that you change a parameter to make

44:24

it work better that seems like the right

44:27

thing to do and it's been amazingly

44:30

successful there may well be other

44:32

learning algorithms that are alternative

44:34

ways of getting that same gradient or

44:36

that are getting the gradient to

44:37

something else and that also work

44:40

um I think that's all open and a very

44:43

interesting issue now about whether

44:45

there's other things you can try and

44:47

maximize that will give you good systems

44:50

and maybe the brain's doing that because

44:51

it's

44:52

easier but backprop is in a sense the

44:55

right thing to do and we know that doing

44:57

it works really

44:59

well and one last question when when you

45:02

look back at your sort of Decades of

45:04

research what are you what are you most

45:05

proud of is it the students is it the

45:07

research what what makes you most proud

45:09

of when you look back at at your life's

45:11

work the learning algorithm for

45:14

Boltzmann machines so the learning

45:16

algorithm for Boltzmann machines is

45:17

beautifully elegant it's maybe hopeless

45:21

in practice um but it's the thing I

45:25

enjoyed most developing that with Terry

45:27

and it's what I'm proudest of um even if

45:31

it's

45:31

[Music]

45:36

wrong what questions do you spend most

45:39

of your time thinking about now is it

45:41

the um what what should I watch on

45:43

Netflix