In conversation | Geoffrey Hinton and Joel Hellermark
Summary
TLDR: This conversation covers a range of topics in deep learning and artificial intelligence. It discusses how to select talent and the role intuition plays in that choice. It looks back at research at Carnegie Mellon and the University of Edinburgh, and explores the connections between neural networks, deep learning, and how the brain works. It touches on early explorations of artificial intelligence, including collaborations with Terry Sejnowski and Peter Brown, and an interest in how the connection weights of neural networks are adjusted. It also discusses the potential of large language models and how they encode information by finding common structure, enabling creative analogies and reasoning. Finally, it covers the role of GPUs in training neural networks and thoughts on the future of computing.
Takeaways
- 🧠 The conversation offers deep insights into how the brain learns and how AI has developed, emphasizing the connection between the brain's learning mechanisms and AI algorithms.
- 🤖 It discusses early explorations of AI, including the interest in neural networks and machine learning, and the challenges and disappointments of early research.
- 🔍 It emphasizes the importance of intuition when choosing talent and research directions, and how the collaboration with Ilya drove progress in AI.
- 🤝 It describes collaborations with different scholars, such as Terry Sejnowski and Peter Brown, and how those collaborations shaped progress in AI.
- 📚 It discusses the early disappointment with philosophy and physiology, and the turn toward AI and neural network research.
- 💡 It mentions the influence of Donald Hebb's and John von Neumann's work on AI research, and their interest in neural networks and how the brain computes.
- 🧐 It emphasizes early intuitions about large neural networks and their potential, and how they could go beyond simple symbol processing.
- 🔗 It discusses how models are trained by predicting the next symbol or word, and how this forces them to understand.
- 🔢 It describes how gradients and optimizers are used to improve neural networks, and how Ilya's intuition helped push the research forward.
- 🌐 It discusses the importance of multimodal learning and how it helps models understand space and objects better.
- 🚀 It emphasizes the role of GPUs in training large neural networks, and how that technology propelled the whole field of AI.
Q & A
What was the lab environment like at Carnegie Mellon?
-At Carnegie Mellon, students were still in the lab programming on a Saturday night because they believed they were working on the future of computer science. That was a sharp contrast with the culture in England, where researchers would head to the pub after six in the evening.
What was it like studying brain science at Cambridge?
-Studying brain science at Cambridge was disappointing, because they only taught how neurons conduct action potentials and never really explained how the brain works. I then switched to philosophy but found no satisfying answers there either, and eventually studied artificial intelligence at Edinburgh, where I became interested in simulating how the brain operates.
What led you to become interested in artificial intelligence?
-I was influenced by a book by Donald Hebb, who was very interested in how learning adjusts the connection strengths of a neural network. A book by von Neumann also influenced me; he explored how the brain computes and how that differs from ordinary computers.
What was your collaboration with Terry Sejnowski like?
-Terry Sejnowski and I worked very closely on Boltzmann machines, meeting every month to do research and discuss it together. Although many of the technical results were interesting, we eventually concluded that this is not how the brain works.
What happened when Ilya Sutskever first came to find you?
-Ilya first came to find me on a Sunday. He knocked on the door and told me he had been frying chips over the summer but would rather work in my lab. I told him to make an appointment, but he wanted to talk right then. He later proved that both his intuitions and his mathematical ability were outstanding.
What is your view of large language models?
-I think that by predicting the next symbol, large language models are forced to understand what has already been said. Although some people claim these models are simply predicting the next symbol, that prediction actually requires a degree of reasoning, and they gradually become more creative and intelligent as a result.
What is your view of multimodal models?
-Multimodal models can combine vision, sound, and other data sources to improve their understanding of space and objects. Such models can learn not only from language but also from video and image data, which significantly improves their reasoning and understanding.
What are your main concerns about the direction of AI development?
-My main concerns include AI's applications in healthcare and its possible social impact. AI has the potential to greatly improve the efficiency of healthcare, but it could also be misused, for example for mass surveillance or manipulating public opinion. We need to treat these potential downsides with caution as we develop the technology.
What are your insights on training large neural networks?
-I think backpropagation is the right thing to do in large-scale neural network training: you obtain the gradient and use it to adjust the parameters, and this approach has been very successful in practice. Other alternatives may exist, but backpropagation has proven effective both in theory and in practice.
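To make this concrete, here is a minimal, illustrative sketch (not Hinton's code; the tiny network, data, and learning rate are invented) of backpropagation as "just the chain rule": gradients of the loss are computed layer by layer and then used to adjust the parameters.

```python
import numpy as np

# A two-layer net y = w2 * tanh(w1 * x) with a squared-error loss.
# Gradients come from the chain rule and are used to adjust the parameters.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(), rng.normal()
x, target = 1.5, 0.3
lr = 0.1

for step in range(100):
    h = np.tanh(w1 * x)            # hidden activation
    y = w2 * h                     # output
    loss = 0.5 * (y - target) ** 2

    # Backward pass (chain rule):
    dy = y - target                # dL/dy
    dw2 = dy * h                   # dL/dw2
    dh = dy * w2                   # dL/dh
    dw1 = dh * (1 - h ** 2) * x    # dL/dw1, since d tanh(u)/du = 1 - tanh(u)^2

    # Plain gradient descent: change each parameter to make the loss smaller.
    w1 -= lr * dw1
    w2 -= lr * dw2

print(f"final loss {loss:.6f}, prediction {y:.3f}, target {target}")
```

Ilya's early question in the transcript, why not hand that gradient to a sensible function optimizer, amounts to replacing the two plain update lines above with a more sophisticated optimizer.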
How do you select and develop talent effectively?
-When selecting and developing talent, intuition is sometimes very important. For example, my first meeting with Ilya let me sense his talent. I also think a lab needs a diverse mix of students: some are technically very strong, while others are very creative. Different kinds of students working together produce better research.
Outlines
🤖 A journey of exploration in AI and neural networks
This section recounts the researcher's time at Carnegie Mellon and his early exploration of artificial intelligence and neural networks. He recalls his disappointment studying physiology and philosophy at Cambridge, because neither answered his questions about how the brain works. He eventually went to Edinburgh to study AI and was drawn to books by Donald Hebb and John von Neumann about neural networks and how the brain computes. He believes the brain learns not through logical rules but by changing the connection strengths in a neural network.
👨💼 Research collaborations and the role of intuition in selecting talent
This section describes the researcher's collaboration with Terry Sejnowski, who was not at Carnegie Mellon, and how they studied neural networks together. The researcher stresses the importance of intuition in selecting talent, sharing how he chose a student like Ilya on intuition, and highlights Ilya's early interest in and intuitions about math and AI. He also mentions the collaboration with Peter Brown, a statistician whose work on hidden Markov models was an important influence.
🚀 Where neural networks and intuition meet
This dialogue shows the interaction between the researcher and Ilya, a student with deep insight into neural networks and optimizers. They discuss gradient descent and function optimizers, and how Ilya quickly understood and questioned existing methods for training neural networks. The researcher shares how much fun it was to work with Ilya and how solving problems together pushed the field of AI forward.
🧠 How neural networks learn and understand
The researcher discusses how neural networks learn language by predicting the next symbol, arguing that this forces the model to understand and so produces reasoning similar to humans'. He emphasizes that large neural networks can reason and may become more creative as they scale. He also cites AlphaGo as an example of how, within a specific domain, reinforcement learning can produce innovations beyond existing knowledge.
🔍 Reasoning in neural networks and multimodal learning
This dialogue explores how extending neural networks to multimodal data (such as images, video, and sound) strengthens their understanding and reasoning. The researcher believes multimodal learning will make models far more capable of spatial understanding and will help uncover deep connections between different domains. He also discusses whether the human brain evolved for language and how language interacts with cognition.
💡 Innovation in neural networks and future development
The researcher shares his view of the future of neural networks: he believes they will become more efficient by discovering common structure between different things and may surpass humans in creativity. He also discusses training models to correct themselves in order to improve their reasoning, and predicts how multimodal models will change the field of AI.
🔧 Computation and hardware for neural networks
This dialogue looks back at how the researcher foresaw the potential of GPUs for training neural networks and shares his early work in that area. He also discusses where computing should go next, including the trade-offs between analog and digital computation and how to make AI systems more efficient and energy-saving.
🌟 Time scales in neural networks and the brain
The researcher explores the difference in time scales between the brain and neural networks, pointing out that the brain changes its weights on many time scales, whereas current neural network models usually have only two. He believes future neural networks will need to incorporate more time scales to come closer to how the brain works.
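As a rough illustration of the fast-weights idea summarized above (and discussed at length in the transcript), here is a toy sketch with two time scales: slow weights learned over the long term, plus rapidly decaying, input-dependent fast weights that act as temporary memory. The layer sizes, decay rate, and Hebbian-style update rule are assumptions made for illustration, not a published scheme.

```python
import numpy as np

class FastSlowLayer:
    """Toy linear layer with two time scales: slow weights (long-term learning)
    plus fast weights that change with every input via a Hebbian-style
    outer-product update and decay quickly (temporary memory)."""

    def __init__(self, n_in, n_out, decay=0.9, fast_lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.slow = rng.normal(scale=0.1, size=(n_out, n_in))  # long-term weights
        self.fast = np.zeros((n_out, n_in))                    # temporary weights
        self.decay, self.fast_lr = decay, fast_lr

    def forward(self, x):
        y = (self.slow + self.fast) @ x
        # Fast weights decay toward zero and are nudged by the current
        # input/output pair, so a recently seen pattern is easier to recall
        # (the "cucumber five minutes ago" effect in the transcript).
        self.fast = self.decay * self.fast + self.fast_lr * np.outer(y, x)
        return y

layer = FastSlowLayer(n_in=4, n_out=3)
x = np.array([1.0, 0.0, -1.0, 0.5])
first = layer.forward(x)
second = layer.forward(x)   # response to the same input changes: short-term memory
print(first, second, sep="\n")
```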
🤔 Consciousness and feelings in neural networks
This dialogue discusses whether neural networks can simulate human consciousness and feelings. The researcher believes that if neural networks can reflect on themselves and have persistent memory, they may develop something like human emotional experience. He also shares his own view of feelings and consciousness and how they relate to actions and constraints.
🎯 Future directions for neural network research
The researcher shares his view of future directions for neural network research, including his curiosity about whether the brain uses backpropagation and his interest in learning at multiple time scales. He also discusses how to choose the right research problems and emphasizes the importance of curiosity in driving research.
🏆 Achievements and reflections in neural network research
In this dialogue the researcher reflects on his achievements in neural networks, in particular his work on the Boltzmann machine learning algorithm. He expresses pride in that work even though it may not be practical. He also discusses what he focuses on now and his thoughts about the future.
Keywords
💡 Neural networks
💡 Intuition
💡 Gradient
💡 Backpropagation
💡 Multimodal data
💡 Hidden Markov models
💡 Creativity
💡 Self-learning
💡 Simulation
💡 Analogy
💡 Consciousness
Highlights
At Carnegie Mellon, students and researchers were full of conviction about the future of computer science, believing their work would change its course.
Disappointment with studying physiology at Cambridge: all that was taught was how neurons conduct action potentials, which does not explain how the brain works.
Turning to philosophy in search of an understanding of how the mind works was equally disappointing.
Studying artificial intelligence (AI) at Edinburgh was more interesting, because theories could be tested through simulation.
Donald Hebb's book was an important influence on understanding how connection strengths are learned in neural networks.
John von Neumann's book fed an interest in how the brain's way of computing differs from conventional computers.
During the Edinburgh years, a firm conviction that the brain learns by modifying the connections in a neural network.
Collaboration with Terry Sejnowski of Johns Hopkins, jointly studying neural networks and how the brain works.
Collaboration with the statistician Peter Brown, and learning about hidden Markov models from him.
Ilya's arrival and his intuition about the backpropagation algorithm, proposing the idea of giving the gradient to an optimizer.
Ilya's independent thinking and early interest in AI shaped the development of his intuitions.
In AI research, increases in the scale of data and computation have mattered more than new algorithms.
A paper using character-level prediction showed deep learning models' ability to understand text.
Deep learning models understand a problem by predicting the next symbol; it is not mere symbol prediction.
Large language models encode information by finding common structure, which makes them more efficient.
The development of multimodal models will improve models' spatial understanding, reasoning, and creativity.
On the relationship between language and cognition there are three different views; the most recent holds that language symbols are converted into rich embedding vectors, and understanding comes from the interaction of those vectors.
The early intuition to use GPUs for training neural networks, and its impact on the field of computing.
Discussion of whether to use fast weights, and their potential role in the brain.
Discussion of simulating consciousness, and the possibility that AI assistants could develop human-like feelings and self-reflection.
On choosing the right research problems: the emphasis is on curiosity-driven research and on questioning views that everyone agrees with.
The long-standing question of whether neural networks and the brain use backpropagation, and what it implies for future research.
Concerns about the potential harms of AI, including misuse for killer robots, manipulating public opinion, or mass surveillance.
On how AI assistants will affect the AI research process, including making research more efficient and helping with thinking.
On developing intuition: accept facts critically and trust your own intuitions.
On current research directions in AI: large models trained on multimodal data are a promising direction.
On the achievement he is most proud of: developing the learning algorithm for Boltzmann machines, even though it may be impractical.
Transcripts
have
you reflected a lot on how to select
Talent or has that mostly been like
intuitive to you Ilia just shows up and
you're like this is a clever guy let's
let's work together or have you thought
a lot about that can we are we recording
should we should we roll This yeah let's
roll this okay we're good yeah
yeah
okay s is working
so I remember when I first got to Carnegie Mellon from England in England at a
Research Unit it would get to be 6:00
and you'd all go for a drink in the pub
um at Carnegie Mellon I remember after I'd
been there a few weeks it was Saturday
night I didn't have any friends yet and
I didn't know what to do so I decided
I'd go into the lab and do some
programming because I had a Lisp machine
and you couldn't program it from home so
I went into the lab at about 9:00 on a
Saturday night and it was swarming all
the students were there and they were
all there because what they were working
on was the future they all believed that
what they did next was going to change
the course of computer science and it
was just so different from England and
so that was very refreshing take me back
to the very beginning Jeff at Cambridge
uh trying to understand the brain uh
what was that like it was very
disappointing so I did physiology and in
the summer term they were going to teach
us how the brain worked and it all they
taught us was how neurons conduct action
potentials which is very interesting but
it doesn't tell you how the brain works
so that was extremely disappointing I
switched to philosophy then I thought
maybe they'd tell us how the mind worked
um that was very disappointing I
eventually ended up going to Edinburgh
to do Ai and that was more interesting
at least you could simulate things so
you could test out theories and did you
remember what intrigued you about AI was
it a paper was it any particular person
that exposed you to those ideas I guess
it was a book I read by Donald Hebb that
influenced me a lot um he was very
interested in how you learn the
connection strengths in neural Nets I
also read a book by John von Neumann
early on um who was very interested in
how the brain computes and how it's
different from normal computers and did
you get that conviction that this ideas
would work out at at that point or what
would was your intuition back at the
Edinburgh days it seemed to me there has
to be a way that the brain
learns and it's clearly not by having
all sorts of things programmed into it
and then using logical rules of
inference that just seemed to me crazy
from the outset um so we had to figure
out how the brain learned to modify
Connections in a neural net so that it
could do complicated things and von Neumann believed that Turing believed that so von Neumann and Turing were both
pretty good at logic but they didn't
believe in this logical approach and
what was your split between studying the
ideas from from
neuroscience and just doing what seemed
to be good algorithms for for AI how
much inspiration did you take early on
so I never did that much study of
Neuroscience I was always inspired by
what I'd learned about how the brain
works that there's a bunch of neurons
they perform relatively simple
operations they're nonlinear um but they
collect inputs they weight them and then they give an output that depends on that
weighted input and the question is how
do you change those weights to make the
whole thing do something good it seems
like a fairly simple question what
collaborations do you remember from from
that time the main collaboration I had
at Carnegie Mellon was with someone who wasn't at Carnegie Mellon I was interacting a lot with Terry Sejnowski who was in Baltimore at Johns Hopkins
and about once a month either he would
drive to Pittsburgh or I'd drive to
Baltimore it's 250 miles away and we
would spend a weekend together working
on Boltzmann machines that was a
wonderful collaboration we were both
convinced it was how the brain worked
that was the most exciting research I've
ever done and a lot of technical results
came out that were very interesting but
I think it's not how the brain works um
I also had a very good collaboration
with um Peter Brown who was a very good
statistician and he worked on speech
recognition at IBM and then he came as a
more mature student to Carnegie Mellon just
to get a PhD um but he already knew a
lot he taught me a lot about speech
and he in fact taught me about hidden
Markov models I think I learn more from
him than he learned from me that's the
kind of student you want and when he taught
me about hidden Markov models I was
doing back propop with hidden layers
only they weren't called hidden layers
then and I decided that name they use in
Hidden Markov models is a great name for
variables that you don't know what
they're up to um and so that's where the name hidden in neural nets came from me and Peter decided that was a great name for the hidden layers in neural nets um but
I learned a lot from Peter about speech
take us back to when Ilia showed up at
your at your office I was in my office I
probably on a Sunday um and I was
programming I think and there was a
knock on the door not just any knock but a sort of urgent knock so I
went and answer to the door and this was
this young student there and he said he
was cooking Fries over the summer but
he'd rather be working in my lab and so
I said well why don't you make an
appointment and we'll talk and so Ilia
said how about now and that sort of was
Ila's character so we talked for a bit
and I gave him a paper to read which was
the nature paper on back
propagation and we made another meeting
for a week later and he came back and he
said I didn't understand it and I was
very disappointed I thought he seemed
like a bright guy but it's only the
chain rule it's not that hard to
understand and he said oh no no I
understood that I just don't understand
why you don't give the gradient to a
sensible function
Optimizer which took us quite a few
years to think about um and it kept on
like that with Ilia he had very good his
raw intuitions about things were always
very good what do you think had enabled
those uh those intuitions for for Ilia I
don't know I think he always thought for
himself he was always interested in AI
from a young age um he's obviously good
at math so but it's very hard to know
and what was that collaboration between
between the two of you like what part
would you play and what part would Ilia
play it was a lot of fun um I remember
one occasion when we were trying to do a
complicated thing with producing maps of
data where I had a kind of mixture model
so you could take the same bunch of
similarities and make two maps so that
in one map Bank could be close to Greed
and in another map Bank could be close
to River um cuz in one map you can't
have it close to both right cuz River
and greed are a long way apart so we'd have a mixture of maps and we were doing it in Matlab and this involved a lot of
reorganization of the code to do the
right matrix multiplies and Ilia got fed
up with that so he came one day and said
um I'm going to write a an interface for
Matlab so I program in this different
language and then I have something that
just converts it into Matlab and I said
no Ilia um that'll take you a month to
do we've got to get on with this project
don't get diverted by that and I said
it's okay I did it this
morning and that's that's quite quite
incredible and throughout those those
years the biggest shift wasn't
necessarily just the the algorithms but
but also the the skill how did you sort
of view that skill uh over over the
years Ilia got that intuition very early
so Ilia was always preaching that um you
just make it bigger and it'll work
better and I always thought that was a
bit of a copout you're going to have to
have new ideas too it turns out I was
basically right new ideas help things
like Transformers helped a lot but it
was really the scale of the data and the
scale of the computation and back then
we had no idea computers would get like
a billion times faster we thought maybe
they' get a 100 times faster we were
trying to do things by coming up with
clever ideas that would have just solved
themselves if we had had bigger scale of
the data and computation in about
2011 Ilia and another graduate student
called James Martens
had a paper using character level
prediction so we took Wikipedia and we
tried to predict the next HTML character
and that worked remarkably well and we
were always amazed at how well it worked
and that was using a fancy Optimizer on
gpus and we could never quite believe
that it understood anything but it
looked as though it
understood and that just seemed
incredible can you take us through how
are do models trained to predict the
next word and why is it the wrong way of
of thinking about them okay I don't
actually believe it is the wrong way so
in fact I think I made the first
neuronet language model that used
embeddings and back propagation so it's
very simple data just
triples and it was turning each symbol
into an embedding then having the
embeddings interact to predict the
embedding of the next symbol and from
that predic the next symbol and then it
was back propagating through that whole
process to learn these triples and I
showed it could generalize um about 10 years later Yoshua Bengio used a very similar network and showed it worked with real text and about 10 years after that linguists started believing in embeddings it was a slow process
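A hedged sketch of the kind of model described here: each symbol gets an embedding, the embeddings of two symbols interact to predict the next symbol, and the whole process is trained by backpropagation. The vocabulary, dimensions, toy triples, and the PyTorch implementation are my own choices for illustration; this is not the original model.

```python
import torch
import torch.nn as nn

# Toy "triples": given the first two symbol ids, predict the third (data is made up).
vocab_size = 10
triples = torch.tensor([[0, 1, 2], [3, 4, 5], [0, 4, 6], [3, 1, 7]])

class TripleModel(nn.Module):
    def __init__(self, vocab_size, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)    # each symbol gets an embedding vector
        self.interact = nn.Linear(2 * dim, dim)       # the two embeddings interact
        self.readout = nn.Linear(dim, vocab_size)     # predicted embedding -> next-symbol logits

    def forward(self, a, b):
        h = torch.tanh(self.interact(torch.cat([self.embed(a), self.embed(b)], dim=-1)))
        return self.readout(h)

model = TripleModel(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):                               # backpropagate through the whole process
    logits = model(triples[:, 0], triples[:, 1])
    loss = loss_fn(logits, triples[:, 2])
    opt.zero_grad()
    loss.backward()
    opt.step()

preds = model(triples[:, 0], triples[:, 1]).argmax(dim=-1)
print("final loss:", loss.item(), "predictions:", preds.tolist())  # should recover [2, 5, 6, 7]
```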
the reason I think
it's not just predicting the next symbol
is if you ask well what does it take to
predict the next symbol particularly if
you ask me a question and then the first
word of the answer is the next symbol um
you have to understand the question so I
think by predicting the next
symbol it's very unlike oldfashioned
autocomplete oldfashioned autocomplete
you'd store sort of triples of words and
then if you sort a pair of words you see
how often different words came third and
that way you can predict the next symbol
and that's what most people think auto
complete is like it's no longer at all
like that um to predict the next symbol
you have to understand what's been said
so I think you're forcing it to
understand by making it predict the next
symbol and I think it's understanding in
much the same way we are so a lot of
people will tell you these things aren't
like us um they're just predicting the
next symbol they're not reasoning like
us but actually in order to predict the
next symbol it's going to have to
do some reasoning and we've seen now
that if you make big ones without
putting in any special stuff to do
reasoning they can already do some
reasoning and I think as you make them
bigger they're going to be able to do
more and more reasoning do you think I'm
doing anything else than predicting the
next symbol right now I think that's how
you're learning I think you're
predicting the next video frame um
you're predicting the next sound um but
I think that's a pretty plausible theory
of how the brain's learning what enables
these models to learn such a wide
variety of of fields what these big
language models are doing is they
looking for common structure and by
finding common structure they can encode
things using the common structure and
that more efficient so let me give you
an example if you ask
GPT-4 why is a compost heap like an atom bomb most people can't answer that most people haven't thought they think atom bombs and compost heaps are very different things but GPT-4 will tell you well the energy scales are very different and the time scales are very different but the thing that's the same is that when the compost heap gets hotter it generates heat faster and when the atom bomb produces more neutrons
it produces more neutrons faster
and so it gets the idea of a chain
reaction and I believe it's understood
they're both forms of chain reaction
it's using that understanding to
compress all that information into its
weights and if it's doing that then it's
going to be doing that for hundreds of
things where we haven't seen the
analogies yet but it has and that's
where you get creativity from from
seeing these analogies between
apparently very different things and so
I think GPT-4 is going to end up when it
gets bigger being very creative I think
this idea that it's just just
regurgitating what it's learned just
pasing together text it's learned
already that's completely wrong it's
going to be even more creative than
people I think you'd argue that it won't
just repeat the human knowledge we've
developed so far but could also progress
beyond that I think that's something we
haven't quite seen yet we've started
seeing some examples of it but to a to a
large extent we're sort of still at the
current level of of of science what do
you think will enable it to go beyond
that well we've seen that in more
limited context like if you take AlphaGo in that famous competition with Lee Sedol um there was move 37 where AlphaGo made
a move that all the experts said must
have been a mistake but actually later
they realized it was a brilliant move um
so that was created within that limited
domain um I think we'll see a lot more
of that as these things get bigger the
difference with AlphaGo as well was that
it was using reinforcement learning that
that subsequently sort of enabled it to
to go beyond the current state so it
started with imitation learning watching
how humans play the game and then it
would through selfplay develop Way
Beyond that do you think that's the
missing component of the I think that
may well be a missing component yes that
the the self-play in AlphaGo and AlphaZero are a large part of
why it could make these creative moves
but I don't think it's entirely
necessary
so there's a little experiment I did a
long time ago where you're training a neural net to recognize handwritten digits I love that example the MNIST example and
you give it training data where half the
answers are
wrong um and the question is how well
will it
learn and you make half the answers
wrong once and keep them like that so it
can't average away the wrongness by just
seeing the same example but with the
right answer sometimes and the wrong
answer sometimes when it sees that
example half half of the examples when
it sees the example the answer is always
wrong and so the training data has 50%
error but if you train up back
propagation it gets down to 5% error or
less other words from badly labeled data
it can get much better results it can
see that the training data is wrong and
that's how smart students can be smarter
than their advisor and their advisor
tells them all this stuff
and for half of what their advisor tells
them they think no rubbish and they
listen to the other half and then they
end up smarter than the advisor so these
big neural Nets can actually do they can
do much better than their training data
and most people don't realize that
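The corrupted-label experiment described here is easy to reproduce in spirit. The following is only a sketch, with my own assumptions standing in for the original setup (scikit-learn's small digits dataset rather than MNIST, an arbitrary network size and iteration budget): half the training labels are corrupted once and kept that way, yet the test error typically ends up far below the roughly 50% error rate in the training labels.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Corrupt half the training labels once and keep them corrupted,
# so the network cannot average the wrongness away.
rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.5
y_noisy[flip] = rng.integers(0, 10, size=flip.sum())

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_noisy)

# About 45% of the training labels are wrong (some random flips land on the true class),
# yet the error on clean test data is typically far lower.
print("training-label error rate:", round(float(np.mean(y_noisy != y_train)), 3))
print("test error rate:          ", round(1 - clf.score(X_test, y_test), 3))
```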
so how do you expect these models to add
reasoning in into them so I mean one
approach is you add sort of the
heuristics on on top of them which a lot
of the research is doing now where you
have sort of chain of thought you just
feedback it's reasoning um in into
itself and another way would be in the
model itself as you scale scale scale it
up what's your intuition around that so
my intuition is that as we scale up
these models they get better at reasoning
and if you ask how people work roughly
speaking we have these
intuitions and we can do reasoning and
we use the reasoning to correct our
intuitions of course we use the
intuitions during the reasoning to do
the reasoning but if the conclusion of
the reasoning conflicts with our intuitions we realize the intuitions need to
be changed that's much like in Alpha go
or Alpha zero where you have an
evaluation function um that just looks
at a board and says how good is that for
me but then you do the Monte Carlo roll
out and now you get a more accurate idea
and you can revise your evaluation
function so you can train it by getting
it to agree with the results of
reasoning and I think these large
language models have to start doing that
they have to start training their Raw
intuitions about what should come next
by doing reasoning and realizing that's
not right and so that way they can get
more training data than just mimicking
what people did and that's exactly why
AlphaGo could do this creative move 37 it
had much more training data because it
was using reasoning to check out what
the right next move should have been
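A very abstract sketch of the loop described above: a fast intuition (a learned value estimate) is repeatedly nudged to agree with the output of a slower, more accurate reasoning procedure. The linear model, the stand-in reasoning function, and the update rule are placeholders for illustration, not AlphaGo's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                        # parameters of a fast linear "intuition" (value estimate)

def intuition(state):
    return float(w @ state)            # cheap evaluation of a position

def reasoning(state):
    # Stand-in for slow search / rollouts: noisy, but much closer to the truth.
    return float(np.tanh(state.sum())) + rng.normal(scale=0.05)

lr = 0.1
for step in range(2000):
    state = rng.normal(size=5)         # a random "position"
    target = reasoning(state)          # the result of reasoning about that position
    error = target - intuition(state)
    w += lr * error * state            # nudge the intuition toward agreeing with the reasoning

print("learned weights:", np.round(w, 2))
```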
and what do you think about multimodality so
we spoke about these analogies and often
the analogies are Way Beyond what we
could see it's discovering analogy that
are far beyond humans and at maybe
abstraction levels that we'll never be
able to to to understand now when we
introduce images to that and and video
and sound how do you think that will
change the models and uh how do you
think it will change the analogies that
it will be able to make um I think it'll
change it a lot I think it'll make it
much better at understanding spatial
things for example from language alone
it's quite hard to understand some
spatial things although remarkably gp4
can do that even before it was
multimodal um but when you make it
multimodal if you have it both doing
vision and reaching out and grabbing
things it'll understand object much
better if it can pick them up and turn
them over and so on so although you can
learn an awful lot from language it's
easier to learn if you multimodal and in
fact you then need less language and
there's an awful lot of YouTube video
for predicting the next frame so or
something like that so I think these
multimodal models are clearly going to
take over um you can get more data that
way they need less language so there's
really a philosophical point that you
could learn a very good model from
language alone but it's much easier to
learn it from a multimodal system and
how do you think it will impact the
model's reasoning I think it'll make it
much better at reasoning about space for
example reasoning about what happens if
you pick objects up if you actually try
picking objects up you're going to get
all sorts of training data that's going
to help do you think the human brain
evolved to work well with with language
or do you think language evolved to work
well with the human brain I think the
question of whether language evolved to
work with the brain or the brain evolved
to work with language I think that's a
very good question I think both happened
I used to think we would do a lot of
cognition without needing language at
all um now I've changed my mind a bit so
let me give you three different views of
language um and how it relates to
cognition there's the oldfashioned
symbolic view which is cognition
consists of having strings of symbols in
some kind of cleaned up logical language
where there's no ambiguity and applying
rules of inference and that's what
cognition is it's just these symbolic
manipulations on things that are like
strings of language symbols um so that's
one extreme view an opposite extreme
view is no no once you get inside the
head it's all vectors so symbols come in
you convert those symbols into big
vectors and all the stuff inside's done
with big vectors and then if you want to
produce output you produce symbols again
so there was a point in machine
translation in about
2014 when people were using neural
recurrent neural Nets and words will
keep coming in and that have a hidden
State and they keep accumulating
information in this hidden state so when
they got to the end of a sentence that
have a big hidden Vector that captures
the meaning of that sentence that could
then be used for producing the sentence
in another language that was called a
thought vector and that's a sort of
second view of language you convert the
language into a big Vector that's
nothing like language and that's what
cognition is all about but then there's
a third view which is what I believe now
which is that you take these
symbols and you convert the symbols into
embeddings and you use multiple layers
of that so you get these very rich
embeddings but the embeddings are still tied to the symbols in the sense that you've
got a big Vector for this symbol and a
big Vector for that symbol and these
vectors interact to produce the vector
for the symbol for the next word and
that's what understanding is
understanding is knowing how to convert
the symbols into these vectors and
knowing how the elements of the vector
should interact to predict the vector
for the next symbol that's what
understanding is both in these big
language models and in our
brains and that's an example which is
sort of in between you're staying with
the symbols but you're interpreting them
as these big vectors and that's where
all the work is and all the knowledge is
in what vectors you use and how the
elements of those vectors interact not
in symbolic
rules um but it's not saying that you
get away from the symbols all together
it's saying you turn the symbols into
big vectors but you stay with that
surface structure of the symbols and
that's how these models are working and
that seems to me a more plausible
model of human thought too you were one
of the first folks to get idea of using
gpus and I know Jensen loves you for
that uh back in 2009 you mentioned that
you told Jensen that this could be a
quite good idea um for for training
training neural Nets take us back to
that early intuition of of using gpus
for for training neural Nets so actually
I think in about
2006 I had a former graduate student
called Rick Szeliski who's a very good
computer vision guy and I talked to him
at a meeting and he said you know you
ought to think about using Graphics
processing cards because they're very
good at Matrix multiplies and what
you're doing is basically all matrix
multiplies so I thought about that for a
bit and then we learned about these
Tesla systems that had um four gpus in
and initially we just got um gaming gpus
and discovered they made things go 30
times faster and then we bought one of
these Tesla systems with 4 gpus and we
did speech on that and it worked very
well then in 2009 I gave a talk at NIPS
and I told a thousand machine learning
researchers you should all go and buy
Nvidia gpus they're the future you need
them for doing machine learning and I
actually um then sent mail to Nvidia
saying I told a thousand machine
learning researchers to buy your boards
could you give me a free one and they
said no actually they didn't say no they
just didn't reply um but when I told
Jensen this story later on he gave me a
free
one that's uh that's very very good I I
think what's interesting is um as well
is sort of how gpus has evolved
alongside the the field so where where
do you think we we should go go next in
in the in the compute so my last couple
of years at Google I was thinking about
ways of trying to make analog
computation so that instead of using
like a megawatt we could use like 30
Watts like the brain and we could run
these big language models in analog
hardware and I never made it
work and but I started really
appreciating digital computation so if
you're going to use that low power
analog
computation every piece of Hardware is
going to be a bit different and the idea
is the learning is going to make use of
the specific properties of that hardware
and that's what happens with people all
our brains are different um so we can't
then take the weights in your brain and
put them in my brain the hardware is
different the precise properties of the
individual neurons are different the learning has learned to
make use of all that and so we're mortal
in the sense that the weights in my
brain are no good for any other brain
when I die those weights are useless um
we can get information from one to
another rather
inefficiently by I produce sentences and
you figure out how to change your weight
so you would have said the same thing
that's called distillation but that's a
very inefficient way of communicating
knowledge and with digital systems
they're immortal because once you got
some weights you can throw away the
computer just store the weights on a
tape somewhere and now build another
computer put those same weights in and
if it's digital it can compute exactly
the same thing as the other system did
so digital systems can share weights and
that's incredibly much more efficient if
you've got a whole bunch of digital
systems and they each go and do a tiny
bit of
learning and they start with the same
weights they do a tiny bit of learning
and then they share their weights again
um they all know what all the others
learned we can't do that and so they're
far superior to us in being able to
share knowledge
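A toy sketch of the weight-sharing idea described above: several digital copies start from identical weights, each does a tiny bit of learning on its own shard of data, and then they average their weights so every copy knows what the others learned. The model, the data, and the sharing interval are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=300)

n_copies, lr = 3, 0.05
weights = [np.zeros(4) for _ in range(n_copies)]     # identical starting weights
shards = np.array_split(np.arange(300), n_copies)    # each copy sees different data

for round_ in range(50):
    for k in range(n_copies):                        # each copy does a tiny bit of learning
        idx = rng.choice(shards[k], size=10)
        grad = X[idx].T @ (X[idx] @ weights[k] - y[idx]) / len(idx)
        weights[k] -= lr * grad
    shared = np.mean(weights, axis=0)                # share: average the weights
    weights = [shared.copy() for _ in range(n_copies)]  # every copy now knows what the others learned

print("shared weights:", np.round(shared, 2), " true weights:", true_w)
```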
a lot of the ideas that have been deployed in the field are very
old school ideas uh it's the ideas that
have been around the Neuroscience for
forever what do you think is sort of
left to to to apply to the systems that
we develop so one big thing that we
still have to catch up with Neuroscience
on is the time scales for changes so in
nearly all the neural Nets there's a
fast time scale for changing activities
so input comes in the activities the
embedding vectors all change and then
there's a slow time scale which is
changing the weights and that's
long-term learning and you just have
those two time scales in the brain
there's many time scales at which
weights change so for example if I say
an unexpected word like cucumber and now
5 minutes later you put headphones on
there's a lot of noise and there's very
faint words you'll be much better at
recognizing the word cucumber because I
said it 5 minutes ago so where is that
knowledge in the brain and that
knowledge is obviously in temporary
changes to synapsis it's not neurons are
going cucumber cucumber cucumber you
don't have enough neurons for that it's
in temporary changes to the weights and
you can do a lot of things with
temporary weight changes fast what I
call fast weights we don't do that in
these neural models and the reason we
don't do it is because if you have
temporary changes to the weights that
depend on the input data then you can't
process a whole bunch of different cases
at the same time at present we take a
whole bunch of different strings we
stack them stack them together and we
process them all in parallel because
then we can do Matrix Matrix multiplies
which is much more efficient and just
that efficiency is stopping us using
fast weights but the brain clearly uses
fast weights for temporary memory and
there's all sorts of things you can do
that way that we don't do at present I
think that's one of the biggest things
we have to learn I was very hopeful that
things like Graphcore um if they went
sequential and did just online learning
then they could use fast weights
um but that hasn't worked out yet I
think it'll work out eventually when
people are using conductances for
weights how has knowing how this models
work and knowing how the brain works
impacted the way you you think I think
there's been one big impact which is at
a fairly abstract level which is that
for many
years people were very scornful about
the idea of having a big random neural
net and just giving a lot of training
data and it would learn to do
complicated things if you talk to
statisticians or linguists or most
people in AI they say that's just a pipe
dream there's no way you're going to
learn to really complicated things
without some kind of innate knowledge
without a lot of architectural
restrictions it turns out that's
completely wrong you can take a big
random neural network and you can learn
a whole bunch of stuff just from data um
so the idea that stochastic gradient
descent to adjust the repeatedly adjust
the weights using a gradient that will
learn things and we'll learn big
complicated things that's been validated
by these big models and that's a very
important thing to know about the brain
it doesn't have to have all this innate
structure now obviously it's got a lot
of innate structure but it certainly
doesn't need innate structure for things
that are easily
learned and so the sort of idea coming
from Chomsky that you won't you won't
learn anything complicated like language
unless it's all kind of wired in already
and just matures that idea is now
clearly nonsense I'm sure Chomsky would
appreciate you calling his ideas
nonsense well I think actually I think a
lot of Chomsky's political ideas are very
sensible and I'm was struck by how how
come someone with such sensible ideas
about the Middle East could be so wrong
about
Linguistics what do you think would make
these models simulate consciousness of
of humans more effectively but imagine
you had the AI assistant that you've
spoken to in your entire life and
instead of that being you know like chat
today that sort of deletes the memory of
the conversation and you start fresh all
of the time okay it had
self-reflection at some point you you
pass away and you tell that to to the
assistant do you think I me not me
somebody else tells that toist yeah you
would it would be difficult for you to
tell that to the assistant um do you
think that assistant would would feel at
that point yes I think they can have
feelings too so I think just as we have
this inner theater model for perception
we have an inner theater model for feelings
they're things that I can experience but
other people can't um
I think that model is equally wrong so I
think suppose I say I feel like punching
Gary on the nose which I often do let's
try and Abstract that away from the idea
of an inner theater what I'm really
saying to you is um if it weren't for
the inhibition coming from my frontal
lobes I would perform an action so when
we talk about feelings we really talking
about um actions we would perform if it
weren't for um constraints and that
really that's really what feelings are
the actions we would do if it weren't
for
constraints um so I think you can give
the same kind of explanation for
feelings and there's no reason why these
things can't have feelings in fact in
1973 I saw a robot having an emotion so
in Edinburgh they had a robot with two
grippers like this that could assemble a
toy car if you put the pieces separately
on a piece of green felt um but if you
put them in a pile its vision wasn't good enough to figure out what was going on so it went whack with its gripper and it
knocked them so they were scattered and
then it could put them together if you
saw that in a person you'd say it was cross with the situation because it
didn't understand it so it destroyed
it that's
profound you uh we spoke previously you
described sort of humans and and and and
the llms as analogy machines what do you
think has been the most powerful
analogies that you found throughout your
life oh in throughout my life um woo I
guess probably an a sort of weak analogy
that's influenced me a lot is um the
analogy between religious belief and
between belief in symbol
processing so when I was very young I
was confronted I came from an atheist
family and went to school and was
confronted with religious belief and it
just seemed nonsense to me it still
seems nonsense to me um and when I saw
symbol processing as an explanation how
people worked um I thought it was just
the same
nonsense I don't think it's quite so
much nonsense now because I think
actually we do do symbol processing it's
just we do it by giving these big
embedding vectors to the symbols but we
are actually symbol processing um but
not at all in the way people thought
where you match symbols and the only
thing is symbol has is it's identical to
another symbol or it's not identical
that's the only property a symbol has we
don't do that at all we use the context
to give embedding vectors to symbols and
then use the interactions between the
components of these embedding vectors to
do thinking but there's a very good
researcher at Google called Fernando
Pereira who said yes we do have symbolic
reasoning and the only symbolic stuff we have
is natural language natural language is
a symbolic language and we reason with
it and I believe that now you've done
some of the most meaningful uh research
in the history of of computer science
can you walk us through like how do you
select the right problems to to work on
well first let me correct you me and my
students have done a lot of the most
meaningful things and it's mainly been a
very good collaboration with students
and my ability to select very good
students and that came from the fact
that were very few people doing neural
Nets in the 70s and 80s and 90s and
2000s and so the few people doing neural
nets got to pick the very best students
so that was a piece of luck but my way
of selecting problems is
basically well you know when scientists
talk about how they work they have
theories about how they work which
probably don't have much to do with the
truth but my theory is that
I look for something where everybody's
agreed about something and it feels
wrong just there's a slight intuition
there's something wrong about it and
then I work on that and see if I can
elaborate why it is I think it's wrong
and maybe I can make a little demo with
a small computer program that shows that
it doesn't work the way you might expect
so let me take one example um most
people think that if you add noise to a
neural net is going to work worse um if
for example each time you put a training
example through
you make half of the neurons be silent
it'll work worse actually we know it'll
generalize better if you do that
and you can demonstrate that um in a
simple example that's what's nice about
computer simulation you can show you
know this idea you had that adding noise
is going to make it worse and sort of
dropping out half the neurons will make
it work worse which you will in the
short term but if you train it with like
that in the end it'll work better you
can demonstrate that with a small
computer program and then you can think
hard about why that is and how it stops
big elaborate co-adaptations um but
that I think that that's my method of
working find something that sounds
suspicious and work on it and see if you
can give a simple demonstration of why
it's wrong
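The dropout demonstration described above is exactly the kind of small program he mentions. A hedged sketch (the toy regression task, network width, and training settings are my own choices): randomly silence half the hidden units on every training pass, use all of them at test time, and compare held-out error with and without that noise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny, over-parameterized regression problem: 30 noisy points, many hidden units.
X = torch.linspace(-1, 1, 30).unsqueeze(1)
y = torch.sin(3 * X) + 0.3 * torch.randn_like(X)
X_test = torch.linspace(-1, 1, 200).unsqueeze(1)
y_test = torch.sin(3 * X_test)

def train_and_evaluate(p_drop):
    model = nn.Sequential(
        nn.Linear(1, 256), nn.ReLU(),
        nn.Dropout(p=p_drop),              # randomly silence units on each training pass
        nn.Linear(256, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    model.eval()                           # at test time all units are used (activations rescaled)
    with torch.no_grad():
        return nn.functional.mse_loss(model(X_test), y_test).item()

print("held-out error, no dropout:  ", round(train_and_evaluate(0.0), 4))
print("held-out error, 50% dropout: ", round(train_and_evaluate(0.5), 4))
```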
what sounds suspicious to you now well that we don't use fast weights
sounds suspicious that we only have
these two time scales that's just wrong
that's not at all like the brain um and
in the long run I think we're going to
have to have many more time scales so
that's an example there and if you had
if you had your group of of students
today and they came to you and they said
so the Hamming question that we talked
about previously you know what's the
most important problem in in in your
field what would you suggest that they
take on and work on on next we spoke
about reasoning time scales what would
be sort of the highest priority Problem
that that you'd give them for me right
now it's the same question I've had for
the last like 30 years or so which is
does the brain do back propagation I
believe the brain is getting gradients
if you don't get gradients your learning
is just much worse than if you do get
gradients but how is the brain getting
gradients and is it
somehow implementing some approximate
version of back propagation or is it
some completely different technique
that's a big open question and if I kept
on doing research that's what I would be
doing research on and when you look back
at at your career now you've been right
about so many things but what were you
wrong about that you wish you sort of
spent less time pursuing a certain
direction okay those are two separate
questions one is what were you wrong
about and two do you wish you'd less
spent less time on it I think I was
wrong about Boltzmann machines and I'm glad I spent a long time on it it's a much more beautiful theory of how you get gradients than back propagation back propagation is just ordinary and sensible and it's just the chain rule Boltzmann machines is clever and it's a very
interesting way to get gradients and I
would love for that to be how the brain
works but I think it isn't did you spend
much time imagining what would happen
post the systems developing as as well
did you have an idea that okay if we
could make these systems work really
well we could you know democratize
education we could make knowledge way
more accessible um we could solve some
tough problems in in in medicine or was
it more to you about understanding the
Brin yes I I sort of feel scientists
ought to be doing things that are going
to help Society but actually that's not
how you do your best research you do
your best research when it's driven by
curiosity you just have to understand
something um much more recently I've
realized these things could do a lot of
harm as well as a lot of good and I've
become much more concerned about the
effects they're going to have on society
but that's not what was motivating me I
just wanted to understand how on Earth
can the brain learn to do things that's
what I want to know and I sort of failed
as a side effect of that failure we got
some nice engineering
but yeah it was a good good good failure
for the world if you take the lens of
the things that could go really right
what what do you think are the most
promising
applications I think Health Care is
clearly uh a big one um with Health Care
there's almost no end to how much Health
Care Society can absorb if you take
someone old they could use five doctors
fulltime um so when AI gets better than
people at doing things um you'd like it
to get better in areas where you could
do with a lot more of that stuff and we
could do with a lot more doctors if
everybody had three doctors of their own
that would be great and we're going to
get to that point um so that's one
reason why Healthcare is good there's
also just a new engineering developing
new materials for example for better
solar panels or for superconductivity
or for just understanding how the Body
Works um there's going to be huge
impacts there those are all going to be
be good things what I worry about is Bad
actors using them for bad things we've
facilitated people like Putin or Xi or
Trump
using AI for Killer Robots or for
manipulating public opinion or for Mass
surveillance and those are all very
worrying things are you ever concerned
that slowing down the field could also
slow down the positives oh absolutely
and I think there's not much chance that
the field will slow down partly because
it's International and if one country
slows down the other countries aren't
going to slow down so there's a race
clearly between China and the US and
neither is going to slow down so yeah I
don't I mean there was this petition
saying we should slow down for six
months I didn't sign it just because I
thought it was never going to happen I
maybe should have signed it because even
though it was never going to happen it
made a political point it's often good
to ask for things you know you can't get
just to make a point um but I didn't
think we're going to slow down and how
do you think that it will impact the AI
research process uh having uh this
assistance so I think it'll make it a
lot more efficient AI research will get a
lot more efficient when you've got these
assistants that help you program um but
also help you think through things and
probably help you a lot with equations
too have you reflected much on the
process of selecting Talent has that
been mostly intuitive to you like when
Ilia shows up at the door you feel this
is smart guy let's work together so for
selecting Talent um sometimes you just
know so after talking to Ilia for not
very long he seemed very smart and then
talking to him a bit more he clearly was
very smart and had very good intuitions
as well as being good at math so that
was a no-brainer there's another case
where I was at a NIPS conference um we
had a poster and someone came up and
he started asking questions about the
poster and every question he asked was a
sort of deep insight into what we'd done
wrong um and after 5 minutes I offered
him a postdoc position that guy was David MacKay who was just brilliant and it's
very sad he died but he was it was very
obvious you'd want him um other times
it's not so obvious and one thing I did
learn was that people are different
there's not just one type of good
student um so there's some students who
aren't that creative but are technically
extremely strong and will make anything
work there's other students who aren't
technically strong but are very creative
of course you want the ones who are both
but you don't always get that but I
think actually in the lab you need a
variety of different kinds of graduate
student but I still go with my gut
intuition that sometimes you talk to
somebody and they're just very very they
just get it and those are the ones you
want what do you think is the reason for
some folks having better intuition do
they just have better training data than
than others or how can you develop your
intuition I think it's partly they don't
stand for nonsense so here's a way to
get bad intuitions believe everything
you're told that's fatal you have to be
able to I think here's what some people
do they have a whole framework for
understanding reality and when someone
tells them something they try and sort
of figure out how that fits into their
framework and if it doesn't they just
reject it and that's a very good
strategy um people who try and
incorporate whatever they're told end up
with a framework that's sort of very
fuzzy and sort of can believe everything
and that's useless so I think actually
having a strong view of the world and
trying to manipulate incoming facts to
fit in with your view obviously it can
lead you into deep religious belief and
fatal flaws and so on like my belief in
boltzman machines um but I think that's
the way to go if you got good intuitions
you can trust you should trust them if
you got bad intuitions it doesn't matter
what you do so you might as well trust
them a very very good very good point
when when you look at the the types of
research that's that's that's being done
today do you think we're putting all of
our eggs in one basket and we should
diversify our ideas a bit more in in the
field or do you think this is the most
promising Direction so let's go all in
on it
I think having big models and training
them on multimodal data even if it's
only to predict the next word is such a
promising approach that we should go
pretty much all in on it obviously
there's lots and lots of people doing it
now and there's lots of people doing
apparently crazy things and that's good
um but I think it's fine for like most
of the people to be following this path
because it's working very well do you
think that the learning algorithms
matter that much or is it just a skill
are there basically millions of ways
that we could we could get to human
level in in intelligence or are there
sort of a select few that we need to
discover yes so this issue of whether
particular learning algorithms are very
important or whether there's a great
variety of learning algorithms that'll
do the job I don't know the answer it
seems to me though that back propagation
there's a sense in which it's the
correct thing to do getting the gradient
so that you change a parameter to make
it work better that seems like the right
thing to do and it's been amazingly
successful there may well be other
learning algorithms that are alternative
ways of getting that same gradient or
that are getting the gradient to
something else and that also work
um I think that's all open and a very
interesting issue now about whether
there's other things you can try and
maximize that will give you good systems
and maybe the brain's doing that because
it's
easier but backprop is in a sense the
right thing to do and we know that doing
it works really
well and one last question when when you
look back at your sort of Decades of
research what are you what are you most
proud of is it the students is it the
research what what makes you most proud
of when you look back at at your life's
work the learning algorithm for
Boltzmann machines so the learning algorithm for Boltzmann machines is
beautifully elegant it's maybe hopeless
in practice um but it's the thing I
enjoyed most developing that with Terry
and it's what I'm proudest of um even if
it's
wrong what questions do you spend most
of your time thinking about now is it
the um what what should I watch on
Netflix