AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs
Summary
TLDR: Three developments in the last 48 hours show that AI models are moving into an era in which they can not only talk but actually execute tasks. The three are the Devin AI system, Google DeepMind's SIMA, and a humanoid robot powered by GPT-4 Vision. Although each of these systems is still some way from human performance in its domain, they act as containers or shells around powerful vision-language models, so when GPT-4 is swapped out for its successor, they could see large and hard-to-predict improvements. Devin performs strongly on a software engineering benchmark, and SIMA shows positive transfer learning across multiple games. These advances point to the huge potential of AI and raise deep questions about the future of work.
Q & A
What kind of AI system is Devin?
-Devin is an AI system likely based on GPT-4. It is equipped with a code editor, shell, and browser, and can understand a user's prompt and carry out tasks such as reading documentation, making a plan, and executing code.
How does Devin perform on the software engineering benchmark?
-Devin outperformed other models on the SWE-bench software engineering benchmark, scoring almost 14%, while models such as GPT-4 scored as little as 1.7%. Devin was unassisted in the test, whereas the other models were assisted.
What is the main goal of the SIMA system?
-SIMA's main goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment.
How does SIMA perform across multiple games?
-SIMA performs well across many games, approaching human-level performance in some. Its transfer to new games is strong, sometimes even outperforming models trained specifically for a single game.
Where does the humanoid robot's intelligence come from?
-The humanoid robot's intelligence comes from the GPT-4 Vision model, which recognizes the objects on the table and moves them appropriately.
What is the estimated cost of the humanoid robot?
-The humanoid robot is estimated to cost between $30,000 and $150,000, still too expensive for most companies and individuals.
What is the vision of the humanoid robot company's CEO?
-The CEO wants to fully automate manual labor, build the largest company on the planet, and eliminate the need for unsafe and unpleasant jobs. He also predicts that, from factories to farmland, the cost of labor will fall until it equals the price of renting a robot.
What is Jeff Clune's prediction about AGI?
-Jeff Clune predicts that as AI models develop we are getting closer to AGI (artificial general intelligence), and that no one is really in control of it all.
What does Nvidia's CEO predict about the future of AI?
-Nvidia's CEO predicts that AI will pass every human test within about five years.
What does Sam Altman think AGI means?
-Sam Altman believes AGI will mean that 95% of the work in marketing now handled by agencies, strategists, and creative professionals will be done by AI easily, instantly, and at almost no cost, and that AI will also be able to test its creative output, predict results, and optimize.
How fast is AI currently advancing?
-According to SemiAnalysis estimates, compute will increase 14x between Q1 2024 and Q4 2025. Factoring in algorithmic efficiency doubling roughly every nine months, effective compute at the end of next year will be almost 100 times today's.
Outlines
🤖 Practical advances in AI models
This section discusses three developments from the last 48 hours that show AI models moving from theory into practice. It first introduces Devin, an AI system based on GPT-4 that is equipped with a code editor and browser and can understand prompts and execute tasks. It then covers Google DeepMind's SIMA, a system that learns tasks by playing games. Finally, it mentions a robot powered by GPT-4 Vision that can carry out tasks such as doing the dishes. Although each of these systems is still some way from human performance in its domain, they are best thought of as containers or shells that future, more advanced models such as GPT-5 or Gemini 2 will power.
🔍 Devin's benchmark results and outlook
This section digs into the Devin system's performance, particularly on the SWE-bench software engineering benchmark. Devin's ability to solve real software engineering problems is markedly better than that of other models such as GPT-4 and Claude 2. However, the benchmark covers only a subset of GitHub issues and may be biased toward problems that are easier to detect and fix. The author predicts that Devin's performance will improve significantly once GPT-5 arrives. The section also discusses Devin completing a real job on Upwork and the potential implications for the future of software engineering careers.
🎮 Games meet multimodal learning
This section focuses on Google DeepMind's SIMA, a scalable, instructable, multiworld agent driven by natural-language instructions. SIMA's goal is an instructable agent that can do anything a human can in any simulated 3D environment. By playing a range of games and learning from human players, SIMA shows positive transfer to new games, sometimes outperforming models trained specifically for a single game even without game-specific training. This demonstrates the power of multimodal learning and task generalization.
👨‍🚀 Robotics and the future of labor
The final section discusses a robot powered by GPT-4 Vision that can recognize the objects on a table and move them appropriately. The robot's intelligence comes from the underlying model and could in future be upgraded to GPT-5 for a deeper understanding of its environment. The section also covers the impact of robotics on the labor market, including a vision of fully automating manual labor and the uncertainty around the future of human work. The author stresses that despite the scale of these changes, no one is currently fully in control of where AI is headed.
Transcripts
Three developments in the last 48 hours show how we are moving into an era in which AI models can walk the walk, not just talk the talk. Whether the developments quite meet the hype attached to them is another question. I've read and analyzed in full the three relevant papers and associated posts to find out more. We'll first explore Devin, the AI system your boss told you not to worry about; then Google DeepMind's SIMA, which spends most of its time playing video games; and then Figure 01, the humanoid robot which likes to talk while doing the dishes. But the TLDW is this: these three systems are each a long way from human performance in their domains, but think of them more as containers or shells for the vision-language models powering them. So when the GPT-4 that's behind most of them is swapped out for GPT-5 or Gemini 2, all these systems are going to see big and hard-to-predict upgrades overnight. And that's a point that seems especially relevant on this, the one-year anniversary of the release of GPT-4. But let's start, of course, with
Devin, billed as the first AI software engineer. Now, Devin isn't a model; it's a system that's likely based on GPT-4. It's equipped with a code editor, shell, and browser, so it can not just understand your prompt but look up and read documentation. A bit like AutoGPT, it's designed to come up with plans first and then execute them, but it does so much better than AutoGPT did. Before we get to the benchmark that everyone's talking about, let me show you a 30-second demonstration of Devin in action: "All I had to do was send this blog post in a message to Devin. From there, Devin actually does all the work for me, starting with reading this blog post and figuring out how to run the code. In a couple of minutes Devin's actually made a lot of progress, and if we jump to the middle here, you can see that Devin's been able to find and fix some edge cases and bugs that the blog post did not cover for me. And if we jump to the end, we can see that Devin sends me the final result, which I love. I also got two bonus images, here and here, so let me know if you guys see anything hidden in these." It can also fine-tune a model autonomously; if you're not familiar, think of that as refining a model rather than training it from scratch. That makes me wonder about a future where, if a model can't succeed at a task, it fine-tunes another model, or itself, until it can. Anyway, this is the
benchmark that everyone's talking about: SWE-bench, the software engineering benchmark. Devin got almost 14% and, in this chart, crushes Claude 2 and GPT-4, which got as little as 1.7%. They say Devin was unassisted, whereas all other models were assisted, meaning the model was told exactly which files needed to be edited. Before we get too much further, though, what the hell is this benchmark? Well, unlike many benchmarks, it drew from real-world professional problems: 2,294 software engineering issues that people actually had, and their corresponding solutions. Resolving these issues requires understanding and coordinating changes across multiple functions, classes, and files simultaneously. The code involved might require the model to process extremely long contexts and perform, they say, complex reasoning. These aren't just fill-in-the-blank or multiple-choice questions: the model has to understand the issue, read through the relevant parts of the codebase, remove lines and add lines. Fixing a bug might involve navigating a large repo, understanding the interplay between functions in different files, or spotting a small error in convoluted code. On average, a model might need to edit almost two files, three functions, and about 33 lines of code. One point to make clear is that Devin was only tested on a subset of this benchmark, and the tasks in the benchmark were only a tiny subset of GitHub issues; even all of those issues represent just a subset of the skills of software engineering. So when you see all-caps videos saying this is AGI, you've got to put it in some context.
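To make the pass/fail criterion concrete, here is a minimal sketch of a SWE-bench-style task and its resolution check. The field names, the example issue, and the test names are all illustrative assumptions of mine, not the dataset's actual schema; real instances reference a git repo, a base commit, and concrete test suites.

```python
from dataclasses import dataclass

# Hypothetical, simplified shape of one SWE-bench-style task instance.
@dataclass
class SWEBenchTask:
    issue_text: str            # the GitHub issue the model must resolve
    files_to_edit: list        # hints given to assisted models, hidden from unassisted ones
    fail_to_pass_tests: list   # tests that fail before the fix and must pass after

def resolved(task: SWEBenchTask, test_results: dict) -> bool:
    """A task counts as resolved only if every fail-to-pass test now passes."""
    return all(test_results.get(t) is True for t in task.fail_to_pass_tests)

task = SWEBenchTask(
    issue_text="TypeError when merging empty frames",   # made-up example issue
    files_to_edit=["core/merge.py"],
    fail_to_pass_tests=["test_merge_empty", "test_merge_dtype"],
)

# A patch that fixes one test but not the other does not count as resolved.
print(resolved(task, {"test_merge_empty": True, "test_merge_dtype": False}))  # False
print(resolved(task, {"test_merge_empty": True, "test_merge_dtype": True}))   # True
```

The all-or-nothing check is why partial fixes earn no credit on this benchmark, and part of why headline scores stay low.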
Here's just one example of what I mean. They selected only pull requests (which are like proposed solutions) that were merged, that solved the issue, and that introduced new tests. Would that not slightly bias the dataset toward problems that are easier to detect, report, and fix? In other words, complex issues might not be adequately represented if they're less likely to have straightforward solutions, and narrowing down the proposed solutions to only those that introduce new tests could bias towards bugs or features that are easier to write tests for. That is to say, highly complex issues where writing a clear test is difficult may be underrepresented. Now, having said all of that, I might shock you by saying I think there will be rapid improvement in performance on this benchmark when Devin is equipped with GPT-5; I could see it easily exceeding 50%. Here are just a few reasons why. First, some of these problems contain images, so the more multimodal these language models get, the better they'll do. Second, and more importantly, a large context window is particularly crucial for this task. When the benchmark came out, they said models are simply ineffective at localizing problematic code in a sea of tokens; they get distracted by additional context. I don't think that will be true for much longer, as we've already seen with Gemini 1.5. Third, models, they say, are often trained on standard code files and likely rarely see patch files; I would bet that GPT-5 will have seen everything. Fourth, language models will be augmented, they predict, with program analysis and software engineering tools, and it's almost like they could see six months into the future, because they said: "to this end, we are particularly excited about agent-based approaches", like Devin, "for identifying relevant context from a codebase". I could go on, but hopefully that background on the benchmark allows you to put the rest of what I'm going to say in a bit more context. And yes, of course I saw how Devin was able to complete a real job on Upwork; honestly, I could see these kinds of tasks going the way of copywriting tasks on Upwork. Here's some more context, though: we don't know the actual cost of running Devin, and it takes quite a while for it to execute on its tasks; we're talking 15, 20, 30 minutes, even 60 minutes sometimes.
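A quick, purely illustrative sketch of why long agent runs matter for cost. Every number below is an assumption of mine (Cognition AI has published no pricing or token figures); the point is only that an hour-long agent loop multiplies per-call API costs.

```python
# All figures are hypothetical placeholders, not measured values.
assumed_tokens_per_minute = 20_000   # agent loops repeatedly re-read large contexts
assumed_price_per_1k_tokens = 0.01   # hypothetical blended GPT-4-class rate, in dollars
run_minutes = 60                     # the longer end of the run times quoted above

agent_cost = run_minutes * assumed_tokens_per_minute / 1000 * assumed_price_per_1k_tokens
human_cost = 30.0                    # one hour at an assumed $30/hr freelance rate

print(f"agent run: ${agent_cost:.2f}, human hour: ${human_cost:.2f}")
# Under these assumptions the agent is cheaper, but a higher token price or a few
# failed retries flips the comparison, which is the criticism quoted next.
```

Swap in real token counts and prices as they become known; the comparison is sensitive to both.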
As Bindu Reddy points out, it can get even more expensive than a human, although costs are, of course, falling. Devin, she says, will not be replacing any software engineer in the near term. And noted deep learning author François Chollet predicted this: there will be more software engineers (the kind that write code) in five years than there are today. And newly unemployed Andrej Karpathy says that software engineering is on track to change substantially, with humans more supervising the automation, pitching in high-level commands, ideas, or progression strategies in English. I would say, with the way things are going, they could pitch in any language and the model will understand; frankly, with vision models the way they are, you could practically mime your code idea and it would understand what to do. And while Devin likely relies on GPT-4, other competitors are training their own frontier-scale models. Indeed, the startup Magic, which aims to build a co-worker, not just a co-pilot, for developers, is going a step further. They're not even using Transformers; they say Transformers aren't the final architecture, and that they have something with a multi-million-token context window. Super curious, of course, how that performs on SWE-bench. But the thing I want to emphasize again comes from Bloomberg: Cognition AI admit that Devin is very dependent on the underlying models and uses GPT-4 together with reinforcement learning techniques. Obviously that's pretty vague, but imagine when GPT-5 comes out: with scale you get so many things, not just better coding ability. If you remember, GPT-3 couldn't actually reflect effectively, whereas GPT-4 could. If GPT-5 is twice or ten times better at reflecting and debugging, that is going to dramatically change the performance of the Devin system overnight; just delete the GPT-4 API and put in the GPT-5 API. And wait, Jeff Clune, who I was going to talk about later in this video, has just retweeted one of my own videos. I literally just saw this two seconds ago when it came up as a notification on my Twitter account. This was not at all supposed to be part of this video, but I am very much honored by that. And actually, I'm going to be talking about Jeff Clune later in this video, so chances are he's going to see this video; this is getting very Inception-like. He was key to SIMA, which I'm going to talk about next. The simulation hypothesis just got 10% more likely. I'm going to recover from that distraction and get back to this video, because there's one more thing to mention about Devin: the reaction to it has been unlike almost anything I've seen. People are genuinely in some distress about the implications for jobs, and while I've given the context of what the benchmark does and doesn't mean, I can't deny that the job landscape is incredibly unpredictable at the moment.
Indeed, I can't see it ever not being unpredictable. I actually still have a lot of optimism about there being a human economy in the future, but maybe that's a topic for another video. I just want to acknowledge that people are scared, and these companies should start addressing those fears. And I know many of you are getting ready to comment that we want all jobs to go, but you might be, I guess, disappointed by the fact that Cognition AI are asking for people to apply to join them, so they obviously don't anticipate Devin automating everything just yet. But it's time now to talk about Google DeepMind's SIMA, which is all about scaling up agents that you can instruct with natural language: essentially a scalable, instructable, multiworld agent. The goal of SIMA is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Their agent uses a mouse and keyboard and takes pixels as input, but if you think about it, that's almost everything you do on a computer. Yes, this paper is about playing games, but couldn't you apply this technique to, say, video editing, or anything you can do on your phone?
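The input/output contract just described (pixels in, keyboard-and-mouse actions out) can be sketched as a tiny interface. The names and the dummy policy below are my own illustration; the paper does not publish SIMA's architecture or action schema.

```python
from dataclasses import dataclass

@dataclass
class Action:
    keys: list          # keys to hold this step, e.g. ["w"] to walk forward
    mouse_dx: int       # horizontal mouse movement in pixels
    mouse_dy: int       # vertical mouse movement in pixels

def agent_step(frame_pixels, instruction: str) -> Action:
    """One control step: map the current screen plus a language instruction
    to the same low-level inputs a human player would produce."""
    # Dummy policy for illustration; a real agent would run a
    # vision-language model over the frame here.
    if "turn left" in instruction:
        return Action(keys=[], mouse_dx=-40, mouse_dy=0)
    return Action(keys=["w"], mouse_dx=0, mouse_dy=0)

act = agent_step(frame_pixels=None, instruction="turn left and find the barn")
print(act.keys, act.mouse_dx)   # [] -40
```

Because nothing in this interface is game-specific, the same contract would cover a video editor or a phone screen, which is exactly the generality being argued for.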
Now, I know I haven't even told you what the SIMA system is, but I'm giving you an idea of the kinds of repercussions and implications: if these systems work with games, there's so much else they might soon work with. There was a paper I didn't get a chance to talk about that came out about six weeks ago; it showed that even current-generation models could handle tasks on a phone, like navigating in Google Maps, downloading apps on Google Play, or, somewhat topically, swiping to a video about a pet cat in TikTok and clicking 'like' for that video. No, the success rates weren't perfect, but if you look at the averages (and this is for GPT-4 Vision) they are pretty high: 91%, 82%, 82%. These numbers in the middle, by the way, reflect on the left the number of steps that GPT-4 Vision took and on the right the number of steps that a human took. And that's just GPT-4 Vision, not a model optimized for agency, which we know OpenAI is working on. So before we even get to video games, you can imagine an internet where there are models downloading, liking, commenting, doing pull requests, and we wouldn't even know that it's AI; it would be, as far as I can tell, undetectable. Anyway, I'm getting distracted. Back to the SIMA paper: what
is SIMA? In a nutshell, they got a bunch of games, including commercial video games like Valheim (12 million copies sold, at least) and made-up games that Google created. They then paid a bunch of humans to play those games and gathered the data: that's what you can see on the screen, the images and the keyboard and mouse inputs that the humans performed. They gave all of that training data to some pre-trained models, and at this point the paper gets quite vague; it doesn't mention parameters or the exact composition of these pre-trained models. But from this we get the SIMA agent, which then plays these games, or more precisely tries 10-second tasks within these games. This gives you an idea of the range of tasks: everything from taming and hunting to destroying and headbutting. But I don't want to bury the lede; the main takeaway is this: training on more games produced positive transfer when SIMA played a new game. Notice how SIMA, in purple, across all of these games outperforms an environment-specialized agent, that is, one trained for just one game. And there is another gem buried in this graph. I'm colorblind, but I'm pretty sure that's teal or lighter blue: that's zero-shot. What that represents is the model trained across all the other games but not the actual game it was about to be tested in. Notice how in some games, like Goat Simulator 3, that outperformed a model specialized for just that one game; the transfer effect was so powerful it outdid the specialized training. Indeed, SIMA's performance is approaching the ballpark of human performance. Now, I know
we've seen that already with StarCraft II and OpenAI beating Dota, but this would be a model generalizing to almost any video game. Yes, even Red Dead Redemption 2, which was covered in an entirely separate paper out of Beijing. That paper, they say, was the first to enable language models to follow the main storyline and finish real missions in complex AAA games; this time we're talking about things like protecting a character, buying supplies, equipping shotguns. Again, what was holding them back was the underlying model, GPT-4V. As I've covered elsewhere on the channel, it lacks in spatial perception; it's not super accurate with moving the cursor, for example. But visual understanding and performance are getting better fast. Take the challenging benchmark MMMU, which is about answering difficult questions that have a visual component. The benchmark only came out recently, giving top performance to GPT-4V at 56.8%, but that's already been superseded: Claude 3 Opus gets 59.4%. Yes, there is still a gap with human expert performance, but that gap is narrowing. Like we've seen across this video, just as Devin was solving real-world software engineering challenges, SIMA and other models are solving real-world games, walking the walk, not just talking the talk. And again, we can expect better and better results the more games SIMA is trained on; as the paper says, "in every case, SIMA significantly outperforms the environment-specialized agent, thus demonstrating positive transfer across environments". And this is
exactly what we see in robotics as well. The key take-home from that Google DeepMind paper was that their results suggest that co-training with data from other platforms imbues RT-2-X, in robotics, with additional skills that were not present in the original dataset, enabling it to perform novel tasks. These were tasks and skills developed by other robots that were then transferred to RT-2-X, just like SIMA getting better at one video game by training on others. But did you notice there that smooth segue I did to robotics? It's the final container that I want to quickly talk about. Why do I call this humanoid robot a container? Because it contains GPT-4 Vision. Yes, of course, its real-time speed and dexterity are very impressive, but the intelligence of recognizing what's on the table and moving it appropriately comes from the underlying model, GPT-4 Vision. So of course I have to make the same point: if the underlying model were upgraded to GPT-5 when it comes out, this humanoid would have a much deeper understanding of its environment, and of you, as you're talking to it. Figure 01 takes in 10 images per second, and this is not teleoperation; this is an end-to-end neural network. In other words, there's no human behind the scenes controlling this robot.
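The split the demo implies can be sketched as a two-rate loop: a slow vision-language model does the high-level reasoning over camera frames and speech, while a fast learned policy emits motor commands. Figure has not published the system's internals, so every name, rate, and stand-in function below is an assumption for illustration only.

```python
def capture_frame():
    return "frame"      # stand-in for one camera image

def vlm_plan(frames, speech: str) -> str:
    # Stand-in for a GPT-4V-class call: slow, invoked only when needed.
    return "pick_up_apple" if "hungry" in speech else "wait"

def policy_step(frame, plan: str) -> dict:
    # Stand-in for the fast end-to-end network producing joint targets.
    return {"plan": plan, "joints": [0.0] * 7}

plan = vlm_plan([capture_frame()], speech="I'm hungry")
for _ in range(10):                  # the demo cites roughly 10 images per second
    cmd = policy_step(capture_frame(), plan)
    # a real controller would pace this loop, e.g. time.sleep(0.1) for 10 Hz

print(plan, len(cmd["joints"]))      # pick_up_apple 7
```

The design point is that swapping the model inside `vlm_plan` upgrades the robot's reasoning without touching the fast control loop, which is exactly the container argument made above.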
Figure don't release pricing, but the estimate is between $30,000 and $150,000 per robot: still too pricey for most companies and individuals. But the CEO has a striking vision; he basically wants to completely automate manual labor. "This is the roadmap to a positive future powered by AI." He wants to build the largest company on the planet and eliminate the need for unsafe and undesirable jobs. The obvious question is: if it can do those jobs, can't it also do the safe and desirable jobs? I know I'm back to the jobs point again, but all of these questions became a bit more relevant, let's say, in the last 48 hours. The Figure CEO goes on to predict that everywhere from factories to farmland, the cost of labor will decrease until it becomes equivalent to the price of renting a robot, facilitating a long-term, holistic reduction in costs. Over time, humans could leave the loop altogether as robots become capable of building other robots, driving prices down even more; manual labor, he says, could become optional. And if that's not a big enough vision for the next two decades, he goes on that the plan is also to use these robots to build new worlds on other planets. Again, though, we get the reassurance that the focus is on providing resources for jobs that humans don't want to perform, and he also excludes military applications. I just feel like his company, and the world, has a bit less control over how the technology is going to be used than he might think it does.
Indeed, Jeff Clune (of OpenAI, Google DeepMind SIMA, and earlier-in-this-video fame) reposted this from Edward Harris: a report commissioned by the US government that Harris worked on, whose TLDR was that things are worse than we thought and nobody's in control. I definitely feel we're noticeably closer to AGI this week than we were last week. As Jeff Clune put out yesterday, so many pieces of the AGI puzzle are coming together, and I would also agree that, as of today, no one's really in control. And we're not alone, with Jensen Huang, the CEO of Nvidia, saying that AI will pass every human test in around five years' time. That, by the way, is a timeline shared by Sam Altman. This is a quote from a book that's coming out soon; he was asked what AGI means for marketers. He said: "Oh, for that it will mean that 95% of what marketers use agencies, strategists, and creative professionals for today will easily, nearly instantly, and at almost no cost be handled by the AI. And the AI will likely be able to test its creative outputs against real or synthetic customer focus groups for predicting results and optimizing. Again, all free, instant, and nearly perfect. Images, videos, campaign ideas? No problem." But specifically on timelines, when asked when AGI will be a reality, he said: "Five years, give or take, maybe slightly longer, but no one knows exactly when or what it will mean for society." And it's not like that timeline is even unrealistic in terms of compute. Using these estimates from SemiAnalysis, I calculated that just between Q1 of 2024 and Q4 of 2025 there will be a 14x increase in compute. If you then factor in algorithmic efficiency doubling about every nine months, the effective compute at the end of next year will be almost 100 times that of right now. So yes, the world is changing, and changing fast, and the public really needs to start paying attention. But no, Devin is not AGI, no matter how much you put it in all caps. Thank you so much for watching to the end. Of course, I'd love to see you over on AI Insiders on Patreon, but regardless, thank you so much for watching and, as always, have a wonderful day.
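As a footnote, the compute arithmetic near the end (a 14x hardware increase compounded with algorithmic-efficiency doublings) can be checked with a quick sketch. The 14x figure is the video's reading of SemiAnalysis estimates; the 9-month doubling and the roughly 24-month window (start of 2024 to end of 2025) are the video's stated assumptions.

```python
# Back-of-envelope check of the "almost 100x effective compute" claim.
hardware_multiplier = 14.0       # 14x more compute, Q1 2024 to Q4 2025
months = 24                      # rough length of that window
algo_doublings = months / 9      # efficiency assumed to double every ~9 months
algo_multiplier = 2 ** algo_doublings   # about 6.35x from algorithms alone

effective = hardware_multiplier * algo_multiplier
print(round(algo_multiplier, 2), round(effective, 1))   # roughly 6.35 and 88.9
```

Around 89x lands in "almost 100 times" territory; the exact figure moves with the assumed window and doubling period, which is why the claim is hedged.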