AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs

AI Explained
14 Mar 2024 · 19:21

Summary

TLDR: The video covers three key AI developments suggesting an era in which AI models don't just talk but also carry out complex tasks. It discusses systems such as Devin, an AI programmer; Google DeepMind's SIMA, a multi-game agent; and a language-capable humanoid robot. Despite their potential, these systems do not yet match human performance, and future upgrades, such as the arrival of GPT-5, are expected to bring significant performance gains, with profound implications for employment and automation.

Takeaways

  • 🤖 AI models are evolving to not only talk but also execute complex tasks.
  • 🚀 Systems like Devin, SIMA, and humanoid robots are improving rapidly but still have a long way to go to reach human performance in their respective domains.
  • 🔄 Advances in Devin, a software engineering system built on GPT-4, suggest that moving to GPT-5 or Gemini 2 could significantly improve its performance.
  • 📈 The software engineering benchmark used to evaluate Devin draws on real-world problems, yet it still does not represent the full skill set of a software engineer.
  • 💡 Devin's ability to complete tasks on Upwork points to a future where such tasks could be automated, although cost and execution time remain factors to consider.
  • 🎮 Google DeepMind's SIMA shows that, with natural-language instruction and deep learning, AI agents can be scalable and capable of performing a variety of tasks in simulated 3D environments.
  • 🕹️ Current AI models, such as GPT-4 Vision, are getting better at handling tasks in video games and mobile apps, approaching human performance.
  • 🤖 Humanoid robots with visual recognition and manipulation capabilities are in development, and their underlying intelligence could improve drastically with the arrival of models like GPT-5.
  • 🌐 The Figure CEO's vision of automating manual labor and eliminating the need for dangerous, undesirable jobs is ambitious, but it also raises questions about the impact on employment and control over how the technology is used.
  • 🔮 Predictions about the development of artificial general intelligence (AGI) suggest a near future where technological advances could significantly impact society and human employment.
  • 🌐 Concern about control and ethics in AI development is an important issue that requires public attention and discussion to ensure responsible use of the technology.

Q & A

  • What is Devin, and how does it differ from AutoGPT?

    -Devin is a GPT-4-based system that not only understands instructions but can also read documentation and develop action plans, executing them far more effectively than AutoGPT did.

  • What is the software engineering benchmark Devin was evaluated on?

    -The benchmark (SWE-bench) is drawn from real-world professional problems: 2,294 software engineering issues and their corresponding solutions. Resolving them demands understanding and coordinating changes across multiple functions, classes, and files simultaneously.
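A benchmark of this shape can be sketched as a loop that writes the model's proposed fix into a repository checkout and then runs the hidden test introduced by the accepted pull request. This is a toy illustration, not the benchmark's actual harness; `generate_patch`, the task dict, and the one-file "repo" are all hypothetical stand-ins.

```python
import os
import subprocess
import sys
import tempfile

def generate_patch(issue_text: str) -> str:
    # Stand-in for the model call: a real harness would send the issue
    # and the relevant source files to the underlying model (e.g. GPT-4)
    # and receive a patch back.
    return "fixed = True\n"

def evaluate_task(task: dict) -> bool:
    """Write the model's 'patch', then run the hidden test that the
    accepted pull request introduced; resolved = the test passes."""
    with tempfile.TemporaryDirectory() as repo:
        # Toy one-file "repository" the patch must fix.
        with open(os.path.join(repo, "module.py"), "w") as f:
            f.write(generate_patch(task["issue"]))
        check = subprocess.run(
            [sys.executable, "-c", "import module; assert module.fixed"],
            cwd=repo,
        )
        return check.returncode == 0

tasks = [{"issue": "the flag should be True after the fix"}]
resolved = sum(evaluate_task(t) for t in tasks)
print(f"resolved {resolved}/{len(tasks)}")
```

The reported score is then just the fraction of tasks where the introduced tests pass, which is what Devin's ~14% refers to.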

  • How did Devin perform on the software engineering benchmark?

    -Devin scored about 14% on the benchmark, beating Claude 2 and GPT-4, which scored 1.7%. Devin was the only model evaluated unassisted, that is, without being told exactly which files to edit before the test.

  • What does the benchmark imply for the future of software engineering?

    -The benchmark suggests performance on software engineering tasks will improve rapidly once systems like Devin are equipped with more advanced models such as GPT-5, which could lead to greater automation and change how software engineering is done.

  • What is Google DeepMind's SIMA, and what is its goal?

    -SIMA (Scalable Instructable Multiworld Agent) is a Google DeepMind system that aims to develop an instructable agent capable of doing anything a human can in any simulated 3D environment. It uses a mouse and keyboard and takes pixels as input.
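In outline, an agent with that interface is a loop from pixels to keyboard-and-mouse actions. The sketch below is a minimal illustration under stated assumptions: the `policy` function is a hypothetical scripted stand-in for SIMA's vision-language model, not anything DeepMind has published.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    key: Optional[str] = None                # keyboard press, if any
    mouse: Optional[Tuple[int, int]] = None  # cursor target, if any

def policy(frame, instruction):
    """Hypothetical stand-in for SIMA's policy: map one frame of pixels
    plus a natural-language instruction to a single action."""
    if "turn left" in instruction:
        return Action(key="a")
    h, w = len(frame), len(frame[0])
    return Action(mouse=(w // 2, h // 2))    # otherwise, click screen center

def run_episode(frames, instruction):
    """The agent sees only pixels and emits keyboard/mouse actions,
    the same interface a human player has."""
    return [policy(f, instruction) for f in frames]

frames = [[[0] * 64 for _ in range(64)] for _ in range(3)]  # 3 dummy 64x64 frames
actions = run_episode(frames, "turn left")
print(len(actions), actions[0].key)  # 3 a
```

Because the interface is just pixels in, keyboard/mouse out, the same loop applies unchanged to any game, or in principle to any on-screen task.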

  • How generalizable is the SIMA agent, and what effect did training on multiple games have?

    -SIMA showed a strong positive transfer effect, improving its performance on new games after being trained on several different games. In some cases it even outperformed a model specialized in a single game.

  • Which humanoid robot is mentioned, and what is the vision of the CEO behind it?

    -The video mentions a humanoid robot (Figure 01) that uses GPT-4 Vision. The CEO behind it envisions automating manual labor, eliminating the need for dangerous and undesirable jobs, and eventually building new worlds on other planets.

  • What is the estimated cost of the humanoid robot, and how does this affect the labor market?

    -The robot's estimated cost is between $30,000 and $150,000. Although that is too expensive for most companies and individuals, the CEO's vision could lead to a long-term reduction in labor costs and change labor-market dynamics.

  • What changes are expected with the arrival of GPT-5 for systems like Devin and SIMA?

    -With GPT-5, a significant performance increase is expected for systems like Devin and SIMA, since they will benefit from the new model's greater understanding and problem-solving ability.

  • What does the video suggest about control of artificial intelligence?

    -It suggests that, despite advances in AI, there is no clear control over how the technology will be used in the future, and that society should start paying more attention to the rapid changes the world is undergoing because of AI.

  • What is Jeff Clune's view on the arrival of general AI (AGI) and its impact on society?

    -Jeff Clune believes we are getting ever closer to AGI and that no one is really in control. He predicts AI will pass every human test in roughly 5 years and will have a significant impact on society, including automating tasks done today by professionals such as marketers.

Outlines

00:00

🤖 AI Advances: Devin, SIMA, and Robots

This section discusses three key AI developments from the last 48 hours: Devin, an AI system powered by GPT-4 that can carry out programming tasks; Google DeepMind's SIMA, an agent that plays video games; and a humanoid robot that demonstrates a significant improvement in interacting with its environment. These systems, although not yet on par with human performance, point to a future where AI models are upgraded quickly and become more capable of complex tasks.

05:01

🎮 Google DeepMind's SIMA: Gaming and Learning

This section digs into SIMA, an agent developed by Google DeepMind designed to be instructable and capable of performing tasks in simulated 3D environments. SIMA learns from data of humans playing video games and improves when trained on multiple games, showing a positive transfer effect. This indicates that, with suitable training, AI models can generalize skills from one game to another, approaching human performance.

10:02

🤖 Humanoid Robot with GPT-4 Vision

This segment describes a humanoid robot that uses GPT-4 Vision to interact with its environment and perform tasks. Although impressive in its ability to manipulate objects, its intelligence comes from the underlying AI model. The robot can take in images at about 10 frames per second and perform tasks autonomously, without human control. Despite its cost, the CEO behind the robot envisions a future where automation replaces manual labor, reducing labor costs and creating new opportunities on other planets.
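The control loop described, frames in at roughly 10 per second, actions out, can be sketched as follows. Only the ~10 Hz rate comes from the video; `describe_scene` and `pick_action` are hypothetical stand-ins for the vision-language model and the end-to-end policy.

```python
import time

FRAME_HZ = 10              # the video cites roughly 10 images per second
PERIOD = 1.0 / FRAME_HZ

def describe_scene(frame) -> str:
    # Stand-in for the vision-language model (GPT-4 Vision in the demo).
    return "dishes on the rack"

def pick_action(description: str, request: str) -> str:
    # Stand-in for the end-to-end policy that drives the hands.
    return "place_dish" if "dishes" in description else "idle"

def control_loop(frames, request):
    """Perceive, decide, act, once per frame, paced at ~10 Hz."""
    actions = []
    next_tick = time.monotonic()
    for frame in frames:
        actions.append(pick_action(describe_scene(frame), request))
        next_tick += PERIOD
        time.sleep(max(0.0, next_tick - time.monotonic()))  # hold the rate
    return actions

print(control_loop([b"frame"] * 3, "put the dishes away"))
```

The point the video makes is that only the two stand-in functions depend on the underlying model, so swapping in a stronger model upgrades the whole robot without touching the loop.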

15:03

🌐 Perspectives on the Future of AI

Finally, the video discusses the future implications of AI, including comments from industry experts who predict accelerating progress toward general AI (AGI). Predictions are mentioned that AI will replace much human work within just five years, which could lead to a significant reduction in labor costs and a drastic shift in the job landscape. However, it highlights the lack of control over how this technology will be used and concerns about the impact on safe, desirable jobs.

Keywords

💡AI models

Artificial intelligence (AI) models are computer systems designed to perform tasks that normally require human intelligence, such as learning, decision-making, and pattern recognition. The video discusses advances in AI models such as Devin, SIMA, and humanoid robots, highlighting their ability to improve and adapt to specific tasks like software engineering and video-game play. AI models are being developed to "walk the walk," that is, to carry out complex tasks beyond simple spoken interaction.

💡Benchmark

A benchmark is a standard test used to evaluate the performance of a system or model against others. In the context of the video, the software engineering benchmark is a collection of real-world problems and solutions used to measure AI models' ability to solve software issues. Devin, for example, was evaluated on this benchmark, resolving about 14% of the problems.

💡GPT-4 and GPT-5

GPT-4 and GPT-5 are versions of AI language models developed by OpenAI, designed to understand and generate text at an advanced level. GPT-4 is the current version powering many of today's AI systems, while GPT-5 is expected to have even more advanced capabilities; when it launches, systems like Devin are expected to see significant performance improvements.
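The video's "container/shell" argument is that the scaffolding around the model stays fixed while the model behind it is swapped. A minimal sketch of that separation, with a hypothetical client class and model names, nothing here is a real API:

```python
class AgentShell:
    """The scaffolding (editor, shell, browser, planning loop) stays
    fixed; only the model name behind it changes."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def complete(self, prompt: str) -> str:
        # Hypothetical model call; a real shell would hit an API here.
        return f"[{self.model_name}] plan for: {prompt}"

agent = AgentShell("gpt-4")      # today's underlying model
print(agent.complete("fix the bug"))

agent.model_name = "gpt-5"       # the hypothetical overnight swap
print(agent.complete("fix the bug"))
```

This is why the video expects systems like Devin to improve "overnight": the upgrade is a one-line configuration change, while the capability gain comes entirely from the new model.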

💡Reinforcement learning

Reinforcement learning is a type of machine learning in which an agent (such as an AI model) learns to make decisions by maximizing a reward, that is, improving its performance through feedback on outcomes. The video notes that Devin reportedly uses reinforcement learning techniques to improve its ability to solve software engineering problems, implying the system can adapt and learn from its mistakes to improve at specific tasks.
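As a concrete instance of learning from reward feedback, here is a minimal tabular Q-learning sketch on a three-state chain. It illustrates the general mechanism only; Cognition has not disclosed how reinforcement learning is used inside Devin.

```python
import random

random.seed(0)
N_STATES, GOAL = 3, 2              # chain of states 0,1,2; reward only at 2
ACTIONS = (-1, +1)                 # step left or step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(500):               # episodes
    s = 0
    while s != GOAL:
        if random.random() < eps:                          # explore
            a = random.choice(ACTIONS)
        else:                                              # exploit
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0                     # reward signal
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, act)] for act in ACTIONS)
        # Core update: nudge the estimate toward reward + discounted future value.
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy moves right (toward the reward) from every state.
assert Q[(0, +1)] > Q[(0, -1)] and Q[(1, +1)] > Q[(1, -1)]
```

The same idea scales up: replace the table with a neural network and the chain with a coding task whose "reward" is passing tests.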

💡Transfer learning

Transfer learning refers to an AI model's ability to apply knowledge acquired in one domain to other related domains or tasks. The video discusses how SIMA, Google DeepMind's agent, improves its performance on new games after being trained on a variety of games, demonstrating a strong transfer effect. This shows flexibility in the models' capacity to generalize and adapt to new situations.

💡Humanoid robotics

Humanoid robotics refers to building robots with a human-like form or characteristics, designed to perform tasks normally done by humans. The video mentions a humanoid robot that uses GPT-4 Vision to recognize objects and perform tasks, such as doing the dishes, while interacting with its environment. This robot is an example of how AI models can be used to control physical systems and perform tasks that require real-world understanding and object manipulation.

💡Artificial general intelligence (AGI)

Artificial general intelligence (AGI) is AI that can perform any task a human is capable of. The video suggests that advances in models such as Devin, SIMA, and humanoid robots are moving closer to AGI, although it has not yet been achieved. Experts in the video discuss the possibility that advanced AI could significantly impact employment and society in the near future.

💡Human economy

The human economy refers to the economic system that depends on human labor and the production of goods and services. The video raises the possibility that the human economy could change drastically as AI and robotics advance, since these systems can take over tasks traditionally done by humans, potentially reducing the cost of manual labor and changing labor dynamics overall.

💡Programming

Programming is the process of writing code so that a computer or machine executes specific tasks. The video highlights Devin's ability to read and execute code, showing how AI models are evolving to understand and manipulate code more deeply and effectively. This may have significant implications for the future of software engineering work and application development.

💡Video games

Video games are a form of entertainment in which players interact with a virtual world through a console, computer, or mobile device. The video describes how AI models like SIMA are being trained to play and succeed at video games, demonstrating their ability to learn and apply strategies in simulated environments, and possibly to extend those skills to other tasks involving decision-making and problem-solving.

💡Automated work

Automated work refers to replacing human tasks with automated systems, generally with the help of AI and robotics. The video discusses the possibility that automation of work, particularly dangerous or undesirable jobs, could become a reality in the near future thanks to advances in AI and robotics, with a significant impact on the workforce and economic structure.

Highlights

AI models are advancing to the point where they can perform tasks, not just process information.

Three AI developments in the last 48 hours show significant progress in AI capabilities.

Devin, an AI system, is equipped with a code editor, shell, and browser, allowing it to understand prompts and execute tasks.

Devin's performance on the SWE-bench software engineering benchmark was about 14%, outperforming models like Claude 2 and GPT-4.

The software engineering benchmark is based on real-world problems, requiring complex reasoning and coordinated changes across multiple functions.

Devin was tested on a subset of the benchmark, which may not fully represent the scope of software engineering skills.

The benchmark may be biased towards easier problems due to the selection of accepted solutions that introduced new tests.

With the upcoming GPT-5, AI systems like Devin are expected to see significant upgrades overnight.

Google DeepMind's SIMA project involves training AI agents to perform tasks in simulated 3D environments using natural-language instructions.

SIMA agents showed positive transfer effects, performing better on new games after training on multiple games.

The humanoid robot Figure 01 demonstrates impressive dexterity and speed, controlled by an end-to-end neural network.

The humanoid robot's intelligence comes from GPT-4 Vision, which could be upgraded to GPT-5 for even deeper environmental understanding.

Figure's CEO envisions a future where AI automates manual labor, potentially eliminating the need for unsafe and undesirable jobs.

The potential applications of AI extend beyond the current tasks, suggesting that AI could soon perform a wide range of activities on computers and phones.

AI's rapid progress suggests that we are moving closer to AGI (artificial general intelligence), with significant implications for society.

Experts predict that AI will pass every human test in around 5 years, indicating a timeline for when AI could match or exceed human capabilities.

Increasing compute power and algorithmic efficiency suggest that AI advancements will continue to accelerate.

Transcripts

00:00

three developments in the last 48 hours

00:02

show how we are moving into an era in

00:05

which AI models can walk the walk not

00:08

just talk the talk whether the

00:10

developments quite meet the hype

00:12

attached to them is another question

00:14

I've read and analyzed in full the three

00:16

relevant papers and Associated posts to

00:19

find out more we'll first explore Devon

00:21

the AI system your boss told you not to

00:23

worry about then Google Deep Mind SEMA

00:26

which spends most of its time playing

00:28

video games and then figure one the

00:30

humanoid robot which likes to talk while

00:32

doing the dishes but the tldw is this

00:36

these three systems are each a long way

00:39

from Human Performance in their domains

00:41

but think of them more as containers or

00:44

shells for the vision language models

00:46

powering them so when the GPT 4 that's

00:49

behind most of them is swapped out for

00:52

GPT 5 or Gemini 2 all these systems are

00:55

going to see big and hard to predict

00:58

upgrades overnight and that's a point

01:00

that seems especially relevant on this

01:02

the one-year anniversary of the release

01:05

of GPT 4 but let's start of course with

01:08

Devon build as the first AI software

01:11

engineer now Devon isn't a model it's a

01:14

system that's likely based on gp4 it's

01:18

equipped with a code editor shell and

01:20

browser so of course it cannot just

01:23

understand your prompt but look up and

01:25

read documentation a bit like Auto GPT

01:28

it's designed to come up with plans

01:31

first and then execute them but it does

01:33

so much better than Auto GPT did but

01:36

before we get to The Benchmark that

01:37

everyone's talking about let me show you

01:39

a 30-second demonstration of Devon in

01:42

action all I had to do was send this

01:43

blog post in a message to Devon from

01:46

there Devon actually does all the work

01:48

for me starting with reading this blog

01:49

post and figuring out how to run the

01:53

code in a couple minutes Devon's

01:55

actually made a lot of progress and if

01:58

we jump to the middle here

02:00

you can see that Devon's been able to

02:02

find and fix some edge cases and bugs

02:05

that the blog post did not cover for me

02:07

and if we jump to the end we can see

02:10

that Devon uh sends me the final result

02:12

which I love I also got two bonus images

02:16

uh here and here so uh let me know if

02:20

you guys see anything hidden in these it

02:22

can also F tuna model autonomously and

02:25

if you're not familiar think of that as

02:27

refining a model rather than training it

02:29

from scratch that makes me wonder about

02:31

a future where if a model can't succeed

02:34

at a task it fine-tunes another model or

02:37

itself until it can anyway this is The

02:40

Benchmark that everyone's talking

02:41

aboutwe bench software engineering bench

02:44

Devon got almost 14% And in this chart

02:48

crushes Claude 2 and GPT 4 which got

02:50

1.7% they say Devon was unassisted

02:54

whereas all other models were assisted

02:56

meaning the model was told exactly which

02:58

files need to be edited before before we

03:00

get too much further though what the

03:01

hell is this Benchmark well unlike many

03:03

benchmarks they drew from Real World

03:06

professional problems

03:09

2,294 software engineering problems that

03:11

people had and their corresponding

03:13

Solutions resolving these issues

03:15

requires understanding and coordinating

03:17

changes across multiple functions

03:19

classes and files simultaneously the

03:22

code involved might require the model to

03:24

process extremely long contexts and

03:27

perform they say complex reasoning these

03:29

aren't just fill-in the blank or

03:31

multiple choice questions the model has

03:33

to understand the issue read through the

03:35

relevant parts of the codebase remove

03:38

lines and AD lines fixing a bug might

03:40

involve navigating a large repo

03:43

understanding the interplay between

03:44

functions in different files or spatting

03:46

a small error in convoluted code on

03:49

average a model might need to edit

03:50

almost two files three functions and

03:53

about 33 lines of code one point to make

03:56

clear is that Devon was only tested on a

03:58

subset of this Benchmark and the tasks

04:00

in The Benchmark were only a tiny subset

04:03

of GitHub issues and even all of those

04:05

issues represent just a subset of the

04:07

skills of software engineering so when

04:10

you see all caps videos saying this is

04:12

Agi you've got to put it in some context

04:14

here's just one example of what I mean

04:16

they selected only pull requests which

04:18

are like proposed solutions that are

04:21

merged or accepted that solve the issue

04:24

and the introduced new tests would that

04:26

not slightly bias the data set toward

04:28

problems that are easy easier to detect

04:30

report and fix in other words complex

04:32

issues might not be adequately

04:34

represented if they're less likely to

04:36

have straightforward Solutions and

04:38

narrowing down the proposed solutions to

04:40

only those that introduce new tests

04:42

could bias towards bugs or features that

04:44

are easier to write tests for that is to

04:47

say that highly complex issues where

04:49

writing a clear test is difficult may be

04:52

underrepresented now having said all of

04:54

that I might shock You by saying I think

04:56

that there will be rapid Improvement in

04:58

the performance on this Benchmark when

05:00

Devon is equipped with GPT 5 I could see

05:03

it easily exceeding 50% here are just a

05:06

few reasons why first some of these

05:08

problems contained images and therefore

05:10

the more multimodal these language

05:12

models get the better they'll get second

05:14

and more importantly a large context

05:16

window is particularly crucial for this

05:18

task when The Benchmark came out they

05:20

said models are simply ineffective at

05:22

localizing problematic code in a sea of

05:25

tokens they get distracted by additional

05:27

context I don't think that will be true

05:29

for for much longer as we've already

05:31

seen with Gemini 1.5 third reason models

05:34

they say are often trained using

05:35

standard code files and likely rarely

05:38

see patch files I would bet that GPT 5

05:41

would have seen everything fourth

05:42

language models will be augmented they

05:44

predict with program analysis and

05:46

software engineering tools and it's

05:48

almost like they could see 6 months in

05:50

the future because they said to this end

05:52

we are particularly excited about

05:53

agent-based approaches like Devon for

05:56

identifying relevant context from a code

05:58

base I could go on but hopefully that

06:00

background on the Benchmark allows you

06:02

to put the rest of what I'm going to say

06:03

in a bit more context and yes of course

06:05

I saw how Devon was able to complete a

06:08

real job on upwork honestly I could see

06:10

these kind of tasks going the way of

06:12

copywriting tasks on upwork here's some

06:15

more context though we don't know the

06:16

actual cost of running Devon for so long

06:18

it actually takes quite a while for it

06:20

to execute on its task we're talking 15

06:23

20 30 minutes even 60 Minutes sometimes

06:25

as Bindu ready points out it can get

06:27

even more expensive than a human

06:29

although costs are of course falling

06:31

Deon she says will not be replacing any

06:33

software engineer in the near term and

06:35

noted deep learning author franois Shay

06:37

predicted this there will be more

06:39

software Engineers the kind that write

06:40

code in 5 years than there are today and

06:43

newly unemployed Andre carpath says that

06:46

software engineering is on track to

06:47

change substantially with humans more

06:50

supervising the automation pitching in

06:52

high level commands ideas or progression

06:54

strategies in English I would say with

06:56

the way things are going they could

06:58

pitch it in any language and the model

07:00

will understand frankly with vision

07:02

models the way they are you could

07:03

practically mime your code idea and it

07:05

would understand what to do and while

07:07

Devon likely relies on gyd 4 other

07:10

competitors are training their own

07:12

Frontier Scale Models indeed the startup

07:15

magic which aims to build a co-worker

07:18

not just a co-pilot for developers is

07:20

going a step further they're not even

07:22

using Transformers they say Transformers

07:24

aren't the final architecture we have

07:25

something with a multi-million token

07:27

context window super curious of course

07:29

of course how that performs on swe bench

07:32

but the thing I want to emphasize again

07:34

comes from Bloomberg cognition AI admit

07:36

that Devon is very dependent on the

07:38

underlying models and use gpc4 together

07:41

with reinforcement learning techniques

07:43

obviously that's pretty vague but

07:45

imagine when GPT 5 comes out with scale

07:47

you get so many things not just better

07:49

coding ability if you remember gpt3

07:51

couldn't actually reflect effectively

07:53

whereas GPT 4 could if GPT 5 is twice or

07:56

10 times better at reflecting and

07:59

debugging that is going to dramatically

08:01

change the performance of the Devon

08:02

system overnight just delete the GPT 4

08:05

API and put in the GPT 5 API and wait

08:08

Jeff cloon who I was going to talk about

08:10

later in this video has just retweeted

08:13

one of my own videos I literally just

08:15

saw this 2 seconds ago when it came up

08:18

as a notification on my Twitter account

08:20

this was not at all supposed to be part

08:22

of this video but I am very much honored

08:24

by that and actually I'm going to be

08:25

talking about Jeff cloon later in this

08:27

video chances are he's going to see this

08:29

video so this is getting very

08:30

inception-like he was key to Simo which

08:33

I'm going to talk about next the

08:34

simulation hypothesis just got 10% more

08:37

likely I'm going to recover from that

08:39

distraction and get back to this video

08:41

cuz there's one more thing to mention

08:43

about Devon the reaction to that model

08:45

has been unlike almost anything I've

08:47

seen people are genuinely in some

08:50

distress about the implications for jobs

08:52

and while I've given the context of what

08:54

the Benchmark does mean and doesn't mean

08:56

I can't deny that the job landscape is

08:59

incredibly unpredictable at the moment

09:01

indeed I can't see it ever not being

09:03

unpredictable I actually still have a

09:05

lot of optimism about there still being

09:07

a human economy in the future but maybe

09:09

that's a topic for another video I just

09:11

want to acknowledge that people are

09:13

scared and these companies should start

09:15

addressing those fears and I know many

09:17

of you are getting ready to comment that

09:19

we want all jobs to go but you might be

09:21

I guess disappointed by the fact that

09:24

cognition AI are asking for people to

09:27

apply to join them so obviously don't

09:29

anticipate Devon automating everything

09:31

just yet but it's time now to talk about

09:33

Google Deep Mind SEMA which is all about

09:36

scaling up agents that you can instruct

09:39

with natural language essentially a

09:41

scalable instructable commandable

09:44

multi-world agent the goal of SEMA being

09:46

to develop an instructible agent that

09:48

can accomplish anything a human can do

09:51

in any simulated 3D environment their

09:54

agent uses a mouse and keyboard and

09:57

takes pixels as input but if you think

09:59

about it that's almost everything you do

10:01

on a computer yes this paper is about

10:03

playing games but couldn't you apply

10:05

this technique to say video editing or

10:07

say anything you can do on your phone

10:09

now I know I haven't even told you what

10:10

the SEMA system is but I'm giving you an

10:12

idea of the kind of repercussions

10:14

implications if these systems work with

10:17

games there's so much else they might

10:18

soon work with this was a paper I didn't

10:20

get a chance to talk about that came out

10:22

about 6 weeks ago it showed that even

10:24

current generation models could handle

10:26

tasks on a phone like navigating on

10:28

Google Maps apps downloading apps on

10:30

Google Play or somewhat topically with

10:33

Tik Tok swiping a video about a pet cat

10:35

in Tik Tok and clicking a like for that

10:38

video no the success rates weren't

10:40

perfect but if you look at the averages

10:42

and this is for GPT 4 Vision they are

10:44

pretty high 91% 82% 82% these numbers in

10:47

the middle by the way on the left

10:49

reflect the number of steps that GPT 4

10:50

Vision took and on the right the number

10:52

of steps that a human took and that's

10:54

just gpc4 Vision not a model optimized

10:57

for agency which we know that open AI is

11:00

working on so before we even get to

11:02

video games you can imagine an internet

11:04

where there are models that are

11:06

downloading liking commenting doing pull

11:09

requests and we wouldn't even know that

11:11

it's AI it would be as far as I can tell

11:13

undetectable anyway I'm getting

11:15

distracted back to the SEMA paper what

11:17

is SEMA in a nutshell they got a bunch

11:19

of games including commercial video

11:21

games like valheim 12 million copies

11:23

sold at least and their own madeup games

11:26

that Google created they then paid a

11:28

bunch of humans to play those games and

11:31

gathered the data that's what you could

11:32

see on the screen the images and the

11:35

keyboard and mouse inputs that the

11:37

humans performed they gave all of that

11:39

training data to some pre-trained models

11:41

and at this point the paper gets quite

11:43

vague it doesn't mention parameters or

11:45

the exact composition of these

11:47

pre-trained models but from this we get

11:49

the SEMA agent which then plays these

11:51

games or more precisely tries 10sec

11:55

tasks within these games this gives you

This gives you an idea of the range of tasks: everything from taming and hunting to destroying and headbutting. But I don't want to bury the lede. The main takeaway is this: training on more games saw positive transfer when SIMA played a new game. And notice how SIMA, in purple, outperforms across all of these games an environment-specialized agent, that is, one trained for just one game. And there is another gem buried in this graph. I'm colorblind, but I'm pretty sure that's teal or lighter blue: that's zero-shot. What that represents is the model being trained across all the other games bar the actual game it was about to be tested in. And so notice how in some games, like Goat Simulator 3, that outperformed a model specialized for just that one game; the transfer effect was so powerful it outdid the specialized training. Indeed, SIMA's performance is approaching the ballpark of human performance.
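That zero-shot protocol can be sketched as a leave-one-out loop: for each game, train on every other game and evaluate on the held-out one. The trainer and evaluator below are stubs that only check the protocol's bookkeeping; the game names happen to be titles the paper mentions, but nothing here is the paper's actual pipeline.

```python
# Sketch of the zero-shot evaluation protocol: hold out one game,
# train on the rest, and verify the test game never leaks into training.

def train(games: list[str]) -> set[str]:
    """Stub trainer: the 'model' is just the set of games it saw."""
    return set(games)

def evaluate(model: set[str], game: str) -> bool:
    """Stub eval: report whether the game appeared in training."""
    return game in model

games = ["Valheim", "Goat Simulator 3", "No Man's Sky"]
for held_out in games:
    training_set = [g for g in games if g != held_out]
    model = train(training_set)
    # Zero-shot means the test game never appears in the training data.
    assert not evaluate(model, held_out)
    print(f"{held_out}: trained on {len(training_set)} other games, zero-shot OK")
```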

Now, I know we've seen that already with StarCraft 2 and OpenAI beating Dota, but this would be a model generalizing to almost any video game. Yes, even Red Dead Redemption 2, which was covered in an entirely separate paper out of Beijing. That paper, they say, was the first to enable language models to follow the main storyline and finish real missions in complex AAA games. This time we're talking about things like protecting a character, buying supplies, and equipping shotguns. Again, what was holding them back was the underlying model, GPT-4V. As I've covered elsewhere on the channel, it lacks spatial perception; it's not super accurate with moving the cursor, for example. But visual understanding and performance is getting better fast. Take the challenging benchmark MMMU, which is about answering difficult questions that have a visual component. The benchmark only came out recently, giving top performance to GPT-4V at 56.8%, but that's already been superseded: take Claude 3 Opus, which gets 59.4%. Yes, there is still a gap with human expert performance, but that gap is narrowing, like we've seen across this video.

Just like Devin was solving real-world software engineering challenges, SIMA and other models are solving real-world games: walking the walk, not just talking the talk. And again, we can expect better and better results the more games SIMA is trained on. As the paper says, "in every case, SIMA significantly outperforms the environment-specialized agent, thus demonstrating positive transfer across environments." And this is exactly what we see in robotics as well. The key take-home from that Google DeepMind paper was that their results suggest that co-training with data from other platforms imbues RT-2-X with additional skills that were not present in the original dataset, enabling it to perform novel tasks. These were tasks and skills developed by other robots that were then transferred to RT-2, just like SIMA getting better at one video game by training on others.
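The co-training point can be illustrated with a toy calculation: pooling demonstrations from several platforms gives one model a skill coverage that no single platform's dataset has on its own. The platform and skill names below are invented for the example.

```python
# Toy illustration of cross-platform co-training: the pooled dataset
# covers the union of skills demonstrated on each platform.

def skills_covered(datasets: dict[str, set[str]]) -> set[str]:
    """Union of skills demonstrated across all platform datasets."""
    pooled = set()
    for skills in datasets.values():
        pooled |= skills
    return pooled

per_platform = {
    "robot_A": {"pick", "place"},
    "robot_B": {"open_drawer"},
    "robot_C": {"pick", "push"},
}
print(sorted(skills_covered(per_platform)))
# Co-trained coverage (4 skills) exceeds any single platform's (at most 2).
print(max(len(s) for s in per_platform.values()))
```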

But did you notice that smooth segue I did to robotics? It's the final container that I want to quickly talk about. Why do I call this humanoid robot a container? Because it contains GPT-4 Vision. Yes, of course, its real-time speed and dexterity are very impressive, but that intelligence of recognizing what's on the table and moving it appropriately comes from the underlying model, GPT-4 Vision. So of course I have to make the same point: the underlying model could easily be upgraded to GPT-5 when it comes out, and this humanoid would have a much deeper understanding of its environment and of you. As you're talking to it, Figure 01 takes in 10 images per second, and this is not teleoperation; this is an end-to-end neural network. In other words, there's no human behind the scenes controlling this robot.
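A fixed-rate perception-action loop like that 10-images-per-second figure can be sketched as follows. The camera and policy here are stubs; in a real system the vision-language model and the motor controller would run inside the loop, and the sleep would absorb whatever time the model call leaves over in each 100 ms budget.

```python
# Sketch of a fixed-rate perception-action loop (10 Hz, as described
# for Figure 01). Camera and policy are stand-in stubs.
import time

def capture_frame(i: int) -> str:
    return f"frame_{i}"   # stand-in for a camera image

def policy(frame: str) -> str:
    return "hold"         # stand-in for the learned controller

def run_loop(duration_s: float, hz: float = 10.0) -> int:
    period = 1.0 / hz
    frames = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        action = policy(capture_frame(frames))  # perceive, then act
        frames += 1
        # Sleep off the remainder of this cycle's time budget.
        time.sleep(max(0.0, period - (time.monotonic() - start)))
    return frames

processed = run_loop(0.5)  # half a second at 10 Hz
print(processed)           # roughly 5 frames
```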

Figure doesn't release pricing, but the estimate is between $30,000 and $150,000 per robot, still too pricey for most companies and individuals. But the CEO has a striking vision: he basically wants to completely automate manual labor. "This is the roadmap to a positive future powered by AI." He wants to build the largest company on the planet and eliminate the need for unsafe and undesirable jobs. The obvious question is: if it can do those jobs, can't it also do the safe and desirable jobs? I know I'm back to the jobs point again, but all of these questions became a bit more relevant, let's say, in the last 48 hours. The Figure CEO goes on to predict that everywhere from factories to farmland, the cost of labor will decrease until it becomes equivalent to the price of renting a robot, facilitating a long-term, holistic reduction in costs over time. Humans could leave the loop altogether as robots become capable of building other robots, driving prices down even more. Manual labor, he says, could become optional. And if that's not a big enough vision for the next two decades, he goes on: the plan is also to use these robots to build new worlds on other planets. Again, though, we get the reassurance that "our focus is on providing resources for jobs that humans don't want to perform." He also excludes military applications. I just feel like his company, and the world, has a bit less control over how the technology is going to be used than he might think it does.

Indeed, Jeff Clune (of OpenAI, of Google DeepMind's SIMA, and of earlier-on-in-this-video fame) reposted this from Edward Harris: a report commissioned by the US government that he worked on, whose TL;DR was that things are worse than we thought and nobody's in control. I definitely feel we're noticeably closer to AGI this week than we were last week. As Jeff Clune put out yesterday, so many pieces of the AGI puzzle are coming together, and I would also agree that, as of today, no one's really in control. And we're not alone, with Jensen Huang, the CEO of Nvidia, saying that AI will pass every human test in around five years' time. That, by the way, is a timeline shared by Sam Altman. This is a quote from a book that's coming out soon; he was asked what AGI means for marketers. He said: "Oh, for that it will mean that 95% of what marketers use agencies, strategists, and creative professionals for today will easily, nearly instantly, and at almost no cost be handled by the AI. And the AI will likely be able to test its creative outputs against real or synthetic customer focus groups for predicting results and optimizing. Again, all free, instant, and nearly perfect. Images, videos, campaign ideas? No problem." But specifically on timelines, when asked about when AGI will be a reality, he said: "Five years, give or take, maybe slightly longer, but no one knows exactly when or what it will mean for society."

And it's not like that timeline is even unrealistic in terms of compute. Using these estimates from SemiAnalysis, I calculated that just between the first quarter of 2024 and the fourth quarter of 2025 there will be a 14x increase in compute. Then, if you factor in algorithmic efficiency doubling about every nine months, the effective compute at the end of next year will be almost 100 times that of right now.
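The arithmetic behind that effective-compute estimate, assuming the stated 14x hardware increase and efficiency doubling every nine months, compounded over roughly a 24-month window (my reading of "Q1 2024 through Q4 2025"), works out like this:

```python
# Effective compute = hardware growth x algorithmic-efficiency growth.
# The 24-month window is an assumption about how the quarters are counted.

hardware_factor = 14.0
months = 24                      # Q1 2024 through Q4 2025
doubling_period_months = 9
algorithmic_factor = 2 ** (months / doubling_period_months)

effective = hardware_factor * algorithmic_factor
print(round(algorithmic_factor, 1))  # 6.3x from efficiency alone
print(round(effective))              # 89x, i.e. "almost 100 times"
```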

So yes, the world is changing, and changing fast, and the public really needs to start paying attention. But no, Devin is not AGI, no matter how much you put it in all caps. Thank you so much for watching to the end, and of course I'd love to see you over on AI Insiders on Patreon. But regardless, thank you so much for watching, and, as always, have a wonderful day.
