"VoT" Gives LLMs Spacial Reasoning AND Open-Source "Large Action Model"

Matthew Berman
8 May 2024 · 16:25

Summary

TLDR This episode reviews a new technique from Microsoft that makes it possible to control Windows applications through natural language commands, similar to what already exists in the Android environment. The technique, presented under the name 'Visualization of Thought', strengthens spatial reasoning in large language models, opening up new and more effective ways of interacting with devices. The video takes an in-depth look at how the technique is applied to tasks such as natural language navigation and visual navigation, and also presents an open-source project through which the technology can be tried out.

Takeaways

  • 📜 Microsoft released an open-source project called Pi-Win Assistant, which is a large action model for the Windows environment, similar to how the Rabbit R1 controls applications within Android using natural language.
  • 🔍 The project is accompanied by a research paper titled 'Visualization of Thought Elicits Spatial Reasoning in Large Language Models', which outlines how Microsoft achieved spatial reasoning capabilities in large language models.
  • 🧠 Spatial reasoning is the ability to visualize relationships between objects in a 3D or 2D environment, which has been a historically weak area for large language models.
  • 💡 Yann LeCun, the lead of Meta AI, has previously stated that spatial reasoning is a core missing feature that would prevent us from reaching AGI (Artificial General Intelligence).
  • 📈 The paper demonstrates that it is possible to achieve spatial reasoning with large language models using a technique called 'visualization of thought' (VOT) prompting.
  • 📈 VOT prompting involves asking the model to visualize and represent its reasoning state at each step before producing the final output, which significantly improves performance on spatial reasoning tasks (see the prompt sketch after this list).
  • 📊 The research tested three tasks requiring spatial awareness: natural language navigation, visual navigation, and visual tiling, with the 2D grid worlds for the visual tasks encoded as special characters so the models can read them as text.
  • 🚀 The Pi-Win Assistant project allows users to control a Windows environment using natural language, showcasing the practical application of the research findings.
  • 📚 The research paper and the open-source project are available for anyone interested in exploring or utilizing the advancements in large language models for spatial reasoning.
  • 🔑 The success of VOT prompting in enhancing spatial reasoning in large language models could be a significant step towards more sophisticated AI capabilities.
  • ⚙️ The limitations of the approach include potential performance deterioration in less advanced language models or more challenging tasks, highlighting the need for further development and refinement.
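
To make the VoT prompting idea concrete, here is a minimal sketch in Python showing the same spatial reasoning task issued as a conventional prompt, a Chain-of-Thought prompt, and a VoT prompt (the three styles the video contrasts). Only the VoT instruction ("Visualize the state after each reasoning step") is quoted from the video; the task wording and helper name are invented for illustration.

```python
# Illustrative sketch: composing conventional, Chain-of-Thought, and VoT prompts
# for the same spatial reasoning task. Only the VoT suffix is quoted from the
# video; the task text and helper name are assumptions for this example.

TASK = (
    "You are in a 2D grid world and must move from the house to the office,\n"
    "one step at a time (up, down, left, or right), avoiding obstacles."
)

SUFFIXES = {
    "conventional": "",                                       # input -> output, no extra instruction
    "chain_of_thought": "Let's think step by step.",          # reason step by step in text
    "vot": "Visualize the state after each reasoning step.",  # reason and redraw the state each step
}

def build_prompt(task: str, style: str) -> str:
    """Return the task prompt with the suffix for the chosen prompting style."""
    return f"{task}\n{SUFFIXES[style]}".strip()

if __name__ == "__main__":
    for style in SUFFIXES:
        print(f"--- {style} ---")
        print(build_prompt(TASK, style))
        print()
```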

Q & A

  • What is the main feature of 'Pi win assistant'?

    -Pi win assistant is the first open-source 'large action model' that controls human user interfaces using natural language.

  • What does 'visualization of thought' refer to in this context?

    -Visualization of thought refers to a technique that augments large language models with a visual sketchpad for visualizing their reasoning steps and informing the steps that follow.

  • What does the text mean by 'The Mind's Eye'?

    -The Mind's Eye refers to the human ability to create mental images of unseen objects, which makes it possible to imagine an unseen world.

  • Why is spatial reasoning fundamentally important in artificial intelligence?

    -Spatial reasoning is a core skill for interacting with the 3D world, and it is essential in navigation technology, robotics, and autonomous driving.

  • How does 'visualization of thought prompting' help improve the performance of large models?

    -Visualization of thought prompting improves the performance of large models by generating reasoning traces and visualizations in an interleaved manner.

  • What are the three tasks tested in the study to evaluate spatial reasoning ability?

    -The three tasks are natural language navigation, visual navigation, and visual tiling.

  • What does the text mean by 'Chain of Thought'?

    -Chain of Thought is a prompting technique that improves the performance of large language models by having them reason step by step instead of giving the answer directly.

  • What effect can different prompting techniques have on the outcome?

    -Different prompting techniques, such as visualization of thought prompting and Chain of Thought, can positively affect performance on the tasks in question.

  • What are the potential limitations of mental imagery and visual state tracking?

    -Mental imagery and visual state tracking rely on the emerging capabilities of advanced language models, so performance may deteriorate in less advanced models or on more challenging tasks.

  • How does 'Pi win assistant' interact with human user interfaces?

    -Pi win assistant handles human user interfaces by analyzing the instructions, generating the execution steps, and carrying them out step by step.

  • What does the text mean by 'open-source project'?

    -An open-source project is a project anyone can download, use, or build on, which allows the community to contribute to its development.

Outlines

00:00

📘 Introduction to an Open-Source Large Action Model for Windows

The video begins with an introduction to a significant development in the field of AI: an open-source version of a large action model for the Windows environment, similar to how the Rabbit R1 controls applications within the Android environment using natural language. Microsoft has released both a research paper detailing the methodology and an open-source project that can be downloaded and used immediately. The paper, 'Visualization of Thought Elicits Spatial Reasoning in Large Language Models,' explores enhancing large language models with spatial reasoning capabilities, a feature historically lacking in such models and considered by Yann LeCun, the head of Meta AI, a crucial missing piece for achieving AGI (Artificial General Intelligence). The paper demonstrates that spatial reasoning is indeed possible with large language models, using the example of visualizing a walking path from the North Pole to illustrate the concept.

05:01

🎯 Enhancing LLMs with Spatial Reasoning through Visualization of Thought

The video continues to delve into the concept of spatial reasoning, which is integral to human cognition and involves the ability to visualize relationships between objects in a 2D or 3D environment. The paper proposes a method called 'Visualization of Thought' (VOT) prompting, which aims to enhance large language models (LLMs) with a visual spatial sketch pad to visualize their reasoning steps. This method is tested on three tasks requiring spatial awareness: natural language navigation, visual navigation, and visual tiling. The video explains how LLMs can be improved by using VOT prompting, which involves asking the model to visualize and represent its thought process at each step before reaching the output. The results show significant performance improvements in tasks that require spatial reasoning.
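
The "enriched input format" mentioned above can be illustrated with a short sketch: a 2D grid world serialized into a block of text an LLM can read. The emoji markers and layout below are assumptions for illustration, not the paper's exact encoding.

```python
# Minimal sketch: rendering a 2D grid world as text an LLM can read.
# The emoji markers (house, office, obstacle, empty) are assumptions for
# illustration, not the paper's exact encoding.

HOUSE, OFFICE, OBSTACLE, EMPTY = "🏠", "🏢", "🚧", "⬜"

def render_grid(rows, cols, start, goal, obstacles):
    """Render a rows x cols grid world, one marker per cell, row by row."""
    lines = []
    for r in range(rows):
        cells = []
        for c in range(cols):
            if (r, c) == start:
                cells.append(HOUSE)
            elif (r, c) == goal:
                cells.append(OFFICE)
            elif (r, c) in obstacles:
                cells.append(OBSTACLE)
            else:
                cells.append(EMPTY)
        lines.append(" ".join(cells))
    return "\n".join(lines)

if __name__ == "__main__":
    # A 3x3 world: house top-left, office bottom-right, two obstacles in the middle row.
    print(render_grid(3, 3, start=(0, 0), goal=(2, 2), obstacles={(1, 0), (1, 1)}))
```

A text block like this, plus an instruction such as the VoT suffix, is the kind of input the visual navigation and visual tiling tasks feed to the model.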

10:03

🤖 Pi-Win Assistant: An Open Source Large Action Model for Windows

The video introduces 'Pi-Win Assistant,' an open-source large action model that controls human user interfaces using natural language, without the need for prior training on specific tasks. The assistant is demonstrated performing a series of tasks within a Windows environment, such as opening Firefox and navigating to YouTube, by following a sequence of natural language instructions. The assistant's ability to interpret and act on these instructions is based on the techniques discussed in the research paper, showcasing its capability to visualize and execute tasks step by step. The video also highlights the limitations of such systems, noting that they rely on the emerging capabilities of advanced LLMs and may not perform as well on less advanced models or more complex tasks.
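
The video does not show Pi-Win Assistant's source code, so the following is only a hypothetical sketch of the execute-step-by-step loop it describes: a plan produced by the model is carried out one action at a time against the desktop. The plan format, the decide_coordinates helper, and the use of the pyautogui library are all assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of a "large action model" execution loop, NOT Pi-Win
# Assistant's actual code: a plan produced by the model is executed one step
# at a time, pausing between steps so the UI state can be re-inspected.
import time
import pyautogui  # library for programmatic mouse/keyboard control

# Assumed plan format, in the spirit of the steps shown in the video.
plan = [
    {"action": "click", "target": "browser address bar"},
    {"action": "type", "text": "twitter.com"},
    {"action": "press", "key": "enter"},
]

def decide_coordinates(target: str):
    """Placeholder for the step where the model maps a described UI element to
    screen coordinates (as the video's 'AI decision coordinates' log suggests).
    Hard-coded here so the sketch stays self-contained."""
    return (400, 60)

for step in plan:
    if step["action"] == "click":
        x, y = decide_coordinates(step["target"])
        pyautogui.click(x, y)          # click the element the plan refers to
    elif step["action"] == "type":
        pyautogui.write(step["text"])  # type literal text into the focused field
    elif step["action"] == "press":
        pyautogui.press(step["key"])   # press a single key, e.g. Enter
    time.sleep(1)                      # give the UI time to update before the next step
```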

15:05

📈 Performance Evaluation and Practical Applications of LLMs

The video concludes with a discussion on the performance evaluation of different versions of GPT-4, including those with and without visualization capabilities, across various tasks. It is shown that the visualization of thought (VOT) prompting technique significantly outperforms other methods in tasks like route planning, next step prediction, visual tiling, and natural language navigation. The video also touches on the limitations of mental images and visual state tracking in less advanced language models and more challenging tasks. Finally, the video provides examples of practical applications of the Pi-Win Assistant, such as making a new Twitter post, and encourages viewers to explore the research paper and consider a full tutorial on the assistant.

Keywords

💡Large Language Models

Large Language Models (LLMs) are AI models with a very large number of parameters that can process and generate natural language. In the video, these models are given spatial reasoning ability through the Visualization of Thought prompting technique, an area in which they have historically performed poorly. For example, the video discusses how VoT prompting improves LLM performance on spatial reasoning tasks such as navigation and visual tiling.

💡Spatial Reasoning

Spatial reasoning is the ability to understand and reason about spatial relationships between objects. In the video it is described as visualizing the relationships between different objects in a 3D or 2D environment. Spatial reasoning comes naturally to humans but is a challenge for LLMs. The video illustrates it with an example: imagine standing at a point on the North Pole, walking 50 yards in one direction, then turning left and continuing to walk, and asking whether you would ever cross the starting point again.

💡Visualization of Thought (VoT) Prompting

Visualization of Thought (VoT) prompting is an advanced prompting technique that asks LLMs to generate an internal visual image at each step of their reasoning. In the video it is used to enhance the spatial reasoning of LLMs. For example, with VoT prompting an LLM can navigate a 2D grid world or find the correct place to put a domino-like piece in a visual tiling task.

💡Natural Language Navigation

Natural language navigation is a task in which natural language instructions guide the model's movement through a virtual space. In the video, this involves describing a 3x3 grid and telling the model, step by step, how to move from one point to another. Natural language navigation shows how LLMs can use textual descriptions to understand and carry out spatial tasks, one example of spatial reasoning applied to the real world.
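
As a rough illustration of the bookkeeping this task demands, the sketch below replays a sequence of movement instructions on a 3x3 grid and reports the final cell, which is what the model has to track purely in text. The instruction wording and grid are assumptions, not the paper's exact setup.

```python
# Illustrative sketch: replaying natural-language movement instructions on a
# 3x3 grid and reporting where you end up, i.e. the state tracking the LLM
# must do in its "mind's eye". The wording and grid are assumptions.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def follow_instructions(start, instructions, rows=3, cols=3):
    """Return the final (row, col) after applying each instruction in order,
    clamping moves so the walker stays on the grid."""
    r, c = start
    for word in instructions:
        dr, dc = MOVES[word]
        r = max(0, min(rows - 1, r + dr))
        c = max(0, min(cols - 1, c + dc))
    return (r, c)

if __name__ == "__main__":
    steps = ["right", "right", "down", "down", "left"]
    print(follow_instructions((0, 0), steps))  # -> (2, 1)
```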

💡Visual Navigation

Visual navigation is a task that requires the model to navigate a 2D grid world using visual cues. The model must generate navigation instructions to move in four directions (left, right, up, down) while avoiding obstacles. The video notes that the visual navigation task includes two subtests, route planning and next-step prediction, both of which require multi-hop spatial reasoning.
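
Route planning on such a grid has a classical ground-truth solution, which makes it easy to check a route an LLM proposes. The sketch below is a generic breadth-first search over a grid with obstacles; it is standard BFS for illustration and comes from neither the paper nor the video.

```python
# Generic breadth-first search over a grid with obstacles: a ground-truth
# route planner against which an LLM's proposed route could be checked.
from collections import deque

def plan_route(rows, cols, start, goal, obstacles):
    """Return a shortest list of moves ('up'/'down'/'left'/'right') from start
    to goal, or None if the goal is unreachable."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in moves.items():
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in obstacles and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

if __name__ == "__main__":
    print(plan_route(3, 3, (0, 0), (2, 2), obstacles={(1, 0), (1, 1)}))
    # -> ['right', 'right', 'down', 'down']
```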

💡Visual Tiling

Visual tiling is a classic spatial reasoning challenge that asks the model to understand, organize, and reason about shapes within a confined grid area. In the video, the model is given a grid containing pieces of different colors and shapes and asked to find where a new piece can be placed. This tests an LLM's ability to understand and manipulate shapes in a constrained space.
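
The core check behind the visual tiling question ("does this piece fit here?") can be sketched in a few lines: given cells that are already occupied, find every position where a horizontal 1x4 piece still fits. The grid size and occupied cells below are assumptions for illustration, not an example from the paper.

```python
# Illustrative sketch of the visual tiling check: where does a horizontal 1x4
# piece fit in a grid with some cells already occupied? Grid size and occupied
# cells are assumptions for illustration only.

def placements_for_1x4(rows, cols, occupied):
    """Return every (row, col) where a horizontal 1x4 piece fits, i.e. the
    piece would cover columns col..col+3 of that row without overlap."""
    spots = []
    for r in range(rows):
        for c in range(cols - 3):
            if all((r, cc) not in occupied for cc in range(c, c + 4)):
                spots.append((r, c))
    return spots

if __name__ == "__main__":
    occupied = {(0, 0), (0, 1), (1, 2), (2, 4)}  # cells already filled by other pieces
    print(placements_for_1x4(3, 5, occupied))  # -> [(2, 0)], the bottom row's first four cells are free
```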

💡Pi Win Assistant

Pi Win Assistant is the open-source project mentioned in the video: the first open-source large action model built to control human user interfaces entirely through natural language. The project uses the same spatial reasoning techniques discussed in the video, allowing users to control applications in a Windows environment with natural language instructions.

💡Chain of Thought

Chain of Thought is a prompting technique that asks LLMs to show, step by step, how they get from the input to the output. In the video, this technique is used to improve LLM performance on spatial reasoning tasks. For example, with Chain of Thought prompting the model understands and executes navigation instructions better because it must lay out its reasoning process step by step.

💡Zero-Shot Prompting

Zero-shot prompting is a method that asks the model to produce an output directly, without relying on few-shot demonstrations or text-to-image visualization. In the video, this approach is used to evaluate the effectiveness of VoT for spatial reasoning. Zero-shot prompting lets the model respond to the input without any prior examples, which is particularly relevant for spatial reasoning tasks.
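
The contrast with few-shot prompting can be shown with two prompt strings. Both strings below are invented for illustration; the point is simply that the zero-shot form used for the VoT evaluation includes no worked demonstrations.

```python
# Illustrative contrast between zero-shot and few-shot prompting. Both strings
# are invented for illustration; VoT is evaluated in the zero-shot setting,
# i.e. no worked demonstrations are placed in the prompt.

TASK = "Navigate from the house to the office on the grid below, avoiding obstacles."

zero_shot_prompt = (
    f"{TASK}\n"
    "Visualize the state after each reasoning step."   # VoT instruction, no examples
)

few_shot_prompt = (
    "Example 1: <a worked navigation example would go here>\n"
    "Example 2: <another worked example would go here>\n"
    f"{TASK}"                                           # demonstrations precede the task
)

if __name__ == "__main__":
    print(zero_shot_prompt)
    print("---")
    print(few_shot_prompt)
```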

💡Performance Improvements

Performance improvements in the video refer to the significant gains LLMs achieve on spatial reasoning tasks when VoT and other advanced prompting techniques are used. For example, the video shows that GPT-4 with VoT prompting has a noticeably higher success rate than models without these techniques on route planning, next-step prediction, visual tiling, and natural language navigation.

💡Limitations

The limitations mentioned in the video are that, although VoT and other prompting techniques can improve the spatial reasoning of LLMs, they rely on the emerging capabilities of advanced LLMs and may therefore lead to performance deterioration in less advanced language models or on more challenging tasks. This underscores the need to consider model capability and task complexity when applying these techniques.

Highlights

Microsoft has released an open-source version of a large action model for the Windows environment, similar to how the Rabbit R1 controls applications within the Android environment using natural language.

The research paper 'Visualization of Thought Elicits Spatial Reasoning in Large Language Models' outlines a method to give large language models spatial reasoning capabilities.

Spatial reasoning is the ability to visualize relationships between objects in a 3D or 2D environment, a feature historically lacking in large language models.

The paper demonstrates that it's possible to achieve spatial reasoning with large language models, contrary to previous beliefs.

The 'Mind's Eye' concept allows humans to create mental images of unseen objects and actions, a cognitive capacity that the research aims to replicate in large language models.

Visualization of Thought (VOT) prompting is proposed to elicit the 'Mind's Eye' of large language models for spatial reasoning.

VOT adopts zero-shot prompting and is evaluated for effectiveness in tasks requiring spatial awareness, such as natural language navigation, visual navigation, and visual tiling.

The research introduces a 2D grid world for visual navigation and tiling tasks, using special characters as enriched input formats for large language models.

VOT prompting significantly improves the performance of large language models on corresponding spatial reasoning tasks.

Spatial reasoning is crucial for various aspects of life, including navigation, robotics, and autonomous driving.

The paper presents an algorithm to denote the graph and walking path for spatial reasoning challenges.

The Pi-Win Assistant project is an example of an open-source large action model that controls human user interfaces using natural language, leveraging the techniques from the research paper.

The Pi-Win Assistant can perform tasks within the Windows environment, such as opening applications and navigating web pages, through step-by-step instructions.

The assistant uses visualization at each step, similar to the Chain of Thought prompting, to generate a trace of the path taken to reach the solution.

The research shows that VOT prompting technique outperforms other methods across various tasks, indicating its effectiveness in enhancing spatial reasoning in large language models.

Limitations of the approach include potential performance deterioration in less advanced language models or more challenging tasks.

The research paper and the Pi-Win Assistant project provide a foundation for future developments in enhancing spatial reasoning capabilities of large language models.

Transcripts

00:00

today we have an open-source large action

00:04

model so very similar to how the rabbit

00:06

R1 can control applications within the

00:08

Android environment just by speaking

00:10

natural language we now have a

00:12

completely open-source version of that

00:13

for the windows environment released by

00:15

Microsoft so not only did Microsoft

00:18

release a research paper outlining how

00:20

they were able to achieve it they also

00:22

have an open-source project which you

00:24

can download and use right away and I'm

00:26

going to show you that today so first

00:28

let's go over the white paper this is

00:30

called visualization of thought elicits

00:32

spatial reasoning and large language

00:34

models and essentially what this paper

00:36

describes is a way to give large

00:39

language models spatial reasoning and if

00:42

you're not familiar with what spatial

00:43

reasoning means it's basically just the

00:45

ability to visualize the relationships

00:47

in a 3D environment or even a 2d

00:50

environment between different objects

00:52

and this is something that large

00:53

language models have historically done

00:56

really poorly and the lead of meta AI

00:58

Yann LeCun has talked about this as

01:00

being a core missing feature of large

01:02

language models that will prevent us

01:04

from reaching AGI but in this paper they

01:07

show that it's actually possible to get

01:09

spatial reasoning out of large language

01:11

models so let me give you an example of

01:13

what spatial reasoning is in your mind

01:15

think about this you're standing at a

01:17

point on the North Pole and you start

01:20

walking and you walk 50 yards in One

01:22

Direction then you turn left and then

01:25

you continue to walk indefinitely now

01:27

think about this if you continued

01:29

walking would you ever cross over that

01:32

initial point now you're doing all of

01:34

this spatial reasoning in your head

01:36

through what's called your mind's eye

01:38

language isn't really involved when

01:41

you're thinking through this problem and

01:43

that is what spatial reasoning is and

01:45

that is why Yann LeCun thinks spatial

01:47

reasoning is not possible with language

01:49

models alone but according to this paper

01:52

it definitely is so let me get into it

01:54

and remember stick around to after this

01:56

because I'm actually going to show it to

01:57

you in action in an open source project

02:00

so this is out of Microsoft research so

02:02

in the beginning it talks about how

02:03

large language models are really great

02:05

however their abilities in spatial

02:06

reasoning a crucial aspect of human

02:08

cognition remain relatively unexplored

02:11

humans possess a remarkable ability to

02:14

create mental images of unseen objects

02:16

and actions through a process known as

02:17

The Mind's Eye enabling the imagination

02:20

of the Unseen World inspired by this

02:22

cognitive capacity we propose

02:24

visualization of thought prompting and

02:27

I'm going to show you why this will

02:28

translate into a large action model

02:31

because right now it's called

02:32

visualization of thought but if we take

02:34

this technique and we apply it to a user

02:37

interface we can actually control that

02:38

user interface and that's essentially

02:40

what a large action model is so let's

02:42

look at this diagram this is what is

02:45

happening in the human mind we have

02:47

visuals we have verbal language we put

02:50

it all together in what is called The

02:52

Mind's Eye and then we put together a

02:54

mental image of whatever we're thinking

02:57

about now on the right side is what is

02:59

the The Mind's Eye of large language

03:01

models so really we only have text

03:03

language we put it all together in what

03:06

is the large language models Mind's Eye

03:09

and then we come up with what is a

03:10

mental image so can we actually achieve

03:13

that with a large language model well

03:15

let's find out so here is conventional

03:18

prompting you have an input and then you

03:20

get an output and then we have more

03:21

advanced prompting techniques like Chain

03:23

of Thought So it's an input and then

03:25

walk me through thought by thought how

03:27

you get to the output and what we found

03:30

is when you use Chain of Thought

03:31

prompting and other prompting techniques

03:33

like reflection you actually improve the

03:35

performance of the large language model

03:38

pretty greatly actually then we have

03:40

visualization of thought we have the

03:41

input and then we ask it to have a

03:44

thought and to represent the

03:46

visualization at each step along the way

03:49

before we get to the output and this is

03:51

all theoretical I'm going to show you

03:52

actual examples of it in a second so

03:55

humans can enhance their spatial

03:56

awareness and inform Decisions by

03:58

creating mental images during the

03:59

spatial reasoning process similarly

04:02

large language models can create

04:04

internal mental images we propose the

04:06

visualization of thought prompting to

04:08

elicit The Mind's Eye of llms for

04:10

spatial reasoning so spatial reasoning

04:12

is super important in basically every

04:14

aspect of life whether you're driving

04:16

playing video games playing chess just

04:19

walking everything you're doing is using

04:21

spatial awareness as long as you're

04:23

interacting with your 3D World so let's

04:25

talk about visualization of thought vot

04:27

prompting to elicit this ability this

04:29

being spatial awareness this method

04:31

augments llms with a visual spatial

04:34

sketch pad to visualize their reasoning

04:36

steps and inform subsequent steps vot

04:39

adopts zero shot prompting instead of

04:40

relying on few shot demonstrations or

04:43

textto image visualization with clip to

04:46

evaluate the effectiveness of vot and

04:48

spatial reasoning we selected three

04:50

tasks that require spatial awareness in

04:52

llms including natural language

04:54

navigation visual navigation and visual

04:56

tiling and I'll explain what all three

04:58

of those things are we designed 2D grid

05:01

worlds using special characters as

05:03

enriched input formats for the llms in

05:06

visual navigation and visual tiling

05:08

tasks now remember large language models

05:11

can't interpret graphs like if we were

05:13

to put together a 2d tile and just pass

05:16

it to the large language model it

05:18

wouldn't really understand it we have to

05:20

represent that 2D space with natural

05:23

language and you'll see how they do it

05:25

so vot prompting proposed in this paper

05:27

consistently induces llms to visualize

05:30

the reasoning steps and inform

05:31

subsequent steps and consequently this

05:34

approach achieved significant

05:36

performance improvements on the

05:37

corresponding tasks so let's look at

05:39

this we have a bunch of 2D grids right

05:41

here and they're of different sizes and

05:44

they have different objects within them

05:46

so let's look at this k equals 2 so the

05:48

house is the starting point and the

05:50

office is the ending point and what

05:52

we're going to do is we're going to ask

05:54

the large language model to navigate

05:56

step by step from the house to the

05:59

office it's easy for humans to do this

06:01

right go right go right go up go up and

06:04

that's it and obviously we can get more

06:06

complicated but it's still super easy in

06:08

fact we don't really even need to go

06:10

step by step we can kind of just look at

06:12

it and go all the way through just by

06:14

thinking about it but if we had to we

06:16

could describe it up up left left up up

06:19

Etc but this is spatial awareness this

06:21

is spatial reasoning and this is very

06:23

difficult for large language models to

06:25

date but not anymore so spatial

06:27

reasoning refers to the ability to

06:29

comprehend and reason about the spatial

06:30

relationships among objects their

06:32

movements and interactions and these can

06:34

be applied in the context of technology

06:38

to navigation Robotics and autonomous

06:40

driving so here they say in this context

06:42

a square map is defined by a sequence of

06:45

random walk instructions along

06:46

corresponding objects denoted as and

06:49

then they actually just give the

06:50

algorithm to denote the graph and the

06:54

walking path then we have visual

06:56

navigation so visual navigation task

06:58

presents a synthetic 2D grid world to

07:00

llm challenging it to navigate using

07:02

visual cues the model must generate

07:04

navigation instructions to move in four

07:06

directions left right up down what we

07:07

just talked about to reach the

07:09

destination from the starting point

07:10

while avoiding obstacles this involves

07:13

two subtests route planning and Next

07:15

Step prediction requiring multihop

07:17

spatial reasoning while the former is

07:19

more complex and here is the formulation

07:22

of it so it's represented by a formula

07:24

rather than just passing in like an

07:26

image of that 2D grid then we have

07:29

visual tiling and that is what we're

07:31

seeing right here in these examples and

07:33

let me just talk about that for a second

07:35

polyomino tiling is a classic spatial

07:38

reasoning challenge we extend this

07:40

concept to test the lm's ability to

07:42

comprehend organize and reason with

07:43

shapes in a confined area so essentially

07:46

you have a grid with different colors

07:48

different shapes really and you are

07:51

tasked with finding a place for a new

07:53

object now if we just look at this we

07:55

can tell that within this grid right

07:58

here we can place

08:00

this red 4X one or

08:03

1x4 object right here okay so that is

08:07

essentially what this test is

08:09

accomplishing now the really important

08:11

part of vot prompting is visualizing at

08:15

each step so it's kind of like Chain of

08:18

Thought we're not just saying okay do it

08:19

all at once it's I want to see a trace

08:22

of the path step by step as you go along

08:24

the way so we introduce vot prompting

08:27

and it just starts really simply

08:29

visualize the state after each reasoning

08:32

step this new paradigm for spatial

08:34

reasoning aims to generate reasoning

08:35

traces and visualizations in an

08:37

interleaved manner so let's look at the

08:40

one on the left first so this is visual

08:43

navigation we've already seen this so we

08:45

have the house right here and the llm is

08:47

supposed to navigate through all of

08:49

these empty squares so the ones with

08:52

gates in them cannot be navigated

08:53

through all the way down to the office

08:56

and what we're seeing down here is the LLM

08:59

doing that and doing it step by step

09:01

so step one move right step two move

09:05

down step three move left move down move

09:08

left move down and they reached it same

09:11

with visual tiling and what we're doing

09:13

is we provide it with this grid and

09:16

three different objects so 1x4 this is

09:19

essentially Tetris objects and we say

09:21

can you fit all of them into this grid

09:24

and so it says okay well let's look at I

09:26

where does that go then let's look at l

09:29

where does that go and then let's look

09:30

at T where does that go and then it is

09:32

able to accomplish that and get them all

09:34

in there and then here we have natural

09:36

language navigation so we describe a 3X3

09:39

grid and we tell it step by step what it

09:42

needs to do and we're actually giving it

09:44

the steps and then at the end we say

09:46

okay where are you what did you find and

09:48

so we're visualizing each step and the

09:51

one with stars on it is where the large

09:53

language model thinks it is in the

09:55

current state so step two it's w step

09:58

three it's c all the way up to step

10:00

seven s and so on and then finally we're

10:02

at C and so they tested four different

10:05

versions and they're using GPT 4 so

10:08

first GPT-4 with Chain of Thought so let's

10:10

think step by step GPT 4 without

10:13

visualization so don't use visualization

10:15

the techniques that we're talking about

10:16

today let's think step by step then GPT-4

10:20

with vision so the ability to interpret

10:23

what's in an image let's think step by

10:25

step and then GPT-4 with VoT so visualize

10:29

the state after each reasoning step now

10:32

let's look at the performance so as you

10:34

can see all the Bold across the board is

10:36

where it performed best so first for

10:39

route planning we have the completing

10:40

rate and we have GPT 4 with vot as the

10:45

best then we have the success rate far

10:49

superior nearly 50% greater than the

10:52

second place GPT 4 without visualization

10:55

Next Step prediction visual tiling and

10:58

natural language navig ation across the

11:00

board vot prompting technique just wins

11:04

it's really impressive so does that mean

11:06

that different prompting techniques

11:08

actually affect the outcome well yeah I

11:10

mean that's obvious right so what it

11:12

says here is in the setting GPT-4 CoT

11:15

Chain of Thought without explicit

11:17

visualization prompts it demonstrated

11:20

noticeable tracking rate across almost

11:22

all tasks except route planning the fact

11:25

implies that llm innately exhibit this

11:28

capability of visual State tracking when

11:30

spatial temporal simulation is necessary

11:33

for reasoning and in this figure we're

11:36

also seeing the difference between

11:37

asking it to visualize and output the

11:40

visualization at each step along the way

11:43

versus just at least one step so here is

11:46

the complete tracking rate which means

11:47

it's visualizing at every single step

11:49

route planning completely dominates for

11:52

Next Step prediction does a lot better

11:54

visual tiling and so on natural language

11:57

so this purple is GPT-4 with VoT on the

12:00

right side is partial tracking rate

12:02

which means at least one step had the

12:05

visualization and what we're seeing here

12:06

is similar results except for Next Step

12:09

prediction in which GPT-4 with CoT Chain

12:12

of Thought actually performs pretty darn

12:14

well so one last thing before I actually

12:16

show you the examples what are the

12:17

limitations so both mental images and

12:20

visual State tracking rely on the

12:21

emerging ability of advanced llms

12:24

therefore it might cause performance

12:26

deterioration in less Advanced language

12:28

models or more challenging tasks so here

12:31

is the project it's called Pi win

12:34

assistant and it's described as the

12:36

first open source large action model

12:38

generalist artificial narrow

12:40

intelligence that controls completely

12:42

human user interfaces only by using

12:44

natural language so they reference this

12:46

paper this is actually how I found the

12:48

paper and it uses the same techniques to

12:50

control a Windows environment so they

12:52

give you this cute little character in

12:53

the right and you can essentially task

12:55

it with anything you want so let's look

12:58

at a few examples all right so what

13:00

we're going to be seeing is an example

13:02

in the windows environment we have this

13:04

little assistant right there and you can

13:06

tell it to do different things so the

13:08

first thing we're going to tell it or

13:09

the first thing that the video tells it

13:10

is to open Firefox open Firefox click on

13:15

YouTube click on YouTube so it's giving

13:17

it a series of things to do clicking onto the

13:19

element without visioning

13:23

context okay so it clicked on YouTube

13:26

Okay so let's take a look at actually

13:27

what's happening so you click clicked on

13:29

the assistant you dragg me so that's

13:30

just the person dragging the little

13:32

assistant around then we say open

13:33

Firefox so it responds with clicking on

13:36

click on YouTube selected application

13:38

Mozilla Firefox then AI decision

13:41

coordinates it actually finds the

13:42

coordinates then it says clicking on the

13:44

search input and so on so let's keep

13:47

watching so there we go type Rick Roll

13:50

type Rick

13:51

Roll click on search click on search

13:54

clicking onto the element without

13:56

visioning context

13:59

click on the second video okay so we're

14:02

just telling it what to do and it's able

14:04

to do that this is essentially open

14:06

interpreter but it works really really

14:08

well clicking onto the element without

14:11

visioning

14:13

context and there we go so it was able

14:16

to do that I'm going to mute it because

14:18

I don't want to get copyright striked and

14:20

it's playing the video now so it's just

14:22

step by step telling it exactly what it

14:23

needs to do there it said to mute it so

14:25

it clicked on the mute button again it

14:27

has no training as to what is on the

14:31

screen or how to click it's figuring it

14:33

out as it goes and it's asking to

14:35

visualize it at each step so very

14:38

impressive all right so let's look at

14:39

this next example by the way this is an

14:42

awesome background so the user has given

14:44

it the instruction make a new post on

14:46

Twitter saying hello world and a brief

14:48

greeting explaining your an artificial

14:50

intelligence and then here's the prompt

14:52

here's another prompt it is analyzing

14:55

what to do generating the test case and

14:57

then it actually interestingly iterates

15:00

on the prompt automatically and then it

15:02

says current status so that is where

15:05

it's representing what it currently

15:07

understands it's basically the

15:08

visualization at each step so let's keep

15:11

watching so add SPAC map click on what

15:13

is happening okay then it generates the

15:16

actions right here so step click on the

15:18

browser address bar enter twitter.com

15:21

wait for the Twitter homepage to load so

15:24

it's giving the entire set of actions it

15:27

needs to accomplish and it's going to go

15:29

through it step by step so it's actually

15:31

asking it to do the planning up front

15:33

well let's watch it so selected element

15:35

locate the address it shows the

15:38

coordinates of the address bar clicks on

15:40

it enters twitter.com there we go okay

15:43

found the address bar right there

15:46

entered the tweet and then hopefully

15:48

they're going to push post but here we

15:50

go we can see every single step along

15:52

the

15:54

way very cool so let's look at some of

15:56

the cases these are proven cases working

15:59

cases so open a new tab with the song

16:01

click on the button send a list of steps

16:03

to make a joke about engineers whilst

16:05

making it essay and so on and so forth

16:07

so it's actually a lot of really cool

16:10

implementations of this so I encourage

16:13

you to check this out read the research

16:14

paper if you're interested if you want

16:16

to see me do a full tutorial of Pi-Win

16:18

assistant let me know in the comments

16:19

I'm happy to do that if you enjoyed this

16:21

video please give a like And subscribe

16:23

and I'll see you in the next one

