"VoT" Gives LLMs Spacial Reasoning AND Open-Source "Large Action Model"
Summary
TLDR This episode covers a new technique from Microsoft that makes it possible to control Windows applications with natural-language commands, similar to what already exists in the Android environment. The technique, called 'Visualization of Thought', strengthens spatial reasoning in large language models, opening up new and more effective ways of interacting with devices. The episode takes a deeper look at how the technique applies to tasks such as natural language navigation and visual navigation, and also introduces an open-source project through which the technology can be tried out.
Takeaways
- 📜 Microsoft released an open-source project called Pi-Win Assistant, a large action model for the Windows environment, similar to how the Rabbit R1 controls applications within Android using natural language.
- 🔍 The project is accompanied by a research paper titled 'Visualization of Thought Elicits Spatial Reasoning in Large Language Models', which outlines how Microsoft achieved spatial reasoning capabilities in large language models.
- 🧠 Spatial reasoning is the ability to visualize relationships between objects in a 3D or 2D environment, which has been a historically weak area for large language models.
- 💡 Yann LeCun, Meta's chief AI scientist, has previously stated that spatial reasoning is a core missing capability that prevents large language models from reaching AGI (Artificial General Intelligence).
- 📈 The paper demonstrates that it is possible to achieve spatial reasoning with large language models using a technique called 'visualization of thought' (VOT) prompting.
- 📈 VOT prompting asks the model to visualize and represent its reasoning state at each step before reaching the output, which significantly improves performance on spatial reasoning tasks (a prompt sketch follows this list).
- 📊 The research tested three tasks requiring spatial awareness: natural language navigation, visual navigation, and visual tiling, using 2D grid worlds represented in natural language for the models to understand.
- 🚀 The Pi-Win Assistant project allows users to control a Windows environment using natural language, showcasing the practical application of the research findings.
- 📚 The research paper and the open-source project are available for anyone interested in exploring or utilizing the advancements in large language models for spatial reasoning.
- 🔑 The success of VOT prompting in enhancing spatial reasoning in large language models could be a significant step towards more sophisticated AI capabilities.
- ⚙️ The limitations of the approach include potential performance deterioration in less advanced language models or more challenging tasks, highlighting the need for further development and refinement.
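To make the difference between these prompting styles concrete, here is a minimal sketch in Python of direct prompting, Chain of Thought, and VOT-style prompting for a toy grid-navigation task. The grid encoding, prompt wording, and the `ask_llm` helper are illustrative assumptions, not the paper's exact prompts or code.

```python
# Illustrative only: three prompting styles for a toy grid-navigation task.
# ask_llm is an assumed placeholder for whatever chat-completion client you use;
# the prompts are paraphrases, not the paper's verbatim wording.

GRID = (
    "H . . .\n"  # H = house (start)
    ". # . .\n"  # # = blocked cell
    ". . . O"    # O = office (goal)
)

TASK = f"Navigate from H to O on this grid, moving up/down/left/right:\n{GRID}\n"

direct_prompt = TASK + "Give the final move sequence."

cot_prompt = TASK + "Let's think step by step."

vot_prompt = (
    TASK
    + "Visualize the state after each reasoning step: after every move, "
      "redraw the grid with your current position marked, then choose the next move."
)

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; swap in your own client here."""
    raise NotImplementedError

# Example usage (uncomment once ask_llm is wired to a real model):
# print(ask_llm(vot_prompt))
```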
Q & A
What is the main feature of 'Pi win assistant'?
-Pi win assistant is the first open-source 'large action model' that controls human user interfaces using only natural language.
What does 'visualization of thought' refer to in this context?
-Visualization of thought refers to a technique that augments large language models with a visuospatial sketchpad so they can visualize their reasoning steps and use them to guide subsequent steps.
What does the text mean by 'The Mind's Eye'?
-The Mind's Eye refers to the human ability to create mental images of unseen objects, which makes it possible to imagine the unseen world.
Why is spatial reasoning fundamentally important for artificial intelligence?
-Spatial reasoning is a core skill for interacting with the three-dimensional world, and it is essential for navigation, robotics, and autonomous driving.
How does 'visualization of thought prompting' improve the performance of large models?
-Visualization of thought prompting improves performance by generating reasoning traces and visualizations in an interleaved manner (see the sketch after this section).
Which three tasks were tested in the study to evaluate spatial reasoning?
-The three tasks are natural language navigation, visual navigation, and visual tiling.
What does the text mean by 'Chain of Thought'?
-Chain of Thought is a prompting technique that improves the performance of large language models by having them reason step by step instead of producing the answer directly.
How can different prompting techniques affect the outcome?
-Different prompting techniques, such as visualization of thought prompting and Chain of Thought, can positively affect performance on the tasks in question.
What are the potential limitations of mental imagery and visual state tracking?
-Mental imagery and visual state tracking rely on the emergent abilities of advanced language models, so performance may deteriorate in less advanced models or on more challenging tasks.
How does 'Pi win assistant' operate human user interfaces?
-Pi win assistant operates human user interfaces by analyzing the instruction, generating the execution steps, and carrying them out step by step.
What does the text mean by 'open-source project'?
-An open-source project is one that anyone can download, use, or build on, allowing the community to contribute to its development.
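To picture the 'interleaved' reasoning-and-visualization trace mentioned in the VOT answer above, the short Python simulation below redraws a toy grid after every move, which is the kind of intermediate state a VOT-prompted model is asked to emit. The grid layout and move list are made up for demonstration.

```python
# Illustrative only: simulates the kind of "visualize the state after each
# reasoning step" trace that VOT prompting asks a model to produce.

def render(grid, pos):
    """Return the grid as text with the current position marked by '*'."""
    rows = []
    for r, row in enumerate(grid):
        rows.append(" ".join("*" if (r, c) == pos else cell
                             for c, cell in enumerate(row)))
    return "\n".join(rows)

grid = [
    ["H", ".", "."],   # H = start
    [".", "#", "."],   # # = obstacle
    [".", ".", "O"],   # O = goal
]

moves = {"down": (1, 0), "up": (-1, 0), "left": (0, -1), "right": (0, 1)}
pos = (0, 0)  # start at the house

for step, move in enumerate(["down", "down", "right", "right"], start=1):
    dr, dc = moves[move]
    pos = (pos[0] + dr, pos[1] + dc)
    print(f"Step {step}: move {move}")
    print(render(grid, pos))
    print()
```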
Outlines
📘 Introduction to Open Source Large Language Models for Windows
The video begins with an introduction to a significant development in the field of AI: an open-source version of a large action model for the Windows environment, similar to how the Rabbit R1 controls applications within the Android environment using natural language. Microsoft has released both a research paper detailing the methodology and an open-source project that can be downloaded and used immediately. The paper, 'Visualization of Thought Elicits Spatial Reasoning in Large Language Models,' explores giving large language models spatial reasoning capabilities, a feature historically lacking in such models and considered by Yann LeCun, Meta's chief AI scientist, a crucial missing piece for achieving AGI (Artificial General Intelligence). The paper demonstrates that spatial reasoning is indeed possible with large language models, using the example of visualizing a walking path from the North Pole to illustrate the concept.
🎯 Enhancing LLMs with Spatial Reasoning through Visualization of Thought
The video then delves into spatial reasoning, which is integral to human cognition and involves the ability to visualize relationships between objects in a 2D or 3D environment. The paper proposes 'Visualization of Thought' (VOT) prompting, which augments large language models (LLMs) with a visuospatial sketchpad for visualizing their reasoning steps. The method is tested on three tasks requiring spatial awareness: natural language navigation, visual navigation, and visual tiling. VOT prompting asks the model to visualize and represent its thought process at each step before producing the output, and the results show significant performance improvements on tasks that require spatial reasoning.
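Because the grid worlds must reach the model as text, one possible serialization is sketched below. The choice of symbols and the prompt wording are assumptions for illustration; the paper's exact enriched input format is not reproduced here.

```python
# Illustrative serialization of a 2D grid world into a text prompt.
# The symbol choices are assumptions for demonstration, not the exact
# input format used in the Microsoft paper.

SYMBOLS = {"start": "🏠", "goal": "🏢", "wall": "🚧", "empty": "⬜"}

grid = [
    ["start", "empty", "empty", "empty"],
    ["empty", "wall",  "wall",  "empty"],
    ["empty", "empty", "empty", "goal"],
]

def grid_to_text(grid):
    """Render the grid row by row so it can be pasted into an LLM prompt."""
    return "\n".join("".join(SYMBOLS[cell] for cell in row) for row in grid)

prompt = (
    "You are navigating a 2D grid. 🏠 is the start, 🏢 is the goal, "
    "🚧 cells are blocked.\n"
    + grid_to_text(grid)
    + "\nVisualize the state after each reasoning step and give the moves "
      "(up/down/left/right) that reach the goal."
)
print(prompt)
```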
🤖 Pi-Win Assistant: An Open Source Large Action Model for Windows
The video introduces 'Pi-Win Assistant,' an open-source large action model that controls human user interfaces using natural language, without the need for prior training on specific tasks. The assistant is demonstrated performing a series of tasks within a Windows environment, such as opening Firefox and navigating to YouTube, by following a sequence of natural language instructions. The assistant's ability to interpret and act on these instructions is based on the techniques discussed in the research paper, showcasing its capability to visualize and execute tasks step by step. The video also highlights the limitations of such systems, noting that they rely on the emerging capabilities of advanced LLMs and may not perform as well on less advanced models or more complex tasks.
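As a rough illustration of the plan-then-act loop described here, the sketch below asks a model to plan steps, queries it for screen coordinates per step, and executes the action with `pyautogui`. It is a simplified stand-in, not the actual Pi-Win Assistant implementation, and `ask_llm` is an assumed helper around whatever LLM API is available.

```python
# Simplified sketch of a natural-language "large action model" loop for Windows.
# Not the real Pi-Win Assistant code; ask_llm is an assumed placeholder.

import json
import pyautogui  # pip install pyautogui

def ask_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError

def run_instruction(instruction: str) -> None:
    # 1. Ask the model to plan the UI steps needed for the instruction.
    plan = ask_llm(f"Break this Windows task into numbered UI steps: {instruction}")

    for step in plan.splitlines():
        if not step.strip():
            continue
        # 2. Ask the model where/how to act, given the step and screen size.
        width, height = pyautogui.size()
        answer = ask_llm(
            f"Screen is {width}x{height}. For the step '{step}', reply with JSON "
            '{"action": "click"|"type", "x": int, "y": int, "text": str}.'
        )
        action = json.loads(answer)

        # 3. Execute the action.
        if action["action"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["action"] == "type":
            pyautogui.typewrite(action["text"], interval=0.02)

        # 4. Report the current status before the next step
        #    (the "visualize the state after each step" idea).
        print(f"Completed: {step}")

# run_instruction("Open Firefox and go to YouTube")
```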
📈 Performance Evaluation and Practical Applications of LLMs
The video concludes with a discussion on the performance evaluation of different versions of GPT-4, including those with and without visualization capabilities, across various tasks. It is shown that the visualization of thought (VOT) prompting technique significantly outperforms other methods in tasks like route planning, next step prediction, visual tiling, and natural language navigation. The video also touches on the limitations of mental images and visual state tracking in less advanced language models and more challenging tasks. Finally, the video provides examples of practical applications of the Pi-Win Assistant, such as making a new Twitter post, and encourages viewers to explore the research paper and consider a full tutorial on the assistant.
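The transcript later distinguishes a 'complete tracking rate' (a visualization at every step) from a 'partial tracking rate' (a visualization at at least one step). A minimal sketch of how such rates could be computed over model traces is shown below; the traces are invented and the paper's actual scoring is not reproduced.

```python
# Illustrative computation of complete vs. partial visual-state tracking rates,
# based on the definitions given in the video: "complete" = a visualization at
# every reasoning step, "partial" = a visualization at at least one step.

traces = [
    [True, True, True],     # visualized at every step
    [True, False, True],    # visualized at some steps
    [False, False, False],  # never visualized
]

complete_rate = sum(all(t) for t in traces) / len(traces)
partial_rate = sum(any(t) for t in traces) / len(traces)

print(f"complete tracking rate: {complete_rate:.2f}")  # 0.33
print(f"partial tracking rate:  {partial_rate:.2f}")   # 0.67
```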
Keywords
💡Large Language Models
💡Spatial Reasoning
💡Visualization of Thought (VOT) Prompting
💡Natural Language Navigation
💡Visual Navigation
💡Visual Tiling
💡Pi Win Assistant
💡Chain of Thought
💡Zero-Shot Prompting
💡Performance Improvement
💡Limitations
Highlights
Microsoft has released an open-source version of a large action model for the Windows environment, similar to how the Rabbit R1 controls applications within the Android environment using natural language.
The research paper 'Visualization of Thought Elicits Spatial Reasoning in Large Language Models' outlines a method to give large language models spatial reasoning capabilities.
Spatial reasoning is the ability to visualize relationships between objects in a 3D or 2D environment, a feature historically lacking in large language models.
The paper demonstrates that it's possible to achieve spatial reasoning with large language models, contrary to previous beliefs.
The 'Mind's Eye' concept allows humans to create mental images of unseen objects and actions, a cognitive capacity that the research aims to replicate in large language models.
Visualization of Thought (VOT) prompting is proposed to elicit the 'Mind's Eye' of large language models for spatial reasoning.
VOT adopts zero-shot prompting and is evaluated for effectiveness in tasks requiring spatial awareness, such as natural language navigation, visual navigation, and visual tiling.
The research introduces a 2D grid world for visual navigation and tiling tasks, using special characters as enriched input formats for large language models.
VOT prompting significantly improves the performance of large language models on corresponding spatial reasoning tasks.
Spatial reasoning is crucial for various aspects of life, including navigation, robotics, and autonomous driving.
The paper presents an algorithm to denote the graph and walking path for spatial reasoning challenges.
The Pi-Win Assistant project is an example of an open-source large action model that controls human user interfaces using natural language, leveraging the techniques from the research paper.
The Pi-Win Assistant can perform tasks within the Windows environment, such as opening applications and navigating web pages, through step-by-step instructions.
The assistant uses visualization at each step, similar to Chain of Thought prompting, to generate a trace of the path taken to reach the solution.
The research shows that VOT prompting technique outperforms other methods across various tasks, indicating its effectiveness in enhancing spatial reasoning in large language models.
Limitations of the approach include potential performance deterioration in less advanced language models or more challenging tasks.
The research paper and the Pi-Win Assistant project provide a foundation for future developments in enhancing spatial reasoning capabilities of large language models.
Transcripts
today we have an open-source large action
model so very similar to how the rabbit
R1 can control applications within the
Android environment just by speaking
natural language we now have a
completely open-source version of that
for the windows environment released by
Microsoft so not only did Microsoft
release a research paper outlining how
they were able to achieve it they also
have an open-source project which you
can download and use right away and I'm
going to show you that today so first
let's go over the white paper this is
called visualization of thought elicits
spatial reasoning and large language
models and essentially what this paper
describes is a way to give large
language models spatial reasoning and if
you're not familiar with what spatial
reasoning means it's basically just the
ability to visualize the relationships
in a 3D environment or even a 2d
environment between different objects
and this is something that large
language models have historically done
really poorly and the lead of meta AI
Yann LeCun has talked about this as
being a core missing feature of large
language models that will prevent us
from reaching AGI but in this paper they
show that it's actually possible to get
spatial reasoning out of large language
models so let me give you an example of
what spatial reasoning is in your mind
think about this you're standing at a
point on the North Pole and you start
walking and you walk 50 yards in One
Direction then you turn left and then
you continue to walk indefinitely now
think about this if you continued
walking would you ever cross over that
initial point now you're doing all of
this spatial reasoning in your head
through what's called your mind's eye
language isn't really involved when
you're thinking through this problem and
that is what spatial reasoning is and
that is why Yann LeCun thinks spatial
reasoning is not possible with language
models alone but according to this paper
it definitely is so let me get into it
and remember stick around to after this
because I'm actually going to show it to
you in action in an open source project
so this is out of Microsoft research so
in the beginning it talks about how
large language models are really great
however their abilities in spatial
reasoning a crucial aspect of human
cognition remain relatively unexplored
humans possess a remarkable ability to
create mental images of unseen objects
and actions through a process known as
The Mind's Eye enabling the imagination
of the Unseen World inspired by this
cognitive capacity we propose
visualization of thought prompting and
I'm going to show you why this will
translate into a large action model
because right now it's called
visualization of thought but if we take
this technique and we apply it to a user
interface we can actually control that
user interface and that's essentially
what a large action model is so let's
look at this diagram this is what is
happening in the human mind we have
visuals we have verbal language we put
it all together in what is called The
Mind's Eye and then we put together a
mental image of whatever we're thinking
about now on the right side is what is
the Mind's Eye of large language
models so really we only have text
language we put it all together in what
is the large language models Mind's Eye
and then we come up with what is a
mental image so can we actually achieve
that with a large language model well
let's find out so here is conventional
prompting you have an input and then you
get an output and then we have more
advanced prompting techniques like Chain
of Thought So it's an input and then
walk me through thought by thought how
you get to the output and what we found
is when you use Chain of Thought
prompting and other prompting techniques
like reflection you actually improve the
performance of the large language model
pretty greatly actually then we have
visualization of thought we have the
input and then we ask it to have a
thought and to represent the
visualization at each step along the way
before we get to the output and this is
all theoretical I'm going to show you
actual examples of it in a second so
humans can enhance their spatial
awareness and inform Decisions by
creating mental images during the
spatial reasoning process similarly
large language models can create
internal mental images we propose the
visualization of thought prompting to
elicit The Mind's Eye of llms for
spatial reasoning so spatial reasoning
is super important in basically every
aspect of life whether you're driving
playing video games playing chess just
walking everything you're doing is using
spatial awareness as long as you're
interacting with your 3D World so let's
talk about visualization of thought vot
prompting to elicit this ability this
being spatial awareness this method
augments llms with a visual spatial
sketch pad to visualize their reasoning
steps and inform subsequent steps vot
adopts zero shot prompting instead of
relying on few shot demonstrations or
text-to-image visualization with CLIP to
evaluate the effectiveness of vot and
spatial reasoning we selected three
tasks that require spatial awareness in
llms including natural language
navigation visual navigation and visual
tiling and I'll explain what all three
of those things are we designed 2D grid
worlds using special characters as
enriched input formats for the llms in
visual navigation and visual tiling
tasks now remember large language models
can't interpret graphs like if we were
to put together a 2d tile and just pass
it to the large language model it
wouldn't really understand it we have to
represent that 2D space with natural
language and you'll see how they do it
so vot prompting proposed in this paper
consistently induces llms to visualize
the reasoning steps and inform
subsequent steps and consequently this
approach achieved significant
performance improvements on the
corresponding tasks so let's look at
this we have a bunch of 2D grids right
here and they're of different sizes and
they have different objects within them
so let's look at this k equals 2 so the
house is the starting point and the
office is the ending point and what
we're going to do is we're going to ask
the large language model to navigate
step by step from the house to the
office it's easy for humans to do this
right go right go right go up go up and
that's it and obviously we can get more
complicated but it's still super easy in
fact we don't really even need to go
step by step we can kind of just look at
it and go all the way through just by
thinking about it but if we had to we
could describe it up up left left up up
Etc but this is spatial awareness this
is spatial reasoning and this is very
difficult for large language models to
date but not anymore so spatial
reasoning refers to the ability to
comprehend and reason about the spatial
relationships among objects their
movements and interactions and these can
be applied in the context of technology
to navigation Robotics and autonomous
driving so here they say in this context
a square map is defined by a sequence of
random walk instructions along
corresponding objects denoted as and
then they actually just give the
algorithm to denote the graph and the
walking path then we have visual
navigation so visual navigation task
presents a synthetic 2D grid world to
llm challenging it to navigate using
visual cues the model must generate
navigation instructions to move in four
directions left right up down what we
just talked about to reach the
destination from the starting point
while avoiding obstacles this involves
two subtests route planning and Next
Step prediction requiring multihop
spatial reasoning while the former is
more complex and here is the formulation
of it so it's represented by a formula
rather than just passing in like an
image of that 2D grid then we have
visual tiling and that is what we're
seeing right here in these examples and
let me just talk about that for a second
polyomino tiling is a classic spatial
reasoning challenge we extend this
concept to test the lm's ability to
comprehend organize and reason with
shapes in a confined area so essentially
you have a grid with different colors
different shapes really and you are
tasked with finding a place for a new
object now if we just look at this we
can tell that within this grid right
here we can place
this red 4x1 or
1x4 object right here okay so that is
essentially what this test is
accomplishing now the really important
part of vot prompting is visualizing at
each step so it's kind of like Chain of
Thought we're not just saying okay do it
all at once it's I want to see a trace
of the path step by step as you go along
the way so we introduce vot prompting
and it just starts really simply
visualize the state after each reasoning
step this new paradigm for spatial
reasoning aims to generate reasoning
traces and visualizations in an
interleaved manner so let's look at the
one on the left first so this is visual
navigation we've already seen this so we
have the house right here and the llm is
supposed to navigate through all of
these empty squares so the ones with
gates in them cannot be navigated
through all the way down to the office
and what we're seeing down here is the LLM
doing that and doing it step by step
so step one move right step two move
down step three move left move down move
left move down and they reached it same
with visual tiling and what we're doing
is we provide it with this grid and
three different objects so 1x4 this is
essentially Tetris objects and we say
can you fit all of them into this grid
and so it says okay well let's look at I
where does that go then let's look at l
where does that go and then let's look
at T where does that go and then it is
able to accomplish that and get them all
in there and then here we have natural
language navigation so we describe a 3X3
grid and we tell it step by step what it
needs to do and we're actually giving it
the steps and then at the end we say
okay where are you what did you find and
so we're visualizing each step and the
one with stars on it is where the large
language model thinks it is in the
current state so step two it's w step
three it's c all the way up to step
seven s and so on and then finally we're
at C and so they tested four different
versions and they're using GPT 4 so
first GPT-4 with Chain of Thought So let's
think step by step GPT 4 without
visualization so don't use visualization
the techniques that we're talking about
today let's think step by step then GPT-4
with vision so the ability to interpret
what's in an image let's think step by
step and then GPT-4 with VOT so visualize
the state after each reasoning step now
let's look at the performance so as you
can see all the Bold across the board is
where it performed best so first for
route planning we have the completing
rate and we have GPT 4 with vot as the
best then we have the success rate far
superior nearly 50% greater than the
second place GPT 4 without visualization
Next Step prediction visual tiling and
natural language navigation across the
board vot prompting technique just wins
it's really impressive so does that mean
that different prompting techniques
actually affect the outcome well yeah I
mean that's obvious right so what it
says here is in the setting GPT-4 CoT
Chain of Thought without explicit
visualization prompts it demonstrated
noticeable tracking rate across almost
all tasks except route planning the fact
implies that llm innately exhibit this
capability of visual State tracking when
spatial temporal simulation is necessary
for reasoning and in this figure we're
also seeing the difference between
asking it to visualize and output the
visualization at each step along the way
versus just at least one step so here is
the complete tracking rate which means
it's visualizing at every single step
route planning completely dominates for
Next Step prediction does a lot better
visual tiling and so on natural language
so this purple is GPT-4 with VOT on the
right side is partial tracking rate
which means at least one step had the
visualization and what we're seeing here
is similar results except for Next Step
prediction in which GPT-4 with CoT Chain
of Thought actually performs pretty darn
well so one last thing before I actually
show you the examples what are the
limitations so both mental images and
visual State tracking rely on the
emerging ability of advanced llms
therefore it might cause performance
deterioration in less Advanced language
models or more challenging tasks so here
is the project it's called Pi win
assistant and it's described as the
first open source large action model
generalist artificial narrow
intelligence that controls completely
human user interfaces only by using
natural language so they reference this
paper this is actually how I found the
paper and it uses the same techniques to
control a Windows environment so they
give you this cute little character in
the right and you can essentially task
it with anything you want so let's look
at a few examples all right so what
we're going to be seeing is an example
in the windows environment we have this
little assistant right there and you can
tell it to do different things so the
first thing we're going to tell it or
the first thing that the video tells it
is to open Firefox open Firefox click on
YouTube click on YouTube so it's giving
it a series of things to do clicking onto the
element without visioning
context okay so it clicked on YouTube
Okay so let's take a look at actually
what's happening so you clicked on
the assistant you dragged me so that's
just the person dragging the little
assistant around then we say open
Firefox so it responds with clicking on
click on YouTube selected application
Mozilla Firefox then AI decision
coordinates it actually finds the
coordinates then it says clicking on the
search input and so on so let's keep
watching so there we go type Rick Roll
type Rick
Roll click on search click on search
clicking onto the element without
visioning context
click on the second video okay so we're
just telling it what to do and it's able
to do that this is essentially open
interpreter but it works really really
well clicking onto the element without
visioning
context and there we go so it was able
to do that I'm going to mute it because
I don't want to get copyright striked and
it's playing the video now so it's just
step by step telling it exactly what it
needs to do there it said to mute it so
it clicked on the mute button again it
has no training as to what is on the
screen or how to click it's figuring it
out as it goes and it's asking to
visualize it at each step so very
impressive all right so let's look at
this next example by the way this is an
awesome background so the user has given
it the instruction make a new post on
Twitter saying hello world and a brief
greeting explaining you're an artificial
intelligence and then here's the prompt
here's another prompt it is analyzing
what to do generating the test case and
then it actually interestingly iterates
on the prompt automatically and then it
says current status so that is where
it's representing what it currently
understands it's basically the
visualization at each step so let's keep
watching so add SPAC map click on what
is happening okay then it generates the
actions right here so step click on the
browser address bar enter twitter.com
wait for the Twitter homepage to load so
it's giving the entire set of actions it
needs to accomplish and it's going to go
through it step by step so it's actually
asking it to do the planning up front
well let's watch it so selected element
locate the address it shows the
coordinates of the address bar clicks on
it enters twitter.com there we go okay
found the address bar right there
entered the tweet and then hopefully
they're going to push post but here we
go we can see every single step along
the
way very cool so let's look at some of
the cases these are proven cases working
cases so open a new tab with the song
click on the button send a list of steps
to make a joke about engineers whilst
making it essay and so on and so forth
so it's actually a lot of really cool
implementations of this so I encourage
you to check this out read the research
paper if you're interested if you want
to see me do a full tutorial of pwin
assistant let me know in the comments
I'm happy to do that if you enjoyed this
video please give a like And subscribe
and I'll see you in the next one