New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus Med Gemini, GPT 2 Chatbot and Scale AI)

AI Explained
2 May 2024 · 20:03

Summary

TLDR: The video script discusses recent developments in the field of AI, focusing on the anticipation of new OpenAI models, possibly named GPT 4.5, which are expected to be released soon. The discussion includes insights from company insiders and government officials, as well as the significance of two newly released papers totaling 90 pages. The script also explores the performance of AI models on benchmarks, particularly in mathematical reasoning, and the potential of Google's Med Gemini model in the medical field. Med Gemini is highlighted for its ability to provide medical answers competitive with doctors and assist in areas like surgery, showcasing the potential for AI to reduce medical errors and improve patient outcomes.

Takeaways

  • 🤖 **Imminent AI Model Release**: Insiders suggest an imminent release of new OpenAI models, possibly named GPT 4.5, optimized for reasoning and planning.
  • 🔍 **AI Safety and Oversight**: There is a lack of UK government safety testing for the latest AI models, despite promises from major companies like Meta and OpenAI.
  • 📈 **Iterative Deployment**: OpenAI is likely to release an iterative model before GPT 5, focusing on gradual rollout and societal influence on AI systems.
  • 🧠 **Generalization in Large Models**: Larger models tend to generalize better to new questions, even when they have seen benchmark questions in their training data, which the Scale AI paper reads as a sign of 'elementary reasoning ability'.
  • 🚀 **Med Gemini's Advancements**: Google's Med Gemini models are competitive with doctors in providing medical answers, showcasing significant innovation in AI for healthcare.
  • 🔋 **Energy and Compute Constraints**: The race for AI model development may soon face energy and data center constraints, affecting the continuous investment in new models.
  • 📚 **Data Set Importance**: The performance of AI models is heavily influenced by the quality of their training data, suggesting that with enough compute power and a good data set, top performance can be achieved.
  • 🧐 **Benchmark Contamination**: Some models may perform well on benchmarks due to having seen similar questions in their training data, which can skew performance metrics.
  • 📉 **Model Limitations**: Despite advancements, there are still limitations in how much AI models can generalize, as seen in their performance on basic high school math questions.
  • 🌐 **Web Search Integration**: Med Gemini uses web search integration to resolve uncertainty in answers, demonstrating the potential of combining AI with external data sources.
  • ⚖️ **Ethical Deployment of AI in Medicine**: The development of AI in medical diagnosis raises ethical questions about the point at which it becomes necessary to deploy AI to assist clinicians to reduce medical errors.

Q & A

  • What is the significance of the rumored release of new OpenAI models?

    -The rumored release is significant because company insiders and government officials describe it as imminent, and because it would show how OpenAI rolls out new systems: most likely as an iterative GPT 4.5 optimized for reasoning and planning rather than a surprise GPT 5.

  • Why did the author test the GPT2 chatbot instead of claiming that AGI has arrived?

    -The author chose to test the GPT2 chatbot to provide a more measured and evidence-based response rather than making sensational claims about the arrival of Artificial General Intelligence (AGI) without proper evaluation.

  • What is the issue with the AI safety summit held in Bletchley?

    -The issue is that major AI companies like Meta and OpenAI promised that the UK government could safety test their latest models before release, but this has not happened for most of them, which raises concerns about the transparency and safety of AI model deployments.

  • Why is the author skeptical about the name 'GPT 5' for the next OpenAI model?

    -The author is skeptical about the name 'GPT 5' because of hints and insider information suggesting that OpenAI might release an iterative model such as GPT 4.5 before a major release like GPT 5.

  • What does the author imply about the importance of data in AI model performance?

    -The author implies that the quality and size of the dataset used for training are crucial for AI model performance, potentially allowing for 'Brute Force' performance improvements with enough computational power and a quality dataset.

  • What is the controversy surrounding the benchmark tests for AI models?

    -The controversy is that some AI models may have been exposed to benchmark questions during their training, leading to artificially high performance results. This issue is known as 'contamination' and affects the reliability of benchmark tests.

  • How does the author describe the potential impact of Med Gemini on the medical field?

    -The author describes Med Gemini as a potentially groundbreaking tool in the medical field, as it can provide medical answers competitive with doctors and assist in areas like surgery, which could significantly reduce medical errors and improve patient outcomes.

  • What is the main concern regarding the deployment of AI in sensitive areas like medicine?

    -The main concern is the ethical and safety implications of deploying AI in medicine. There is a need to ensure that AI models are accurate and reliable enough to assist or potentially replace human clinicians in diagnosing diseases and assisting in procedures.

  • Why did the author find the performance of models like Claude 3 Opus on basic high school questions surprising?

    -The author found it surprising because these models can perform well on complex expert reasoning tasks, yet they struggle with basic high school-level questions, indicating a potential limit in their generalization capabilities.

  • What is the significance of the long context abilities of the Gemini 1.5 series of models?

    -The significance is that these models can process and analyze extremely long documents, such as a 700,000-word electronic health record, which would be a daunting task for a human doctor, enhancing the potential utility of AI in medical diagnostics.

  • How does the author view the competition between Google and Microsoft in the medical AI field?

    -The author views the competition positively, as it drives innovation and improvements in AI capabilities within the medical field, potentially leading to better patient outcomes and more efficient healthcare systems.

  • What is the author's stance on the deployment of AI models like Med Gemini in clinical settings?

    -The author believes that once AI models like Med Gemini demonstrate unambiguous superiority in diagnosing diseases over human clinicians, it becomes unethical not to deploy them in assisting clinicians, considering the potential to save lives by reducing medical errors.

Outlines

00:00

🚀 Imminent Release of New OpenAI Models and AI Developments

The first paragraph discusses recent developments in AI, hinting at the imminent release of new models from OpenAI. It mentions an article from Politico about an AI safety summit where major AI companies like Meta and OpenAI promised the UK government early access to new models for safety testing. Insiders reveal that OpenAI is close to releasing a new model, possibly named GPT 4.5, optimized for reasoning and planning. The paragraph also references two papers that may be more significant than current rumors and discusses the testing of the mysterious 'gpt2 chatbot' model, which was showcased and then withdrawn.

05:02

🧐 Analysis of gpt2 Chatbot Performance and Data's Role in AI

The second paragraph delves into the performance of the gpt2 chatbot, which the author tested and found to be roughly on par with GPT 4 Turbo. It suggests that the data set used for training is crucial for AI performance, as highlighted by James Betker of OpenAI. The paragraph also discusses the importance of compute power and the potential for 'brute forcing' performance with sufficient resources. It touches on the supply constraints of GPUs and the recent release of a refined benchmark for testing mathematical reasoning capabilities of AI models, which revealed issues with data contamination and the generalization abilities of larger models.

10:02

🏥 Google's Med-Gemini: A Breakthrough in Medical AI Assistance

The third paragraph introduces Google's Med-Gemini, a significant advancement in medical AI that is competitive with doctors in providing medical answers. The paper outlines innovations such as inspecting model confidence, using search queries to resolve conflicts, and a fine-tuning loop. Med-Gemini has shown state-of-the-art performance in diagnosing diseases and has the potential to assist in surgery by analyzing video scenes in real-time. The paragraph also discusses the competitive nature of the field, with Google and Microsoft engaging in a positive rivalry to improve medical AI.
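The confidence check and search step described above could be sketched roughly as follows. This is a minimal illustration, not the procedure from the Med-Gemini paper: it estimates confidence from the agreement between several sampled answers (Shannon entropy) rather than from raw logits, and `generate`, `search`, `n_samples`, `max_entropy`, and `max_rounds` are hypothetical names and values chosen for the example.

```python
import math
from collections import Counter
from typing import Callable


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def answer_with_search(
    question: str,
    generate: Callable[[str], str],  # hypothetical: prompt -> one sampled answer
    search: Callable[[str], str],    # hypothetical: query -> retrieved text
    n_samples: int = 5,              # assumed number of samples per round
    max_entropy: float = 0.5,        # assumed confidence threshold, in bits
    max_rounds: int = 2,             # assumed cap on search rounds
) -> str:
    """Sample several answers; if they disagree too much, add search results and retry."""
    prompt = question
    answers = [generate(prompt) for _ in range(n_samples)]
    rounds = 0
    while answer_entropy(answers) > max_entropy and rounds < max_rounds:
        # The model is not confident: fetch extra context and ask again.
        context = search(question)
        prompt = f"{prompt}\n\nContext:\n{context}"
        answers = [generate(prompt) for _ in range(n_samples)]
        rounds += 1
    # Return the majority answer from the final round.
    return Counter(answers).most_common(1)[0][0]
```

Capping the number of rounds keeps the loop from searching forever when the extra context never resolves the disagreement.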

15:03

🤖 Ethical Considerations and Future Prospects of Medical AI

The fourth and final paragraph ponders the ethical implications of deploying AI in medicine, especially when it outperforms human clinicians in diagnostics. It raises the question of when it becomes unethical not to use AI in assisting clinicians, given its potential to reduce medical errors. The paragraph concludes by congratulating the team behind Med-Gemini and expressing optimism about the positive uses of AI, especially in contrast to other concerning autonomous AI deployments.

Keywords

💡AI safety

AI safety refers to the practices and research aimed at ensuring that artificial intelligence systems are developed and deployed in a manner that minimizes risks and maximizes benefits for society. In the video, it is mentioned in the context of an AI safety summit where major AI companies committed to allowing the UK government to safety test their latest models before release.

💡OpenAI models

OpenAI models refer to the series of artificial intelligence systems developed by OpenAI, a research lab focused on creating safe AGI (Artificial General Intelligence). The video discusses the anticipation of a new model release, possibly named GPT 4.5, which is expected to be optimized for reasoning and planning.

💡GPT 4.5

GPT 4.5 is speculated to be an upcoming version of OpenAI's language model series, which is expected to be an iterative improvement over GPT 4. The video suggests that GPT 4.5 might be released before GPT 5, focusing on enhanced reasoning abilities, and that it could be a significant update in the AI field.

💡Data set

A data set is a collection of data used for analysis or machine learning. In the context of the video, it is emphasized that the performance of AI models is heavily influenced by the quality and nature of the data set they are trained on. It is suggested that with enough computational power and a high-quality data set, one can achieve top performance in AI models.

💡Benchmarking

Benchmarking is the process of evaluating a product or system's performance using a set of standardized tests. In the video, it is discussed how AI models are tested for mathematical reasoning capabilities, and how issues like data contamination can affect the reliability of benchmark results.

💡Medical AI

Medical AI refers to the application of artificial intelligence in the healthcare sector, with the aim of enhancing diagnostics, treatment, and overall patient care. The video highlights the potential of Google's Med Gemini model, which is shown to be highly competent in providing medical answers and assisting in areas like surgery.

💡Contamination in benchmarks

Contamination in benchmarks refers to the issue where AI models have been exposed to the data used in benchmark tests during their training, leading to inflated performance metrics. The video discusses how this problem was identified and addressed in a new benchmark created by Scale AI.
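As a rough illustration of the comparison described above, the sketch below scores a model on the original benchmark and on a fresh set of questions, then reports the gap between the two. The function names and the 0.05 threshold are hypothetical placeholders, not values or methods taken from the Scale AI paper.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkScores:
    """Accuracy of one model on the original and the freshly written test set."""
    model: str
    original_acc: float  # fraction correct on the public benchmark
    fresh_acc: float     # fraction correct on the new, unseen questions


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty test set")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


def overfit_gap(scores: BenchmarkScores) -> float:
    """Positive gap means the model does worse on the fresh questions."""
    return scores.original_acc - scores.fresh_acc


def flag_suspect(scores: list[BenchmarkScores], threshold: float = 0.05) -> list[str]:
    """Names of models whose gap exceeds the (assumed) threshold."""
    return [s.model for s in scores if overfit_gap(s) > threshold]
```

A large positive gap is only a hint of contamination. As the video notes, larger models with similar exposure to the questions can still show little or no gap.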

💡Iterative deployment

Iterative deployment is the process of releasing new versions of a product incrementally, allowing for continuous improvement and feedback incorporation. The video mentions the preference for iterative deployment in AI, to avoid surprising the public and to allow for a gradual adjustment and influence over AI systems.

💡Compute

In the context of AI, compute refers to the computational resources, including processing power and memory, required to train and run AI models. The video discusses the importance of compute in achieving state-of-the-art performance in AI models and the potential financial and energy constraints associated with it.

💡Generalization in AI

Generalization in AI is the ability of a model to apply learned knowledge to new, unseen data or situations. The video discusses how larger AI models tend to generalize better, even if they have been exposed to similar questions during training, suggesting they can learn more and apply it to a broader range of problems.

💡Multimodal model

A multimodal model is an AI system capable of processing and understanding multiple types of data, such as text, images, and video. The video highlights Google's Med Gemini as a multimodal model that can interact with various medical data formats, including electronic health records and surgical videos, to assist in diagnostics and procedures.

Highlights

Rumors suggest an imminent release of new OpenAI models, possibly named GPT 4.5, optimized for reasoning and planning.

Insiders reveal that only Google DeepMind has given the UK government early access to AI models, contrary to previous promises.

AI safety concerns are raised as the government has not yet safety-tested the latest models from major AI companies.

The performance of AI models on benchmarks may be influenced by the quality of their training data, as highlighted by a paper from Scale AI.

Large language models like GPT 4 and Claude demonstrate the ability to generalize and perform well on new, unseen questions.

Contamination of benchmark tests by models having seen the questions in their training data is a significant concern.

The paper suggests that larger models can learn elementary reasoning ability during training, even from contaminated data.

Google's Med Gemini model shows state-of-the-art performance in medical question answering, rivaling doctors' capabilities.

Innovations in Med Gemini include using search queries to resolve conflicts in model answers and fine-tuning models with correct answers.
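The fine-tuning step mentioned here can be pictured as a filter over model outputs: generate answers, keep only those whose final answer matches the reference, and reuse the kept examples as training data. The sketch below shows that filter only, under the assumption of simple exact-match grading; the `Generation` type and both function names are hypothetical, and the actual Med Gemini pipeline is not described in enough detail here to reproduce.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    question_id: str
    reasoning: str  # model-written explanation, possibly using search results
    answer: str     # the model's final answer


def normalize(answer: str) -> str:
    """Case- and whitespace-insensitive comparison key (an assumed grading rule)."""
    return answer.strip().lower()


def select_for_finetuning(
    generations: list[Generation],
    gold_answers: dict[str, str],
) -> list[Generation]:
    """Keep only generations whose final answer matches the reference answer."""
    kept = []
    for g in generations:
        gold = gold_answers.get(g.question_id)
        if gold is not None and normalize(g.answer) == normalize(gold):
            kept.append(g)
    return kept
```

Filtering on the final answer alone cannot catch a correct answer reached through faulty reasoning, which is the limitation the video points out.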

Med Gemini's long context abilities allow it to process extensive medical records, which could greatly assist in diagnosis.

The model's performance on medical diagnosis is so advanced that it raises ethical questions about the deployment of AI in healthcare.

Google and Microsoft are in a competitive race to develop the most effective AI models for medical applications.

Med Gemini's multimodal capabilities enable it to analyze images and assist in surgeries, although it has not yet been deployed for ethical reasons.

The paper discusses the potential for improving Med Gemini by restricting its web searches to authoritative medical sources.

Despite its potential, Med Gemini is not open-sourced or widely available due to safety and commercial implications.

The development of AI models like Med Gemini represents a positive use of technology that could save lives by reducing medical errors.

The competition between tech giants to create better AI models for healthcare could lead to significant advancements in medical diagnostics and patient outcomes.

Transcripts

00:00

it has been a somewhat surreal few days

00:02

in AI for so many reasons and the month

00:05

of May promises to be yet stranger and

00:09

according to this under the radar

00:11

article company insiders and government

00:14

officials tell of an imminent release of

00:17

new openai models and yes of course the

00:20

strangeness at the end of April was

00:22

Amplified by the gpt2 chatbot a mystery

00:26

model showcased and then withdrawn

00:28

within days but which I did get to test

00:31

I thought testing it would be a slightly

00:34

more appropriate response than doing an

00:36

all-caps video claiming that AGI has

00:39

arrived I also want to bring in two

00:41

papers released in the last 24 hours 90

00:44

pages in total and read in full they

00:47

might be more significant than any rumor

00:49

you have heard first things first though

00:52

that article from Politico that I

00:54

mentioned and the context to this

00:55

article is this there was an AI safety

00:58

Summit in Bletchley last year near to

01:00

where I live actually in Southern

01:02

England some of the biggest players in

01:04

AI like meta and open AI promised the UK

01:07

government that the UK government could

01:09

safety test the frontier latest models

01:11

before they were released there's only

01:13

one slight problem they haven't done it

01:16

now you might say that's just par for

01:17

the course for big Tech but the article

01:20

also revealed some interesting Insider

01:22

gossip Politico spoke to a host of

01:25

company insiders Consultants lobbyists

01:28

and government officials they spoke

01:30

anonymously over several months and not

01:32

only did we learn that it's only Google

01:34

DeepMind that have given the government

01:36

Early Access we also learned that open

01:39

AI didn't now somewhat obviously that

01:41

tells us that they have a new model and

01:44

that it's very near to release now I

01:46

very much doubt they're going to call it

01:48

GPT 5 and you can see more of my reasons

01:51

for that in the video you can see on

01:53

screen but I think it's more likely to

01:55

be something like GPT 4.5 optimized for

01:59

reasoning and planning now some of you

02:01

might be thinking is that all the

02:03

evidence you've got that a GPT 4.5 will

02:05

be coming before GPT 5 well not quite

02:08

how about this MIT technology review

02:11

interview conducted with Sam Altman in the

02:13

last few days in a private discussion

02:15

Sam Altman was asked if he knew when the

02:17

next version of GPT is slated to be

02:19

released and he said calmly yes now

02:22

think about it if the model had months

02:24

and months more of Uncertain safety

02:26

testing you couldn't be that confident

02:28

about a release date think about what

02:30

happened to Google Gemini Ultra which

02:32

was delayed and delayed and delayed that

02:34

again points to a more imminent release

02:37

then another bit of secondhand evidence

02:39

this time from an AI Insider on patreon

02:43

we have a wonderful Discord and this

02:45

Insider at a Stanford event put a

02:47

question directly to Sam Altman very

02:49

recently this was a different Stanford

02:51

event to the one I'm about to also quote

02:54

from and in this response Sam Altman

02:55

confirmed that he's personally using the

02:58

unreleased version of their new model

03:00

but enough of secondhand sources what

03:02

about another direct quote from Sam

03:04

Altman well here's some more evidence

03:05

released yesterday that rather than drop

03:08

a bombshell GPT 5 on us which I predict

03:10

to come somewhere between November and

03:12

January they're going to give us an

03:14

iterative GPT 4.5 first he doesn't want

03:17

to surprise us it does kind of suck to

03:20

ship a product that you're embarrassed

03:22

about but it's much better than the

03:24

alternative and in this case in

03:25

particular where I think we really owe

03:27

it to society to deploy iteratively

03:31

one thing we've learned is that AI and

03:32

surprise don't go well together people

03:33

don't want to be surprised people want a

03:35

gradual roll out and the ability to

03:36

influence these systems that's how we're

03:39

going to do it now he might want to tell

03:41

that to OpenAI's recent former head of

03:44

developer relations he now works at

03:46

Google and said something I really

03:48

appreciate about Google's culture is how

03:50

transparent things are 30 days in I feel

03:53

like I have a great understanding of

03:55

where we are going from a model

03:56

perspective having line of sight on this

03:58

makes it so much easier to start

04:00

building compelling developer products

04:02

it almost sounds like the workers at

04:04

open AI often don't have a great

04:07

understanding of where they're going

04:08

from a model perspective now In fairness

04:10

Sam Altman did say that the current GPT 4

04:13

will be significantly dumber than their

04:16

new model ChatGPT is not phenomenal like

04:18

ChatGPT is mildly embarrassing at best GPT

04:22

4 is the dumbest model any of you will

04:24

ever ever have to use again by a lot but

04:27

you know it's like important to ship

04:29

early and often and we believe in

04:30

iterative deployment so agency and

04:32

reasoning focused GPT 4.5 coming soon

04:36

but GPT 5 not until the end of the year

04:39

or early next those are my predictions

04:41

now some people were saying that the

04:43

mystery gpt2 chatbot could be GPT 4.5 it

04:48

was released on a site used to compare

04:51

the different outputs of language models

04:53

and look here is it creating a beautiful

04:56

unicorn which llama 3 couldn't do now I

04:59

frantically got ready a tweet saying that

05:01

super intelligence had arrived but

05:03

quickly had to delete it and not just

05:05

because other people were reporting that

05:07

they couldn't get decent unicorns and

05:10

not just because that exact unicorn

05:12

could be found on the web but the main

05:14

reason was that I was one of the lucky

05:16

ones to get in and test gpt2 chatbot on

05:19

the arena before it was withdrawn I

05:21

could only do eight questions but I gave

05:23

it my standard handcrafted so not on the

05:26

web set of test questions spanning logic

05:29

theory of mind mathematics coding and

05:31

more its performance was pretty much

05:34

identical to GPT 4 Turbo there was one

05:37

question that it would get right more

05:39

often than GPT 4 Turbo but that could

05:41

have been noise so if this was a sneak

05:44

preview of GPT 4.5 I don't think it's

05:48

going to shock and stun the entire

05:50

industry so tempting as it was to bang

05:52

out a video saying that AGI has arrived

05:54

in all caps I resisted the urge since

05:57

then other testers have found broadly

06:00

the same thing on language translation

06:02

the mystery gpt2 chatbot massively

06:05

underperforms Claude Opus and still

06:07

underperforms GPT 4 Turbo on an extended

06:10

test of logic it does about the same as

06:13

Opus and GPT 4 Turbo of course that still

06:16

does leave the possibility that it is an

06:18

open AI model a tiny one and one that

06:20

they might even release open weights

06:23

meaning anyone can use it and in that

06:25

case the impressive thing would be how

06:27

well it's performing despite its size

06:30

well if gpt2 chatbot is a smaller model

06:33

how could it possibly be even vaguely

06:35

competitive the secret sauce is the

06:38

data as James Betker of OpenAI said

06:41

it's not so much about tweaking model

06:42

configurations and hyperparameters nor

06:45

is it really about architecture or

06:47

Optimizer choices behavior is determined

06:50

by your data set it is the data set that

06:53

you are approximating to an incredible

06:55

degree in a later post he referred to

06:58

the flaws of DALL-E 3 and GPT 4 and also

07:01

flaws in video probably referring to the

07:03

at the time unreleased Sora and said

07:05

they arise from a lack of data in a

07:08

specific domain and in a more recent

07:10

post he said that while compute

07:11

efficiency was still super important

07:14

anything can be state-of-the-art with

07:16

enough scale compute and eval hacking

07:19

now we'll get to evaluation and

07:21

Benchmark hacking in just a moment but

07:23

it does seem to me that there are more

07:24

and more hints that you can Brute Force

07:27

performance with enough compute and

07:29

as mentioned a quality data set at least

07:32

to me it seems increasingly clear that

07:34

you can pay your way to top performance

07:37

unless open AI reveal something

07:39

genuinely shocking the performance of

07:41

meta's llama 3 8 billion 70 billion and

07:44

soon 400 billion show that they have

07:46

less of a secret sauce than many people

07:48

had thought and as Mark Zuckerberg

07:50

hinted recently it could just come down

07:52

to which company blinks first who among

07:55

Google meta and Microsoft which provides

07:57

the compute for open AI is willing to

08:00

continue to spend tens or hundreds of

08:03

billions of dollars on new models if the

08:05

secret is simply the data set that would

08:07

make less and less sense you over the

08:10

last few years I think there was this

08:11

issue of um GPU production yeah right so

08:15

even companies that had the money to pay

08:17

for the gpus couldn't necessarily get as

08:19

many as they wanted because there was

08:20

there were all these Supply constraints

08:22

now I think that's sort of getting less

08:25

so now I think you're seeing a bunch of

08:27

companies think about wow we should just

08:30

like really invest a lot of money in

08:32

building out these things and I think

08:33

that will go for um for some period of

08:36

time there is a capital question of like

08:38

okay at what point does it stop being

08:41

worth it to put the capital in but I

08:43

actually think before we hit that you're

08:45

going to run into energy constraints now

08:47

if you're curious about energy and data

08:49

center constraints check out my why does

08:52

OpenAI need a Stargate supercomputer

08:54

video released 4 weeks ago but before we

08:57

leave data centers and data set I must

09:00

draw your attention to this paper

09:02

released in the last 24 hours it's

09:04

actually a brilliant paper from scale AI

09:07

what they did was create a new and

09:09

refined version of a benchmark that's

09:11

used all the time to test the

09:14

mathematical reasoning capabilities of

09:16

AI models and there were at least four

09:18

fascinating findings relevant to all new

09:21

models coming out this year the first

09:23

the context and they worried that many

09:25

of the latest models had seen the

09:27

Benchmark questions in their training

09:29

data that's called contamination because

09:31

of course it contaminates the results on

09:33

the test the original test had 8,000

09:36

questions but what they did was create a

09:38

thousand new questions of similar

09:40

difficulty now if contamination wasn't a

09:42

problem then models should perform just

09:43

as well with the new questions as with

09:46

the old and obviously that didn't happen

09:49

for the Mistral and Phi family of models

09:51

performance notably lagged on the new

09:53

test compared to the old one whereas

09:55

fair's fair for GPT 4 and Claude

09:58

performance was the same or better on

10:00

the new fresh test but here's the thing

10:02

the authors figured out that that wasn't

10:04

just about Which models had seen the

10:06

questions in their training data they

10:08

say that Mistral Large which performed

10:10

exactly the same was just as likely to

10:12

have seen those questions as Mixtral

10:14

instruct which way underperformed so

10:17

what could explain the difference well

10:19

the bigger models generalize even if

10:21

they have seen the questions they learn

10:23

more from them and can generalize to new

10:25

questions and here's another supporting

10:27

quote they lean toward the

10:29

hypothesis that sufficiently strong large

10:31

language models learn Elementary

10:33

reasoning ability during training you

10:35

could almost say that benchmarks get

10:37

more reliable when you're talking about

10:39

the very biggest models next and this

10:41

seems to be a running theme in popular

10:43

ML benchmarks the GSM8K designed for

10:46

high schoolers has a few errors they

10:49

didn't say how many but the answers were

10:51

supposed to be positive integers and

10:53

they weren't the new Benchmark however

10:55

passed through three layers of quality

10:57

checks third they provide extra theories

11:00

as to why models might overperform on

11:02

benchmarks compared to the real world

11:04

that's not just about data contamination

11:06

it could be that model Builders designed

11:08

data sets that are similar to test

11:11

questions after all if you were trying

11:13

to bake in reasoning to your model what

11:15

kind of data would you collect plenty of

11:17

exams and textbooks so the more similar

11:19

their data set is in nature not just

11:22

exact match to benchmarks the more your

11:25

benchmark performance will be elevated

11:27

compared to simple real world use think

11:29

about it it could be an inadvertent

11:31

thing where enhancing the overall

11:33

smartness of the model comes at the cost

11:36

of overperforming on benchmarks and

11:38

whatever you think about benchmarks that

11:40

does seem to work here Sébastien Bubeck

11:43

lead author of the Phi series of models

11:45

I've interviewed him for AI insiders and

11:47

he said this even on those 1,000

11:51

never-before-seen questions Phi-3 mini which is only 3.8

11:55

billion parameters performed within

11:57

about 8 or 9% of GPT 4 Turbo now we

12:00

don't know the parameter count of GPT 4

12:02

turbo but it's almost certainly orders

12:04

of magnitude bigger so training on high

12:06

quality data as we have seen definitely

12:08

works even if it slightly skews

12:10

Benchmark performance but one final

12:12

observation from me about this paper I

12:15

read almost all the examples that the

12:17

paper gave from this new Benchmark and

12:19

as the paper mentions they involve basic

12:21

addition subtraction multiplication and

12:24

division after all the original test was

12:26

designed for youngsters you can pause

12:28

and try the questions yourself but

12:29

despite them being lots of words they

12:31

aren't actually hard at all so my

12:33

question is this why are models like

12:35

Claude 3 Opus still getting any of these

12:38

questions wrong remember they're scoring

12:40

around 60% in graduate level expert

12:43

reasoning the GPQA if Claude 3 Opus for

12:46

example can get questions right that

12:49

phds struggle to get right with Google

12:52

and 30 minutes why on Earth with five

12:55

short examples can they not get these

12:57

basic high school questions right either

13:00

there are still flaws in the test or

13:02

these models do have a limit in terms of

13:04

how much they can generalize now if you

13:06

like this kind of analysis feel free to

13:08

sign up to my completely free newsletter

13:11

it's called signal to noise and the link

13:13

is in the description and if you want to

13:16

chat in person about it the regional

13:18

networking on the AI insiders Discord

13:21

server is popping off there are meetings

13:24

being arranged not only in London but

13:26

Germany the Midwest Ireland San

13:29

Francisco Madrid Brazil and it goes on

13:32

and on honestly I've been surprised and

13:34

honored by the number of spontaneous

13:36

meetings being arranged across the world

13:39

but it's time arguably for the most

13:41

exciting development of the week Med

13:43

Gemini from Google it's a 58 page paper

13:47

but the tldr is this the latest series

13:50

of Gemini models from Google are more

13:53

than competitive with doctors at

13:55

providing medical answers and even in

13:57

areas where they can't quite perform

13:59

like in surgery they can be amazing

14:02

assistants in a world in which millions

14:04

of people die due to medical errors this

14:07

could be a tremendous breakthrough Med

14:09

Gemini contains a number of innovations

14:12

it wasn't just rerunning the same test

14:14

on a new model for example you can

14:16

inspect how confident a model is in its

14:18

answer by trawling through the raw

14:21

outputs of a model called the logits you

14:23

could see how high probability they find

14:25

their answers if they gave confident

14:27

answers you would submit that as the

14:29

answer they use this technique by the

14:30

way for the original Gemini launch where

14:32

they claimed to beat GPT-4 but that's

14:34

another story anyway if the model is not

14:36

confident you can get the model to

14:38

generate search queries to resolve those

14:41

conflicts train it in other words to use

14:43

Google seems appropriate then you can

14:45

feed that additional context provided by

14:47

the web back into the model to see if

14:49

it's confident now but that was just one

14:51

innovation what about this fine-tuning

14:53

loop to oversimplify they get the model

14:56

to output answers again using the help

14:58

of search and then the outputs that had

15:00

correct answers were used to fine-tune

15:03

the models now that's not perfect of

15:05

course because sometimes you can get the

15:06

right answer with the wrong logic but it

15:08

worked up to a certain point at least

15:10

just last week by the way on Patreon I

15:12

described how this reinforced in-context

15:15

learning can be applied to multiple

15:17

domains other innovations come from the

15:19

incredible long context abilities of the

15:22

Gemini 1.5 series of models with that

15:24

family of models you can trawl through a

15:27

700,000 word electronic health record

15:31

now imagine a human doctor trying to do

15:33

the same thing I remember on the night

15:35

of Gemini 1.5's release calling it the

15:38

biggest news of the day even more

15:39

significant than Sora and I still stand

15:41

by that so what were the results well of

15:44

course a state-of-the-art performance on

15:46

MedQA that assesses your ability to

15:49

diagnose diseases the doctor pass rate

15:51

by the way is around 60% and how about

15:54

this for a mini theme of the video when

15:56

they carefully analyze the questions in

15:58

the benchmark they found that 7.4% of

16:02

the questions have quality issues things

16:04

like lacking key information incorrect

16:06

answers or multiple plausible

16:09

interpretations so just in this video

16:11

alone we've seen multiple benchmark

16:13

issues and I collected a thread of other

16:16

benchmark issues on Twitter the positive

16:18

news though is just how good these

16:20

models are getting at things like

16:21

medical note summarization and clinical

16:24

referral letter generation but I don't

16:26

want to detract from the headline which

16:28

is just how good these models are

16:29

getting at diagnosis here you can see

16:32

Med Gemini with search way outperforming

16:35

expert clinicians with search by the way

16:37

when errors from the test were taken out

16:40

performance bumped up to around 93% and

16:43

the authors can't wait to augment their

16:45

models with additional data things like

16:47

data from consumer wearables genomic

16:49

information nutritional data and

16:51

environmental factors and as a quite

16:53

amusing aside it seems like Google and

16:56

Microsoft are in a tussle to throw shade

16:59

at each other's methods in a positive

17:01

spirit Google contrast their approach to

17:03

Medprompt from Microsoft saying that

17:05

their approach is principled and it can

17:07

be easily extended to more complex

17:09

scenarios beyond MedQA now you might

17:11

say that's harsh but Microsoft earlier

17:13

had said that their Medprompt approach

17:15

shows GPT-4's ability to outperform

17:18

Google's model that was fine-tuned

17:20

specifically for medical applications it

17:22

outperforms on the same benchmarks by a

17:25

significant margin well Google have

17:27

obviously one-upped them by reaching new

17:29

state-of-the-art performances on 10 of

17:31

14 benchmarks Microsoft had also said

17:34

that their approach has simple prompting

17:36

and doesn't need more sophisticated and

17:38

expensive methods Google shot back

17:40

saying they don't need complex

17:41

specialized prompting and their approach

17:43

is best honestly this is competition

17:46

that I would encourage may they long

17:48

compete for glory in this medical arena

17:51

in case you're wondering because Gemini

17:53

is a multimodal model it can see images

17:56

too it can interact with patients and

17:58

ask them to provide images the model

18:00

can also interact with primary care

18:02

physicians and ask for things like

18:03

X-rays and most surprisingly to me it

18:06

can also interact with surgeons to help

18:09

boost performance yes that's video

18:11

assistance during live surgery of course

18:13

they haven't yet deployed this for

18:15

ethical and safety reasons but Gemini is

18:18

already capable of assessing a video

18:20

scene and helping with surgery for

18:23

example by answering whether the

18:24

critical view of safety criteria are

18:27

being met do you have a great view of

18:29

the gallbladder for example and Med

18:31

Gemini could potentially guide surgeons

18:33

in real time during these complex

18:35

procedures for not only improved

18:36

accuracy but patient outcomes notice the

18:40

nuance of the response from Gemini oh

18:42

the lower third of the gallbladder is

18:44

not dissected off the cystic plate and

18:47

the authors list many improvements that

18:49

they could have made they just didn't

18:50

have time to make for example the models

18:53

were searching the wild web and one

18:55

option would be to restrict the search

18:58

results to just more authoritative

19:00

medical sources the catch though is that

19:02

this model is not open-sourced and isn't

19:05

widely available due to they say the

19:07

safety implications of unmonitored use I

19:10

suspect the commercial implications of

19:12

open sourcing Gemini also had something

19:14

to do with it but here's the question I

19:16

would set for you we know that hundreds

19:18

of thousands or even millions of people

19:20

die due to medical mistakes around the

19:22

world so if and when Med Gemini 2 3 or 4

19:26

becomes unambiguously better than all

19:28

clinicians at diagnosing diseases then

19:31

at what point is it unethical not to at

19:34

least deploy them in assisting

19:36

clinicians that's definitely something

19:37

at least to think about overall this is

19:40

exciting and excellent work so many

19:42

congratulations to the team and in a

19:44

world that is seeing some stark misuses

19:47

of AI as well as increasingly autonomous

19:51

deployment of AI like this autonomous

19:53

tank the fact that we can get

19:55

breakthroughs like this is genuinely

19:58

uplifting thank you so much for watching

20:00

to the end and have a wonderful day
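
For anyone who wants the confidence-gated search idea from the Med Gemini segment (roughly 14:14 to 14:49) in concrete form, here is a minimal Python sketch. It only illustrates the loop as described in the video: turn the model's raw scores (logits) for each answer option into probabilities, accept the top answer if it is confident enough, and otherwise fetch search results and ask again with that extra context. It is not the paper's implementation. The names `score_options` and `search`, the 0.8 threshold, and the cap of two searches are all hypothetical stand-ins chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not a real API):
#   Scorer:   (question, context snippets) -> raw score (logit) per answer option
#   Searcher: (question) -> list of text snippets from a web search
Scorer = Callable[[str, List[str]], Dict[str, float]]
Searcher = Callable[[str], List[str]]


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Turn raw per-option scores (logits) into probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {opt: math.exp(val - peak) for opt, val in logits.items()}
    total = sum(exps.values())
    return {opt: val / total for opt, val in exps.items()}


def answer_with_search(
    question: str,
    score_options: Scorer,
    search: Searcher,
    threshold: float = 0.8,   # assumed confidence cutoff
    max_searches: int = 2,    # assumed cap on search rounds
) -> Tuple[str, float, int]:
    """Return (answer, confidence, number of searches used)."""
    context: List[str] = []
    searches = 0
    while True:
        probs = softmax(score_options(question, context))
        best = max(probs, key=probs.get)
        # Confident enough, or out of search budget: submit the top answer.
        if probs[best] >= threshold or searches >= max_searches:
            return best, probs[best], searches
        # Not confident: pull in extra context from search and ask again.
        context.extend(search(question))
        searches += 1
```

In use, `score_options` would wrap whatever model exposes per-option scores and `search` would wrap a search backend. Restricting `search` to authoritative medical sources, one of the improvements the authors mention, would only change that one function.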