Simple Introduction to Large Language Models (LLMs)

Matthew Berman
7 Mar 2024 · 25:19

Summary

TLDR: This video takes a deep look at how large language models (LLMs) work, their history, their applications, and the challenges they face. It traces the evolution of LLMs from the 1966 ELIZA model through BERT in 2018 to GPT-4 in 2023. It walks through the three main steps of how LLMs work (tokenization, embeddings, and transformers) and explains how language is understood through vectorization and the attention mechanism. It also covers the training pipeline of data collection, preprocessing, weight adjustment, and evaluation, as well as fine-tuning, which lets a pre-trained model be optimized for a specific use case. The video points out the limitations of LLMs, including weaknesses in math and logic, bias and safety concerns, heavy hardware requirements, and potential ethical issues. Finally, it explores real-world applications such as language translation, programming assistance, and text generation, and looks ahead to future directions including knowledge distillation, retrieval-augmented generation, multimodal input, and improved reasoning.

Takeaways

  • 📚 Large language models (LLMs) are neural networks trained on massive amounts of text data that can understand and generate natural language.
  • 🤖 Neural networks simulate how the human brain works, using algorithms to recognize patterns in data; LLMs focus specifically on understanding natural language.
  • 🚀 Compared with traditional programming, LLMs take a more flexible approach: you teach the computer how to learn to do things rather than giving it explicit instructions.
  • 🖼️ This example-driven approach shows strong capability in applications like image recognition, learning from examples and inferring what new cases look like.
  • ⏱️ Training an LLM involves data collection, preprocessing, training, and evaluation, and requires huge amounts of data and compute.
  • 🔍 LLMs use an attention mechanism to understand the contextual relationships between words in a sentence.
  • 📈 Training involves adjusting an enormous number of parameters, iteratively optimizing the model and evaluating it with metrics such as perplexity.
  • 🛠️ Fine-tuning lets developers take a pre-trained model and optimize it for a specific use case, improving accuracy and efficiency.
  • 🤖 LLMs have limitations, with ongoing challenges in math and logic, bias, and safety.
  • 🌐 Real-world applications are extremely broad, including language translation, programming assistance, text generation, and question answering.
  • 📉 Knowledge distillation is a technique for transferring the knowledge of a large model into a smaller, more efficient one.
  • 🧐 Ethical considerations include copyright, the risk of models being used for harmful purposes, and the impact on the labor market.

Q & A

  • What is a large language model (LLM)?

    -A large language model (LLM) is a type of neural network trained on massive amounts of text data. They are generally trained on data that can be found online: web scrapes, books, transcripts, anything text-based. LLMs focus on understanding natural language and learn by reading enormous quantities of books, articles, and internet text.

  • How do neural networks work?

    -A neural network is a series of algorithms designed to recognize patterns in data. They work by trying to simulate how the human brain works, and LLMs are a specific type of neural network focused on understanding natural language.

  • How do LLMs differ from traditional programming?

    -Traditional programming is instruction-based: if X then Y, you explicitly tell the computer what to do. With LLMs it is a completely different story: you are not teaching the computer how to do things but how to learn how to do things. This is a much more flexible approach, suited to many applications that traditional programming cannot handle.

  • How does this approach apply to image recognition?

    -With image recognition, traditional programming would require hard-coding every single rule for identifying different letters. With the learning-based approach behind LLMs, you instead give the computer a large number of examples of handwritten letters, and it infers what a new handwritten letter looks like based on all of those examples.

  • What steps does training a large language model involve?

    -Training a large language model involves collecting data, preprocessing it, training, evaluation, and possibly fine-tuning. You first need a huge amount of data, which is then preprocessed. The preprocessed text data is fed into the model for training, and the model's weights are adjusted to optimize its output. Afterwards, a small portion of the data is used to evaluate the model, which is adjusted if necessary.
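
As a rough illustration of the training step described above, here is a minimal next-token-prediction loop in PyTorch. The tiny model, random stand-in "dataset", and hyperparameters are placeholders for illustration only, not the setup of any production LLM, and the causal masking a real LLM would use is omitted for brevity.

```python
# Minimal sketch of the next-word-prediction training loop described above.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(d_model, vocab_size),          # scores for every possible next token
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 33))   # stand-in for preprocessed text
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # learn to predict each next token

for step in range(100):                # real training repeats millions of times
    logits = model(inputs)             # (batch, seq, vocab); a real causal LM would also mask future tokens
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                    # adjust the weights toward better predictions
    optimizer.step()

print("perplexity:", torch.exp(loss).item())     # the evaluation metric mentioned in the video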

  • What is fine-tuning?

    -Fine-tuning starts from a large language model that already has general language capabilities and further adjusts and optimizes it for a specific use case. It is much faster than training a model from scratch and can produce higher accuracy. Fine-tuning lets pre-trained models be optimized for real-world use cases.
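
A minimal sketch of what fine-tuning can look like in code, assuming a generic PyTorch setup: a stand-in "pretrained" model keeps training, at a small learning rate, on a handful of pizza-ordering dialogues like the example in the video. The model, tokenizer, and data below are toy placeholders, not a real foundation model or dataset.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for a real foundation model and its tokenizer.
vocab_size = 1000
pretrained_model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

def tokenize(text):
    # Crude hash-based tokenizer, purely for illustration.
    return torch.tensor([[hash(word) % vocab_size for word in text.split()]])

# A tiny domain dataset, echoing the pizza-ordering example from the video.
pizza_dialogues = [
    "customer: one large pepperoni please shop: sure pickup or delivery?",
    "customer: do you have gluten free crust? shop: yes in medium only",
]

optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=1e-5)  # small LR: nudge, don't overwrite
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                      # far fewer steps than pre-training
    for text in pizza_dialogues:
        tokens = tokenize(text)             # (1, seq) tensor of token ids
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = pretrained_model(inputs)   # same next-token objective as pre-training
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```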

  • What limitations and challenges do large language models have?

    -Large language models have several limitations and challenges, including struggles with math and logical reasoning, bias and safety problems, knowledge cutoff dates, hallucinations (confidently produced misinformation), and high costs driven by how hardware intensive they are. They can also be used for harmful purposes, and their use of copyrighted material raises legal and ethical questions.

  • What real-world applications do large language models have?

    -Large language models can be used for many tasks, including language translation, programming assistance, question answering, essay writing, summarization, and even image and video creation. Almost any thought task a human can do with a computer, a large language model can likely do as well.

  • What is knowledge distillation, and how does it make large language models more practical?

    -Knowledge distillation is a technique that transfers the key knowledge of large, cutting-edge models into smaller, more efficient models. This lets smaller language models benefit from the knowledge gained by large ones while still running efficiently on everyday consumer hardware, making large language models more accessible and practical.
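
Below is a minimal sketch of one common variant of knowledge distillation, soft-target distillation: a small student model is trained to match the softened output distribution of a larger teacher. The tiny models, temperature, and random data are placeholders; the video does not specify which distillation method any particular model uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, T = 1000, 2.0                      # T = temperature that "softens" the targets
teacher = nn.Sequential(nn.Embedding(vocab_size, 256), nn.Linear(256, vocab_size))  # big model
student = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))    # small model
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 16))  # stand-in for real text
for step in range(100):
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(tokens) / T, dim=-1)     # "soft" targets from the teacher
    student_logprobs = F.log_softmax(student(tokens) / T, dim=-1)
    # KL divergence pushes the student's distribution toward the teacher's.
    loss = F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```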

  • What current research and advancements aim to improve large language models?

    -Current research and advancements include self fact-checking, mixture-of-experts models, multimodal input processing, improved reasoning ability, and larger context windows. These techniques aim to improve the efficiency, accuracy, and range of applications of these models.
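
Of the techniques listed, mixture of experts lends itself to a short sketch. The toy routing layer below sends each input through the expert a small gating network scores highest; it is a simplified top-1 version for illustration, not the architecture of GPT-4 or any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_experts = 64, 4
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])
gate = nn.Linear(d_model, num_experts)          # decides which expert handles each token

def moe_layer(x):                               # x: (tokens, d_model)
    scores = F.softmax(gate(x), dim=-1)         # how suited each expert is to each token
    best = scores.argmax(dim=-1)                # top-1 expert per token
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        mask = best == i
        if mask.any():                          # only the chosen expert actually runs
            out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
    return out

print(moe_layer(torch.randn(10, d_model)).shape)   # torch.Size([10, 64])
```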

  • What is a vector database, and what role does it play in LLMs?

    -A vector database is a storage and retrieval mechanism that is highly optimized for vectors, which are just long series of numbers. In LLMs, word embeddings are placed in a vector database so the model can easily see which words are related to each other based on how similar their vectors are, which helps the model predict the next word from the previous words.
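
Here is a minimal in-memory stand-in for the vector-database idea: word embeddings are stored as arrays and the most related words are retrieved by cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions and live in purpose-built databases.

```python
import numpy as np

embeddings = {
    "book":  np.array([0.9, 0.1, 0.0]),
    "worm":  np.array([0.8, 0.2, 0.1]),   # "book" and "worm" often appear together (bookworm)
    "pizza": np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word, k=2):
    """Return the k stored words most similar to `word`."""
    sims = {w: cosine(embeddings[word], v) for w, v in embeddings.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("book"))   # ['worm', 'pizza']: 'worm' is the closer neighbor
```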

  • How does the Transformers architecture help LLMs understand the context of words in a sentence?

    -The Transformers architecture uses an attention mechanism to understand the context of words within a sentence. It involves calculations with the dot product, a number representing how much a word contributes to the sentence. The model finds the differences between the dot products of words and assigns correspondingly large attention values, so a word with higher attention is taken into account more.
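
A minimal numerical sketch of the dot-product attention calculation described above: each word's query is dotted with every word's key, the scores are normalized with a softmax into attention weights, and words with higher weights contribute more to the output. The random projection matrices stand in for weights a real model learns during training.

```python
import numpy as np

np.random.seed(0)
seq_len, d = 4, 8                       # 4 tokens, e.g. "what is the tallest"
x = np.random.randn(seq_len, d)         # token embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)                                           # dot products between words
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax into attention
output = weights @ V                                                    # context-aware representations

print(weights.round(2))   # row i: how much word i attends to every other word
```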

Outlines

00:00

🤖 Introduction to AI and Large Language Models

This segment introduces the basics of artificial intelligence and large language models (LLMs), explaining that LLMs are neural networks trained on massive amounts of text data. It notes how AI has changed the world over the past year and the potential applications of LLMs across industries. The video goes on to explore how LLMs work, ethical questions, their evolution, and applications, and mentions a collaboration with AI Camp, a program that teaches high school students about artificial intelligence.

05:02

📚 History and Evolution of LLMs

This part digs into the history of LLMs, starting with the 1966 ELIZA model, through the arrival of the Transformers architecture, to the development of the GPT series. It discusses the growth in scale, such as GPT-3's 175 billion parameters, and the improvements in how LLMs handle natural language. It also introduces how LLMs work, including tokenization, embedding vectors, and processing with Transformers.

10:04

🧠 Understanding How LLMs Work

This segment explains the LLM pipeline in detail: tokenization, embedding vectors, and the mechanics of Transformers. It discusses how a vector database of word embeddings helps the model understand relationships between words and capture semantic meaning through vector representations. It also covers how Transformers use the multi-head attention algorithm to process the input matrix and adjust the output matrix according to each word's contribution, ultimately generating natural language.

15:05

🏋️‍♂️ Training and Optimizing LLMs

This part covers the training process: data collection, preprocessing, adjusting the model, and evaluation. It stresses the importance of high-quality datasets and the hardware requirements and cost of training. It also introduces fine-tuning, adapting a pre-trained model to a specific use case, and describes how the AI Camp program collaborated with students to create this content.

20:06

🚧 Limitations and Challenges of LLMs

This segment explores the limitations and challenges of LLMs, including weaknesses in math, logic, and reasoning, along with bias and safety concerns. It discusses how LLMs absorb human opinions present in their training data, and the problems of misinformation and overconfident statements. It also touches on hardware requirements, ethical issues, and the potential impact on the labor market.

25:08

🌐 Real-World Applications and Future Outlook

This part discusses the wide range of real-world applications of LLMs, including language translation, programming assistance, summarization, question answering, and creative work. It also covers current research and progress such as knowledge distillation, retrieval-augmented generation, and multimodality. Finally, it explores ethical considerations, including copyright, potential misuse, and the future development of AI.

🎓 Closing Remarks and AI Camp

The final part of the video encourages viewers to learn about AI Camp and provides related information. The creator also points to his other AI videos for viewers who want to go deeper.

Keywords

💡Large Language Models (LLMs)

Large language models are neural networks trained on massive amounts of text data that can understand and generate natural language. In the video, LLMs are the central topic because they are changing how we interact with technology and represent a major advance in the AI field. For example, the video mentions the GPT-3 model, which has 175 billion parameters and understands natural language with accuracy far beyond its predecessors.

💡Neural Networks

Neural networks are a series of algorithms that try to recognize patterns in data, simulating how the human brain works. In the video, neural networks are the foundational technology behind large language models, enabling machines to understand and generate language by learning from large amounts of text data.

💡Transformers

Transformers are an advanced neural network architecture that understands the contextual relationships between words in a sentence through an attention mechanism. The video describes the Transformers architecture as a major milestone in the development of large language models, allowing models to train more efficiently and adding features such as self-attention.

💡Tokenization

Tokenization is the first step in how a large language model processes text data: splitting long text into individual tokens. In the video, tokenization is the basis for understanding how the model reads text word by word, which is essential for capturing the semantic meaning of words and the relationships between them.
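
For a concrete feel, here is a short example using the tiktoken library, one tokenizer among many; exact splits differ from model to model, as the video notes, and this particular tokenizer is an assumption rather than the one used by any model discussed in the video.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")          # one publicly available tokenizer
tokens = enc.encode("What is the tallest building?")
print(tokens)                                       # a list of integer token ids
print([enc.decode([t]) for t in tokens])            # pieces like 'What', ' is', ' the', ... (splits vary)
```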

💡Embeddings

Embedding is the process of turning tokens into numerical vectors, which makes it much easier for the computer to read and understand each word and how different words relate to each other. The video explains how embeddings help the model capture the semantics of words through numerical representations and predict the next word in a sentence.
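
A toy illustration of the embedding step: each token id indexes a row of a lookup table and comes back as a vector of numbers. The table below is randomly initialized; a real model learns these values during training.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 8
embedding = nn.Embedding(vocab_size, dim)       # the lookup table of vectors

token_ids = torch.tensor([17, 42, 905])         # three token ids from the tokenizer
vectors = embedding(token_ids)
print(vectors.shape)                            # torch.Size([3, 8]): one vector per token
```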

💡Multimodality

Multimodality means a model can process and integrate information from different input sources such as voice, images, and video. The video mentions multimodality as a direction of future research for large language models, one that will let them understand and respond to complex queries more completely.

💡Fine-tuning

Fine-tuning means further training a large language model for a specific use case to improve its performance on that task. The video stresses its importance: it lets developers take a pre-trained model and optimize it for a particular task, such as handling pizza orders.

💡Knowledge Distillation

Knowledge distillation is a technique that transfers the knowledge of a large, complex model into a smaller, more efficient one. The video notes that this lets small language models remain efficient while still benefiting from what the large model has learned.

💡Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation is a technique that lets a large language model query external information when generating an answer. The video mentions RAG as a current research focus, giving models access to large amounts of information beyond their training data.
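
A minimal sketch of the RAG flow under stated assumptions: the question is embedded, the most similar stored documents are retrieved, and they are prepended to the prompt. The embed function here is a crude character-frequency stand-in for a real embedding model, and the final prompt is printed rather than sent to an actual LLM.

```python
import numpy as np

documents = [
    "Our store opens at 9am and closes at 9pm.",
    "Large pizzas have 8 slices; medium pizzas have 6.",
    "We deliver within a 5 mile radius.",
]

def embed(text):
    # Stand-in embedding: character-frequency vector (a real system uses a learned model).
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

doc_vectors = np.stack([embed(d) for d in documents])   # the "vector database"

def retrieve(question, k=1):
    sims = doc_vectors @ embed(question)                # cosine similarity (vectors are unit length)
    return [documents[i] for i in np.argsort(sims)[::-1][:k]]

question = "How many slices are in a large pizza?"
context = "\n".join(retrieve(question))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(prompt)                                           # this augmented prompt would go to the LLM
```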

💡Bias and Safety

Bias and safety are among the main challenges facing large language models. The video discusses how models can learn human biases from their training data and how to ensure they are used safely and ethically, including concerns about harmful behavior the models might enable and about the right to use copyrighted material.

💡AI Camp

AI Camp is an educational program for students aged 13 and up, where students learn about artificial intelligence in small personalized groups with experienced mentors. The video was created in collaboration with AI Camp, underlining the importance of education in the AI field.

Highlights

The video promises to take you from knowing nothing about AI and large language models to having a solid foundation.

Large language models (LLMs) are neural networks trained on massive amounts of text data, modeled on how the human brain works.

LLMs differ from traditional programming: they are more flexible, learning how to learn rather than just executing instructions.

In tasks like image recognition, this approach shows far more flexibility and adaptability than traditional programming.

LLMs excel at text generation, creative writing, question answering, programming, and more.

The history of large language models runs from the 1966 ELIZA model to today's GPT-4.

The arrival of the Transformers architecture greatly accelerated LLM development, reducing training time and improving performance.

GPT-3 was released in 2020 with 175 billion parameters, the point at which the public started noticing large language models.

Training a large language model involves four main steps: data collection, preprocessing, training, and evaluation.

Training large language models demands enormous processing power and electricity, making it extremely expensive.

Fine-tuning lets developers adapt pre-trained models to specific use cases, improving accuracy and efficiency.

AI Camp is an AI learning program for students aged 13 and up, teaching NLP, computer vision, and data science through hands-on projects.

As capable as LLMs are, they still have limitations in math, logic, and reasoning.

Large language models can carry human biases and can be used for harmful acts, such as creating misinformation.

Knowledge distillation is a technique for transferring key knowledge from large models into smaller, more efficient ones.

Retrieval-augmented generation (RAG) lets large language models query large amounts of data beyond what they were trained on.

Future improvements to large language models include self fact-checking, mixture of experts, multimodal input, and better reasoning.

Ethical considerations for large language models include copyright, potential harmful uses, impact on jobs, and alignment with human goals.

Transcripts

00:00

this video is going to give you

00:01

everything you need to go from knowing

00:03

absolutely nothing about artificial

00:05

intelligence and large language models

00:07

to having a solid foundation of how

00:10

these revolutionary Technologies work

00:12

over the past year artificial

00:14

intelligence has completely changed the

00:16

world with products like ChatGPT

00:18

potentially upending every single

00:20

industry and how people interact with

00:23

technology in general and in this video

00:25

I will be focusing on llms how they work

00:29

ethical considerations applications and

00:32

so much more and this video was created

00:34

in collaboration with an incredible

00:36

program called AI camp in which high

00:39

school students learn all about

00:40

artificial intelligence and I'll talk

00:42

more about that later in the video let's

00:44

go so first what is an llm is it

00:48

different from Ai and how is chat GPT

00:50

related to all of this llms stand for

00:54

large language models which is a type of

00:56

neural network that's trained on massive

00:58

amounts of text data it's generally

01:01

trained on data that can be found online

01:04

everything from web scraping to books to

01:06

transcripts anything that is text based

01:08

can be trained into a large language

01:10

model and taking a step back what is a

01:13

neural network a neural network is

01:15

essentially a series of algorithms that

01:17

try to recognize patterns in data and

01:20

really what they're trying to do is

01:21

simulate how the human brain works and

01:23

llms are a specific type of neural

01:26

network that focus on understanding

01:28

natural language and as mentioned llms

01:31

learn by reading tons of books articles

01:34

internet texts and there's really no

01:36

limitation there and so how do llms

01:38

differ from traditional programming well

01:41

with traditional programming it's

01:43

instruction based which means if x then

01:46

why you're explicitly telling the

01:48

computer what to do you're giving it a

01:50

set of instructions to execute but with

01:53

llms it's a completely different story

01:55

you're teaching the computer not how to

01:57

do things but how to learn how to do

01:59

things things and this is a much more

02:01

flexible approach and is really good for

02:04

a lot of different applications where

02:06

previously traditional coding could not

02:09

accomplish them so one example

02:11

application is image recognition with

02:13

image recognition traditional

02:15

programming would require you to

02:17

hardcode every single rule for how to

02:21

let's say identify different letters so

02:24

a b c d but if you're handwriting these

02:27

letters everybody's handwritten letters

02:29

look different so how do you use

02:30

traditional programming to identify

02:33

every single possible variation well

02:35

that's where this AI approach comes in

02:37

instead of giving a computer explicit

02:39

instructions for how to identify a

02:41

handwritten letter you instead give it a

02:43

bunch of examples of what handwritten

02:46

letters look like and then it can infer

02:48

what a new handwritten letter looks like

02:50

based on all of the examples that it has

02:53

what also sets machine learning and

02:55

large language models apart and this new

02:56

approach to programming is that they are

02:59

much more flexible much more

03:01

adaptable meaning they can learn from

03:03

their mistakes and inaccuracies and are

03:05

thus so much more scalable than

03:07

traditional programming llms are

03:10

incredibly powerful at a wide range of

03:12

tasks including summarization text

03:15

generation creative writing question and

03:17

answer programming and if you've watched

03:20

any of my videos you know how powerful

03:23

these large language models can be and

03:25

they're only getting better know that

03:27

right now large language models and AI in

03:30

general are the worst they'll ever be

03:32

and as we're generating more data on the

03:34

internet and as we use synthetic data

03:36

which means data created by other large

03:38

language models these models are going

03:40

to get better rapidly and it's super

03:43

exciting to think about what the future

03:44

holds now let's talk a little bit about

03:46

the history and evolution of large

03:48

language models we're going to cover

03:49

just a few of the large language models

03:51

today in this section the history of

03:53

llms traces all the way back to the

03:55

Eliza model which was from

03:57

1966 which was really the first

03:59

language model it had pre-programmed

04:02

answers based on keywords it had a very

04:05

limited understanding of the English

04:06

language and like many early language

04:09

models you started to see holes in its

04:10

logic after a few back and forth in a

04:12

conversation and then after that

04:14

language models really didn't evolve for

04:16

a very long time although technically

04:18

the first recurrent neural network was

04:20

created in 1924 or RNN they weren't

04:23

really able to learn until 1972 and

04:26

these new learning language models are a

04:28

series of neural networks with layers

04:31

and weights and a whole bunch of stuff

04:33

that I'm not going to get into in this

04:35

video and rnns were really the first

04:38

technology that was able to predict the

04:40

next word in a sentence rather than

04:42

having everything pre-programmed for it

04:44

and that was really the basis for how

04:47

current large language models work and

04:49

even after this and the Advent of deep

04:51

learning in the early 2000s the field of

04:53

AI evolved very slowly with language

04:56

models far behind what we see today this

04:59

all changed in 2017 where the Google

05:02

Deep Mind team released a research paper

05:04

about a new technology called

05:06

Transformers and this paper was called

05:09

attention is all you need and a quick

05:11

side note I don't think Google even knew

05:13

quite what they had published at that

05:15

time but that same paper is what led

05:17

open AI to develop chat GPT so obviously

05:21

other computer scientists saw the

05:23

potential for the Transformers

05:24

architecture with this new Transformers

05:27

architecture it was far more advanced it

05:29

required decreased training time and it

05:31

had many other features like self

05:33

attention which I'll cover later in this

05:34

video Transformers allowed for

05:36

pre-trained large language models like

05:38

gpt1 which was developed by open AI in

05:41

2018 it had 117 million parameters and

05:45

it was completely revolutionary but soon

05:47

to be outclassed by other llms then

05:50

after that BERT was released in

05:53

2018 that had 340 million parameters and

05:57

had bidirectionality which means it

05:59

had the ability to process text in both

06:01

directions which helped it have a better

06:04

understanding of context and as

06:06

comparison a unidirectional model only

06:09

has an understanding of the words that

06:10

came before the target text and after

06:13

this llms didn't develop a lot of new

06:16

technology but they did increase greatly

06:18

in scale gpt2 was released in early 2019

06:21

and had 1.5 billion parameters then GPT

06:25

3 in June of 2020 with 175 billion

06:29

parameters

06:29

and it was at this point that the public

06:31

started noticing large language models

06:33

GPT had a much better understanding of

06:36

natural language than any of its

06:38

predecessors and this is the type of

06:40

model that powers chat GPT which is

06:42

probably the model that you're most

06:43

familiar with and chat GPT became so

06:46

popular because it was so much more

06:48

accurate than anything anyone had ever

06:50

seen before and it was really because of

06:52

its size and because it was now built

06:54

into this chatbot format anybody could

06:57

jump in and really understand how to

06:59

interact with this model ChatGPT

07:00

3.5 came out in December of 2022 and

07:03

started this current wave of AI that we

07:06

see today then in March 2023 GPT 4 was

07:09

released and it was incredible and still

07:12

is incredible to this day it had a

07:14

whopping reported 1.76 trillion

07:18

parameters and uses likely a mixture of

07:21

experts approach which means it has

07:23

multiple models that are all fine-tuned

07:25

for specific use cases and then when

07:27

somebody asks a question to it it

07:29

chooses which of those models to use and

07:31

then they added multimodality and a

07:33

bunch of other features and that brings

07:35

us to where we are today all right now

07:37

let's talk about how llms actually work

07:39

in a little bit more detail the process

07:41

of how large language models work can be

07:43

split into three steps the first of

07:46

these steps is called tokenization and

07:48

there are neural networks that are

07:50

trained to split long text into

07:52

individual tokens and a token is

07:55

essentially about 3/4 of a word so if

07:58

it's a shorter word like high or that or

08:01

there it's probably just one token but

08:03

if you have a longer word like

08:05

summarization it's going to be split

08:07

into multiple pieces and the way that

08:09

tokenization happens is actually

08:11

different for every model some of them

08:12

separate prefixes and suffixes let's

08:15

look at an example what is the tallest

08:17

building so what is the tallest building

08:22

are all separate tokens and so that

08:24

separates the suffix off of tallest but

08:26

not building because it is taking the

08:28

context into account and this step is

08:30

done so models can understand each word

08:33

individually just like humans we

08:35

understand each word individually and as

08:37

groupings of words and then the second

08:39

step of llms is something called

08:41

embeddings the large language models

08:43

turns those tokens into embedding

08:45

vectors turning those tokens into

08:47

essentially a bunch of numerical

08:49

representations of those tokens numbers

08:52

and this makes it significantly easier

08:54

for the computer to read and understand

08:56

each word and how the different words

08:58

relate to each other and these numbers

09:00

all correspond with the position in an

09:02

embeddings Vector database and then the

09:04

final step in the process is

09:06

Transformers which we'll get to in a

09:08

little bit but first let's talk about

09:10

Vector databases and I'm going to use

09:11

the terms word and token interchangeably

09:14

so just keep that in mind because

09:15

they're almost the same thing not quite

09:17

but almost and so these word embeddings

09:20

that I've been talking about are placed

09:22

into something called a vector database

09:24

these databases are storage and

09:25

retrieval mechanisms that are highly

09:28

optimized for vectors and again those

09:30

are just numbers long series of numbers

09:32

because they're converted into these

09:34

vectors they can easily see which words

09:36

are related to other words based on how

09:39

similar they are how close they are

09:41

based on their embeddings and that is

09:43

how the large language model is able to

09:45

predict the next word based on the

09:47

previous words Vector databases capture

09:49

the relationship between data as vectors

09:52

in multidimensional space I know that

09:54

sounds complicated but it's really just

09:56

a lot of numbers vectors are objects

09:59

with a magnitude and a direction which

10:01

both influence how similar one vector is

10:04

to another and that is how llms

10:06

represent words based on those numbers

10:08

each word gets turned into a vector

10:10

capturing semantic meaning and its

10:13

relationship to other words so here's an

10:15

example the words book and worm which

10:18

independently might not look like

10:20

they're related to each other but they

10:21

are related Concepts because they

10:23

frequently appear together a bookworm

10:26

somebody who likes to read a lot and

10:27

because of that they will have

10:29

embeddings that look close to each other

10:31

and so models build up an understanding

10:33

of natural language using these

10:34

embeddings and looking for similarity of

10:36

different words terms groupings of words

10:39

and all of these nuanced relationships

10:41

and the vector format helps models

10:43

understand natural language better than

10:45

other formats and you can kind of think

10:47

of all this like a map if you have a map

10:49

with two landmarks that are close to

10:51

each other they're likely going to have

10:53

very similar coordinates so it's kind of

10:55

like that okay now let's talk about

10:57

Transformers Matrix representations

11:00

can be made out of those vectors that we

11:02

were just talking about this is done by

11:04

extracting some information out of the

11:06

numbers and placing all of the

11:08

information into a matrix through an

11:10

algorithm called multihead attention the

11:13

output of the multi-head attention

11:15

algorithm is a set of numbers which

11:17

tells the model how much the words and

11:20

its order are contributing to the

11:22

sentence as a whole we transform the

11:25

input Matrix into an output Matrix which

11:28

will then correspond with a word having

11:31

the same values as that output Matrix so

11:33

basically we're taking that input Matrix

11:35

converting it into an output Matrix and

11:38

then converting it into natural language

11:40

and the word is the final output of this

11:42

whole process this transformation is

11:44

done by the algorithm that was created

11:46

during the training process so the

11:48

model's understanding of how to do this

11:50

transformation is based on all of its

11:52

knowledge that it was trained with all

11:54

of that text Data from the internet from

11:56

books from articles Etc and it learned

11:58

which sequences of words go together

12:00

and their corresponding next words based

12:02

on the weights determined during

12:04

training Transformers use an attention

12:06

mechanism to understand the context of

12:09

words within a sentence it involves

12:11

calculations with the dot product which

12:13

is essentially a number representing how

12:15

much the word contributed to the

12:17

sentence it will find the difference

12:19

between the dot products of words and

12:21

give it correspondingly large values for

12:24

attention and it will take that word

12:26

into account more if it has higher

12:28

attention now let's talk about how

12:29

large language models actually get

12:31

trained the first step of training a

12:33

large language model is collecting the

12:35

data you need a lot of data when I say

12:38

billions of parameters that is just a

12:41

measure of how much data is actually

12:43

going into training these models and you

12:45

need to find a really good data set if

12:47

you have really bad data going into a

12:49

model then you're going to have a really

12:51

bad model garbage in garbage out so if a

12:54

data set is incomplete or biased the

12:56

large language model will be also and

12:58

data sets are huge we're talking about

13:01

massive massive amounts of data they

13:03

take data in from web pages from books

13:06

from conversations from Reddit posts

13:08

from X posts from YouTube transcriptions

13:12

basically anywhere where we can get some

13:14

Text data that data is becoming so

13:16

valuable let me put into context how

13:19

massive the data sets we're talking

13:20

about really are so here's a little bit

13:22

of text which is 276 tokens that's it

13:25

now if we zoom out that one pixel is

13:28

that many tokens and now here's a

13:30

representation of 285 million tokens

13:34

which is

13:35

0.02% of the 1.3 trillion tokens that

13:38

some large language models take to train

13:40

and there's an entire science behind

13:42

data pre-processing which prepares the

13:44

data to be used to train a model

13:47

everything from looking at the data

13:48

quality to labeling consistency data

13:51

cleaning data transformation and data

13:54

reduction but I'm not going to go too

13:55

deep into that and this pre-processing

13:58

can take a long time and it depends on

14:00

the type of machine being used how much

14:02

processing power you have the size of

14:04

the data set the number of

14:05

pre-processing steps and a whole bunch

14:08

of other factors that make it really

14:10

difficult to know exactly how long

14:11

pre-processing is going to take but one

14:13

thing that we know takes a long time is

14:15

the actual training companies like

14:17

Nvidia are building Hardware

14:19

specifically tailored for the math

14:21

behind large language models and this

14:23

Hardware is constantly getting better

14:25

the software used to process these

14:27

models are getting better also and so

14:29

the total time to process models is

14:31

decreasing but the size of the models is

14:33

increasing and to train these models it

14:35

is extremely expensive because you need

14:37

a lot of processing power electricity

14:40

and these chips are not cheap and that

14:43

is why Nvidia stock price has

14:44

skyrocketed their revenue growth has

14:46

been extraordinary and so with the

14:49

process of training we take this

14:50

pre-processed text data that we talked

14:53

about earlier and it's fed into the

14:54

model and then using Transformers or

14:57

whatever technology a model is actually

14:59

based on but most likely Transformers it

15:02

will try to predict the next word based

15:04

on the context of that data and it's

15:06

going to adjust the weights of the model

15:09

to get the best possible output and this

15:12

process repeats millions and millions of

15:14

times over and over again until we reach

15:16

some optimal quality and then the final

15:19

step is evaluation a small amount of the

15:21

data is set aside for evaluation and the

15:23

model is tested on this data set for

15:26

performance and then the model is

15:28

adjusted if necessary the metric used to

15:31

determine the effectiveness of the model

15:33

is called perplexity it will compare two

15:36

words based on their similarity and it

15:38

will give a good score if the words are

15:40

related and a bad score if it's not and

15:42

then we also use RLHF reinforcement

15:45

learning through human feedback and

15:47

that's when users or testers actually

15:50

test the model and provide positive or

15:52

negative scores based on the output and

15:54

then once again the model is adjusted as

15:57

necessary all right let's talk about

15:58

fine-tuning now which I think a lot of

16:00

you are going to be interested in

16:02

because it's something that the average

16:03

person can get into quite easily so we

16:06

have these popular large language models

16:08

that are trained on massive sets of data

16:11

to build general language capabilities

16:13

and these pre-trained models like Bert

16:16

like GPT give developers a head start

16:18

versus training models from scratch but

16:20

then in comes fine-tuning which allows

16:23

us to take these raw models these

16:25

Foundation models and fine-tune them for

16:28

our specific use cases so let's

16:30

think about an example let's say you

16:31

want to fine-tune a model to be able to

16:33

take pizza orders to be able to have

16:35

conversations answer questions about

16:37

pizza and finally be able to allow

16:40

customers to buy pizza you can take a

16:42

pre-existing set of conversations that

16:45

exemplify the back and forth between a

16:47

pizza shop and a customer load that in

16:49

fine-tune a model and then all of a

16:51

sudden that model is going to be much

16:53

better at having conversations about

16:55

pizza ordering the model updates the

16:57

weights to be better at understanding

16:59

certain Pizza terminology questions

17:02

responses tone everything and

17:04

fine-tuning is much faster than a full

17:07

training and it produces much higher

17:09

accuracy and fine-tuning allows

17:11

pre-trained models to be fine-tuned for

17:13

real world use cases and finally you can

17:16

take a single foundational model and

17:18

fine-tune it any number of times for any

17:21

number of use cases and there are a lot

17:23

of great Services out there that allow

17:25

you to do that and again it's all about

17:27

the quality of your data so if you have

17:29

a really good data set that you're going

17:31

to fine-tune a model on the model is going

17:33

to be really really good and conversely

17:35

if you have a poor quality data set it's

17:37

not going to perform as well all right

17:39

let me pause for a second and talk about

17:41

AI Camp so as mentioned earlier this

17:44

video all of its content the animations

17:46

have been created in collaboration with

17:48

students from AI Camp AI Camp is a

17:51

learning experience for students that

17:52

are aged 13 and above you work in small

17:55

personalized groups with experienced

17:57

mentors you work together to create an

18:00

AI product using NLP computer vision and

18:03

data science AI Camp has both a 3-week

18:06

and a one-week program during summer that

18:09

requires zero programming experience and

18:11

they also have a new program which is 10

18:13

weeks long during the school year which

18:15

is less intensive than the one-week and

18:17

3-week programs for those students who are

18:19

really busy AI Camp's mission is to

18:22

provide students with deep knowledge and

18:24

artificial intelligence which will

18:26

position them to be ready for AI in the

18:29

real world I'll link an article from USA

18:31

Today in the description all about AI

18:33

camp but if you're a student or if

18:35

you're a parent of a student within this

18:37

age I would highly recommend checking

18:38

out AI Camp go to ai-camp.org to learn

18:43

more now let's talk about limitations

18:45

and challenges of large language models

18:47

as capable as llms are they still have a

18:50

lot of limitations recent models

18:52

continue to get better but they are

18:53

still flawed they're incredibly valuable

18:56

and knowledgeable in certain ways but

18:58

they're also deeply flawed in others

18:59

like math and logic and reasoning they

19:02

still struggle a lot of the time versus

19:04

humans which understand Concepts like

19:06

that pretty easily also bias and safety

19:09

continue to be a big problem large

19:11

language models are trained on data

19:13

created by humans which is naturally

19:16

flawed humans have opinions on

19:18

everything and those opinions trickle

19:20

down into these models these data sets

19:23

may include harmful or biased

19:25

information and some companies take

19:26

their models a step further and provide

19:29

a level of censorship to those models

19:31

and that's an entire discussion in

19:32

itself whether censorship is worthwhile

19:35

or not I know a lot of you already know

19:36

my opinions on this from my previous

19:38

videos and another big limitation of

19:40

llms historically has been that they

19:42

only have knowledge up into the point

19:44

where their training occurred but that

19:46

is starting to be solved with chat GPT

19:49

being able to browse the web for example

19:51

Grok from x.ai being able to access

19:53

live tweets but there's still a lot of

19:55

Kinks to be worked out with this also

19:57

another big challenge for large

19:59

language models is hallucinations which

20:01

means that they sometimes just make

20:03

things up or get things patently wrong

20:06

and they will be so confident in being

20:08

wrong too they will state things with

20:10

the utmost confidence but will be

20:12

completely wrong look at this example

20:15

how many letters are in the string and

20:17

then we give it a random string of

20:18

characters and then the answer is the

20:21

string has 16 letters even though it

20:23

only has 15 letters another problem is

20:26

that large language models are

20:28

extremely Hardware intensive they cost a

20:31

ton to train and to fine-tune because it

20:34

takes so much processing power to do

20:36

that and there's a lot of Ethics to

20:39

consider too a lot of AI companies say

20:41

they aren't training their models on

20:43

copyrighted material but that has been

20:45

found to be false currently there are a

20:48

ton of lawsuits going through the courts

20:50

about this issue next let's talk about

20:52

the real world applications of large

20:54

language models why are they so valuable

20:57

why are they so talked about and

20:58

why are they transforming the world

21:00

right in front of our eyes large

21:02

language models can be used for a wide

21:04

variety of tasks not just chatbots they

21:07

can be used for language translation

21:09

they can be used for coding they can be

21:11

used as programming assistants they can

21:13

be used for summarization question

21:15

answering essay writing translation and

21:18

even image and video creation basically

21:20

any type of thought problem that a human

21:22

can do with a computer large language

21:24

models can likely also do if not today

21:28

pretty soon in the future now let's talk

21:30

about current advancements and research

21:32

currently there's a lot of talk about

21:33

knowledge distillation which basically

21:35

means transferring key Knowledge from

21:37

very large Cutting Edge models to

21:39

smaller more efficient models think

21:41

about it like a professor condensing

21:43

Decades of experience in a textbook down

21:46

to something that the students can

21:48

comprehend and this allows smaller

21:50

language models to benefit from the

21:51

knowledge gained from these large

21:53

language models but still run highly

21:55

efficiently on everyday consumer

21:57

hardware and it makes large language

21:59

models more accessible and practical to

22:01

run even on cell phones or other end

22:04

devices there's also been a lot of

22:06

research and emphasis on rag which is

22:08

retrieval augmented generation which

22:10

basically means you're giving large

22:12

language models the ability to look up

22:14

information outside of the data that it

22:16

was trained on you're using Vector

22:18

databases the same way that large

22:20

language models are trained but you're

22:22

able to store massive amounts of

22:24

additional data that can be queried by

22:26

that large language model now let's talk

22:28

about the ethical considerations and

22:30

there's a lot to think about here and

22:31

I'm just touching on some of the major

22:34

topics first we already talked about

22:36

that the models are trained on

22:37

potentially copyrighted material and if

22:39

that's the case is that fair use

22:41

probably not next these models can and

22:45

will be used for harmful acts there's no

22:47

avoiding it large language models can be

22:49

used to scam other people to create

22:52

massive misinformation and

22:53

disinformation campaigns including fake

22:56

images fake text fake opinions and

22:59

almost definitely the entire White

23:01

Collar Workforce is going to be

23:02

disrupted by large language models as I

23:05

mentioned anything anybody can do in

23:07

front of a computer is probably

23:09

something that the AI can also do so

23:11

lawyers writers programmers there are so

23:14

many different professions that are

23:16

going to be completely disrupted by

23:18

artificial intelligence and then finally

23:20

AGI what happens when AI becomes so

23:24

smart and maybe even starts thinking for

23:26

itself this is where we have to have

23:28

something called alignment which means

23:30

the AI is aligned to the same incentives

23:32

and outcomes as humans so last let's

23:35

talk about what's happening on The

23:36

Cutting Edge and in the immediate future

23:38

there are a number of ways large

23:40

language models can be improved first

23:42

they can fact check themselves with

23:44

information gathered from the web but

23:45

obviously you can see the inherent flaws

23:47

in that then we also touched on mixture

23:50

of experts which is an incredible new

23:53

technology which allows multiple models

23:55

to kind of be merged together all fine

23:57

tune to be experts in certain domains

24:00

and then when the actual prompt comes

24:02

through it chooses which of those

24:04

experts to use so these are huge models

24:06

that actually run really really

24:08

efficiently and then there's a lot of

24:10

work on multimodality so taking input

24:12

from voice from images from video every

24:15

possible input source and having a

24:17

single output from that there's also a

24:19

lot of work being done to improve

24:21

reasoning ability having models think

24:23

slowly is a new trend that I've been

24:26

seeing in papers like Orca 2 which

24:28

basically just forces a large language

24:30

model to think about problems step by

24:32

step rather than trying to jump to the

24:34

final conclusion immediately and then

24:37

also larger context sizes if you want a

24:40

large language model to process a huge

24:42

amount of data it has to have a very

24:44

large context window and a context

24:46

window is just how much information you

24:47

can give to a prompt to get the output

24:51

and one way to achieve that is by giving

24:53

large language models memory with

24:55

projects like MemGPT which I did a video

24:57

on and I'll drop that in the description

24:59

below and that just means giving models

25:01

external memory from that core data set

25:04

that they were trained on so that's it

25:05

for today if you liked this video please

25:07

consider giving a like And subscribe

25:09

check out AI Camp I'll drop all the

25:11

information in the description below and

25:13

of course check out any of my other AI

25:15

videos if you want to learn even more

25:17

I'll see you in the next one