This Server CANNOT Lose Data…

Linus Tech Tips
21 Mar 2024 · 27:57

Summary

TL;DR: The script details the process of upgrading a video production team's storage and editing server, named 'Whonnock,' to improve reliability and performance. The team transitions from a single-server setup to a high-availability configuration using Supermicro Grand Twin A+ servers, each housing four independent compute nodes with AMD EPYC Genoa processors, 384GB of memory, and NVMe drives. The new system is designed to handle the team's extensive video projects with minimal downtime, leveraging the Weka file system (WekaFS) for distributed storage and Axle AI for media asset management, and it demonstrates impressive read speeds and IOPS. The video also explores the potential of AI in video editing and the challenges of integrating new hardware into an existing infrastructure.

Takeaways

  • 🚀 The team has upgraded their storage and editing server, named 'Whonnock 10', to improve reliability and reduce downtime costs.
  • 🔄 The new server setup includes redundant drives and high availability features to ensure minimal disruption to the workflow.
  • 💻 Super Micro Grand Twin A+ servers power the new system, each containing four independent computers within a 2U chassis.
  • 🌟 Each server node boasts 384GB of memory, an AMD EPYC Genoa processor with 64 cores, and six NVMe bays (up to 24 drives across each 2U chassis).
  • 💡 The servers are equipped with 2200W 80 PLUS Titanium power supplies to handle the high-performance components.
  • 🔌 The system supports live drive upgrading and swapping without downtime, thanks to the flexibility of Weka (WekaFS).
  • 📊 The new storage solution demonstrated impressive read latency of 131 microseconds, achieving 4 million IOPS (Input/Output Operations Per Second).
  • 🔄 The team tested the high availability by simulating server failures, which were handled smoothly without affecting ongoing editing tasks.
  • 🎥 The setup is designed to support video editing with Adobe Premiere, which is sensitive to latency and benefits from the system's low-latency storage.
  • 🤖 AI integration allows for efficient search and organization of the extensive video archive, enabling quick retrieval of specific clips.
  • 🔗 The infrastructure relies on a combination of CPU and GPU resources for different AI tasks, with a dedicated GPU workstation for object and scene analysis.

Q & A

  • What is the main issue with the current storage setup mentioned in the script?

    -The main issue with the current storage setup is that it is all housed in a single server, which poses a significant risk of downtime and data loss if that server fails.

  • What is the term used in the script to describe the new server setup?

    -The term used to describe the new server setup is 'high availability', which aims to minimize downtime and data loss by incorporating redundancy and fault tolerance.

  • Which company sponsored the new server setup with their servers?

    -Super Micro sponsored the new server setup with their servers.

  • What are the key components of the Super Micro Grand Twin A+ server mentioned in the script?

    -The key components of the Supermicro Grand Twin A+ server include four independent compute nodes, each with its own motherboard, 384 GB of memory, an AMD EPYC Genoa processor with 64 cores, dual M.2 slots for redundant boot drives, six PCIe Gen 5 2.5-inch NVMe drive bays up front, and rear I/O including two PCIe Gen 5 x16 connections.

  • How many NVMe drives are installed in each node of the server setup?

    -Two NVMe drives are installed in each node of the server setup, with one being a 7 TB drive and the other a 15 TB drive.

  • What is the purpose of the OCP 3.0 small form factor mezzanine slots in the server setup?

    -The OCP 3.0 small form factor mezzanine slots are used to install the ConnectX-6 200 Gbit cards from Nvidia, which provide high-speed network connectivity for the servers.

  • What is the significance of the 2200 Watts 80 Plus Titanium power supplies in the server setup?

    -The 2200 Watts 80 Plus Titanium power supplies are necessary to handle the high power requirements of the server setup, which includes four 400 Watt EPYC Genoa CPUs, a large amount of RAM, up to 24 NVMe drives, and eight network cards.

  • How does the new server setup handle the potential failure of a machine?

    -The new server setup is designed to be highly available, meaning it should continue operating uninterrupted even if one or two machines fail, thanks to its redundancy and the fact that each system has its own drives (see the quorum sketch after this Q&A list for how a cluster decides which side keeps serving after a failure).

  • What is the role of the Weka (WekaFS) file system in the server setup?

    -Weka (WekaFS) is a high-performance, distributed file system designed specifically for NVMe drives. It handles distributing terabytes of data across multiple machines and provides low-latency, high-throughput storage for video editing and other demanding applications.

  • What is the AI technology being used for in the script?

    -The AI technology is being used for media asset management, which includes facial recognition, object detection, and scene understanding. This allows for efficient search and retrieval of specific video clips and content from a large archive of footage.

  • How does the new server setup affect the editing process?

    -The new server setup provides a highly available and fast storage solution that minimizes downtime and improves the editing process by ensuring smooth access to large video files and projects, reducing the risk of software crashes due to latency.
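
A note on how clusters avoid the "two servers disagree" problem referenced above: most high-availability systems only let a partition keep serving if it can see a strict majority of nodes. Below is a minimal illustration of that rule (generic quorum logic, not Weka's actual internal consensus, which the video does not describe):

    # Minimal quorum illustration: a partition keeps serving only if it
    # holds a strict majority of the cluster's voting members.
    # Generic sketch, NOT Weka's actual consensus implementation.

    def has_quorum(visible_nodes: int, cluster_size: int) -> bool:
        """True if this partition holds a strict majority of votes."""
        return visible_nodes > cluster_size // 2

    CLUSTER_SIZE = 8  # the video's eight-node deployment

    print(has_quorum(6, CLUSTER_SIZE))  # True: losing 2 of 8 still leaves a majority
    print(has_quorum(4, CLUSTER_SIZE))  # False: a 4/4 split must stop to avoid split brain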

Outlines

00:00

🚀 Introduction to High Availability Storage Solution

The paragraph introduces the need for a reliable, fast storage solution given the high volume of video production work. The team's main editing server, 'Whonnock,' has been reliable over the years, but as the team grows, even a single minute of downtime costs over $50 in payroll. The solution is to add redundancy, achieved through a new setup called 'Whonnock 10,' emphasizing high availability (HA). The setup includes two Grand Twin boxes, each containing four independent servers, provided by Supermicro, the sponsor of the project.
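
To put the "$50 per minute" figure in perspective, here is a quick back-of-the-envelope calculation of what common availability targets would cost per year at that rate (the targets are generic industry "nines", not an actual LTT SLA):

    # Downtime cost per year at the video's ~$50/minute payroll figure.
    # Availability targets are generic "nines", purely for illustration.
    MINUTES_PER_YEAR = 365 * 24 * 60
    COST_PER_MINUTE = 50  # dollars, from the video

    for label, availability in [("99%", 0.99), ("99.9%", 0.999), ("99.99%", 0.9999)]:
        downtime_min = MINUTES_PER_YEAR * (1 - availability)
        print(f"{label}: {downtime_min:7.1f} min/yr -> "
              f"~${downtime_min * COST_PER_MINUTE:,.0f}/yr in payroll alone")
    # 99%:   5256.0 min/yr -> ~$262,800/yr
    # 99.9%:  525.6 min/yr -> ~$26,280/yr
    # 99.99%:  52.6 min/yr -> ~$2,628/yr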

05:01

🛠️ Detailed Overview of the Super Micro Grand Twin A+ Server

This paragraph delves into the specifics of the Supermicro Grand Twin A+ server, highlighting its capabilities and components. Each 2U chassis contains four independent computers, each with its own motherboard, 384GB of memory, an AMD EPYC Genoa processor with 64 cores, dual M.2 slots for redundant boot drives, and six PCIe Gen 5 2.5-inch NVMe bays. The power supplies are discussed, with each unit providing 2200W and carrying an 80 PLUS Titanium certification. The paragraph also touches on the high-availability aspect, mentioning the need for redundant network cards and switches, and the goal of having the new setup withstand the failure of individual components without affecting operations.
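
For a sense of why 2200W supplies are reasonable, here is a rough per-chassis power budget. All per-component wattages below are ballpark estimates for illustration, not Supermicro's published figures:

    # Rough peak power budget for one Grand Twin chassis (four nodes).
    # Component wattages are estimates, not official specifications.
    CPU_W = 400    # worst-case EPYC figure quoted in the video
    DIMM_W = 10    # per DDR5 RDIMM (estimate)
    NVME_W = 20    # per U.2 NVMe drive under load (estimate)
    NIC_W = 25     # per ConnectX-6 card (estimate)

    nodes = 4
    total = nodes * (CPU_W + 12 * DIMM_W + 6 * NVME_W + 2 * NIC_W)
    print(f"~{total} W worst case per chassis")  # ~2760 W
    # In practice the 9534's 300W TDP and a lighter drive load keep the
    # draw low enough that one 2200W unit can carry the chassis if its
    # partner fails, which is the point of the redundant pair.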

10:03

🔧 Assembly and Configuration of the High Availability System

The paragraph describes the assembly process of the new high-availability system. It discusses the installation of the CPUs, the application of thermal paste, and the population of all twelve memory channels per node with 32GB DIMMs of DDR5 ECC RAM. The paragraph also covers the installation of boot drives and the storage configuration, which includes two NVMe drives of different capacities in each node. The setup's dashboard and the way Weka dedicates cores to specific tasks, such as drive containers and compute cores, are also explained. The paragraph concludes with a test of the system's resilience by simulating a server failure.
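
For reference, the memory figures work out as follows (simple arithmetic on the numbers given in the video):

    # One 32GB DDR5 ECC DIMM in each of Genoa's 12 memory channels, 8 nodes.
    dimm_gb, channels, nodes = 32, 12, 8
    per_node = dimm_gb * channels   # 384 GB per node
    cluster = per_node * nodes      # 3072 GB, i.e. 3 TB across the cluster
    print(per_node, cluster)        # 384 3072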

15:04

📊 Impressive Performance Metrics and Future Plans

This paragraph focuses on the performance metrics of the new storage system, highlighting the impressive read latency and IOPS (Input/Output Operations Per Second) achieved. It discusses the system's ability to handle high throughput and the potential for future upgrades, such as increasing the number of drives per node. The paragraph also talks about the use of the system for various types of workloads, including video editing and AI development, and mentions the capabilities of the Weka file system. The potential for using the system as a file server for Windows machines is also discussed.
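
The video doesn't show exactly how these numbers were gathered, but random-read figures of this kind are commonly measured with fio. A minimal sketch follows; the mount point, queue depth, and job count are assumptions for illustration, not the team's actual test:

    # Sketch of a 4K random-read fio run of the sort that produces
    # IOPS/latency figures like those quoted. Parameters are assumed.
    import subprocess

    subprocess.run([
        "fio", "--name=randread",
        "--directory=/mnt/weka",   # assumed mount point of the shared filesystem
        "--rw=randread", "--bs=4k", "--direct=1",
        "--ioengine=libaio", "--iodepth=32", "--numjobs=16",
        "--size=4G", "--runtime=60", "--time_based",
        "--group_reporting",
    ], check=True)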

20:05

🎥 Utilizing AI for Media Asset Management and Searchability

The paragraph discusses the integration of AI for media asset management, allowing for the search and retrieval of specific clips based on content. It describes the process of generating proxies for the vast archive of footage and the use of a GPU workstation for AI analysis. The paragraph outlines the capabilities of the AI in identifying objects, scenes, and even specific people within video clips. It also touches on the potential for improving searchability and the challenges of managing the cables in the server rack.
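
The video doesn't detail the proxy pipeline beyond the files being "only 5 megabit", but a batch job along these lines is how such proxies are commonly generated with ffmpeg (paths, codec, and resolution are illustrative assumptions):

    # Sketch: batch-generate ~5 Mbit/s editing/AI proxies with ffmpeg.
    # Source and destination paths are hypothetical.
    import pathlib
    import subprocess

    SRC = pathlib.Path("/mnt/archive")        # hypothetical source footage tree
    DST = pathlib.Path("/mnt/weka/proxies")   # hypothetical proxy destination

    for clip in SRC.rglob("*.mov"):
        out = DST / clip.relative_to(SRC).with_suffix(".mp4")
        out.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run([
            "ffmpeg", "-i", str(clip),
            "-vf", "scale=-2:720",            # downscale to 720p, keep aspect
            "-c:v", "libx264", "-b:v", "5M",  # ~5 Mbit/s, per the video
            "-c:a", "aac", "-b:a", "128k",
            str(out),
        ], check=True)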

25:09

💻 Testing the Resilience of the High Availability System

The final paragraph demonstrates the resilience of the high-availability system by intentionally removing servers from the network to simulate a catastrophic failure. The system continues to function smoothly despite the removal of critical components, showcasing its robustness. The paragraph concludes with acknowledgments to Supermicro for the servers, Weka for the software, and Axle AI for the AI detection, as well as a thank-you to the viewers for their support.

Keywords

💡High Availability

High Availability refers to the design and maintenance of systems to ensure they are accessible and operational as much as possible. In the context of the video, it is crucial for the team's workflow as downtime can be costly. The video discusses the implementation of a server setup that aims to minimize downtime and maintain high availability, even in the event of hardware failure.
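
Quantitatively, redundancy compounds availability. Under the idealized assumption of independent node failures (a textbook model, not a figure from the video), if each node is up with probability a, then all n nodes are down simultaneously with probability (1 - a)^n:

    A_cluster = 1 - (1 - a)^n
    e.g. a = 0.99, n = 3:  A = 1 - 0.01^3 = 0.999999 (six nines)

Real clusters additionally need a quorum of surviving nodes rather than just one, so treat this as an upper bound.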

💡Whonnock Final Form

Whonnock Final Form ('Whonnock 10') is the name given to the upgraded server system described in the video. It represents the culmination of efforts to improve the team's storage and editing capabilities. The name suggests that this is the final and most advanced version of the Whonnock system, emphasizing its high performance and reliability.

💡Redundant NVMe

Redundant NVMe refers to the use of non-volatile memory express (NVMe) drives in a redundant configuration to ensure data reliability and system performance. In the video, this term is used to describe the storage solution that allows for high-speed data access and the ability to withstand the failure of individual drives without data loss or significant performance degradation.

💡Supermicro Grand Twin A+ Server

Supermicro Grand Twin A+ Server is a specific model of server hardware described in the video. It is notable for housing multiple independent compute nodes within a single 2U chassis, each with its own motherboard, memory, and processing power. This server is integral to the video's narrative as it forms the backbone of the new high availability system.

💡Weka (WekaFS)

Weka, formerly known as WekaIO, is a high-performance, scalable file system designed for data-intensive workloads. In the video, it is used as the distributed storage solution that provides the necessary speed and reliability for the team's video editing and storage needs. WekaFS is highlighted for its ability to handle large amounts of data with low latency, making it ideal for high-speed storage requirements.

💡ConnectX-6 200Gb Cards

ConnectX-6 200Gb Cards are high-speed network interface cards (NICs) used to connect servers to the network. These cards, provided by Mellanox (now part of Nvidia), are capable of delivering 200 gigabit per second (Gbps) of bandwidth. In the context of the video, these cards are essential for the server's ability to handle the data transfer demands of the video editing workflow.
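
The roughly 250 Gbit/s slot-speed ceiling mentioned for these cards checks out from standard PCIe figures (first-principles arithmetic, not a number from the video):

    # Why a PCIe Gen 4 x16 slot caps a dual-200GbE card at ~250 Gbit/s:
    # Gen 4 signals at 16 GT/s per lane with 128b/130b encoding.
    lanes, gt_per_s, encoding = 16, 16e9, 128 / 130
    usable = lanes * gt_per_s * encoding       # ~2.52e11 bits/s
    print(f"{usable / 1e9:.0f} Gbit/s")        # ~252, before protocol overhead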

💡Mother Vault

Mother Vault is the name given to the team's archival storage system, which is described as holding years' worth of footage and data. It is a critical component of their infrastructure, allowing them to maintain access to a vast library of content. The video discusses the integration of the Mother Vault with the new server setup to improve efficiency and accessibility of the stored content.
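
The transcript notes that data tiers down to the object store in 64-megabyte chunks and that the object store must be at least as large as the flash tier. A quick illustration of what that means at the scales discussed (arithmetic only, using the video's round numbers):

    # Tiering arithmetic: ~100 TB of flash tiering down in 64 MB chunks.
    flash_tb, chunk_mb = 100, 64
    chunks = flash_tb * 1_000_000 // chunk_mb
    print(f"{chunks:,} chunks to tier a full {flash_tb} TB flash layer")
    # 1,562,500 chunks; the object store must be at least the flash tier's
    # size, since everything on flash is eventually mirrored down to it.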

💡AI Detection

AI Detection in the context of the video refers to the use of artificial intelligence to analyze and identify objects, scenes, and faces within video footage. The video describes the use of AI for content analysis within their storage system, which can significantly improve the organization and retrieval of media assets.

💡Latency

Latency in the context of the video refers to the delay in data transfer or the time it takes for a request to be processed and responded to. Low latency is crucial for video editing, as it affects the responsiveness of the editing software and the overall workflow efficiency. The video highlights the importance of minimizing latency in their storage and server setup.

💡Scalability

Scalability refers to the ability of a system to handle growth, either in the number of users or the volume of data, by adding resources or components as needed. In the video, scalability is a key feature of the new server setup, allowing the team to accommodate future growth in their storage and processing needs without significant disruptions to their workflow.

💡Redundant Power Supplies

Redundant Power Supplies indicate that a system has backup power sources to ensure continuous operation in the event of a power failure. In the video, this is a critical component of the high availability setup, ensuring that if one power supply fails, the system can continue to function without interruption.

Highlights

The team has reached a point where a single minute of downtime costs over $50 in payroll, emphasizing the need for high availability storage solutions.

The main editing server, Whonnock, has been reliable for years, but as the team grows, the need for redundancy becomes more critical.

Whonnock 10 is introduced as the final form of the server, designed for high availability and able to shrug off a node being unplugged without noticeable impact.

The new server setup includes two Grand Twin boxes, each containing four entire servers provided by Super Micro, the sponsor of the project.

Each server inside the Grand Twin boxes has 384GB of memory, an AMD EPYC Genoa processor with 64 cores, and dual M.2 slots for redundant boot drives.

Each node features six PCIe Gen 5 2.5-inch NVMe bays up front, with rear I/O that includes two full PCIe Gen 5 x16 connections.

The servers are equipped with 2200 Watts 80 Plus Titanium power supplies, capable of handling high-performance components.

The network cards installed are ConnectX-6 200 Gbit cards from Nvidia, providing high-speed connectivity, though as PCIe Gen 4 cards they are limited by the slot to around 250 Gbit/s.

The server design allows for high availability, with the system able to continue operating even if one of the servers dies.

The new server setup is tested by moving the entire team onto it without notice during a busy workday, demonstrating its reliability and performance.

The software used for distributing terabytes of video projects and other data is Weka, a redundant, NVMe-first file system.

The CPU chosen for the servers is the AMD EPYC Genoa 9534, a 64-core, 128-thread processor with a quarter gigabyte (256MB) of L3 cache and a 300W TDP.

The memory installed is 32GB DIMMs of DDR5 ECC in all twelve memory channels, totaling 384GB per node and 3TB across the eight nodes.

The storage solution includes two Kioxia CD6 Gen 4 NVMe drives in each node, with plans to switch to larger drives in the future.

Weka supports live upgrading and downgrading of drives, allowing for easy maintenance and expansion of the storage system.

The Weka dashboard provides a clear overview of the cluster servers, showing the allocation of dedicated cores to specific tasks and the overall performance.

The system achieved 4 million read IOPS at an average latency of about 1 millisecond, with read latencies as low as 131 microseconds, exceptional for a file system running over a network.
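
Those two figures are mutually consistent under Little's Law (L = λW); this is a sanity check, not a measurement from the video:

    # Little's Law: in-flight I/Os = throughput x latency.
    iops = 4_000_000        # reads per second, from the video
    latency_s = 1e-3        # ~1 ms average latency, from the video
    print(iops * latency_s) # 4000.0 outstanding I/Os across the cluster
    # At the 131 microsecond best-case latency the same math gives ~524.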

The server setup is designed to avoid single points of failure, with each machine in the cluster being part of the SMB cluster for uninterrupted operation.

The use of AI for media asset management allows for efficient searching and organization of the vast amount of footage, enhancing the usability of the storage system.

The server's cooling system includes four fans, with a unique counter-rotating design in the IO module for efficient heat dissipation.

Transcript

00:00

When you make as many videos as we do, you need a lot of fast, reliable storage, and our main editing server, Whonnock, has checked all of those boxes for years. It's a great little server: it's built out of high-quality components, and it even looks cool. But as our team has grown, we've reached the point where even a minute, one single minute, of downtime costs over $50, and that's just in payroll. Now, practically speaking, the way to mitigate that is by adding redundancy. Our drives are already redundant; we've got 20 drives in there with data striping. But the problem is they all sit in one single server. I'm sure you can see where this is going. It's been over a year in the making, but it's finally here: Whonnock final form. And I'm calling it Whonnock 10 because it's the last Whonnock ever. "Availability! I've told you this like 10 times. Nobody even knows what high availability means." "It means it's... just go ahead, unplug one. Do it. Go for it." "Well, okay, I should probably tell you the stakes before you do that. Each of these two Grand Twin boxes has four entire servers inside of them, provided by Supermicro, who sponsored this whole thing, and they're set up with Weka, a redundant, NVMe-first file system. In this config it should sustain two entire servers dropping out without anyone even noticing. Except that we moved the entire team onto it last night without telling anyone, and it's the middle of the workday with a ton of high-priority videos in progress. Do you really want to test it right now?" "I... like, I haven't tried that." "All right, here we go. Okay, what could go wrong?" "I mean... a lot."

[Applause]

01:50

Naturally, a huge part of a project like this is the software: the stuff that's going to handle distributing all of our terabytes of video projects, Word documents, and Linux ISOs to the multiple machines that we just showed you. But we can't install any software until we have some hardware, so why don't we start there?

02:08

Meet the Supermicro Grand Twin A+ Server AS-2115GT-HNTR. Despite its sort of ordinary-looking appearance and unexciting-sounding name, it is anything but ordinary, and it is very exciting, because inside this 2U are four independent computers. But for what we're doing, four nodes? Please, we want eight. Inside each of these is a completely independent motherboard, 384 gigs of memory, an AMD EPYC Genoa processor with 64 cores, dual M.2 slots for redundant boot drives, six PCIe Gen 5 2.5-inch NVMe bays up front, and I/O in the rear. Now, this bit here could be a little confusing at first glance, but that is because not only do we have USB, we have two full Gen 5 x16 PCIe connections back here, along with display output and power for the entire server. This whole thing slides into the chassis, which holds a really cool modular backplane assembly that we'll take a look at in a minute, and then passes through (thank you, Jake) to the back of the chassis, where you've got a management port and a single USB port for each server. "Nope, it's two, and they're shared." "What? I was about to ask, because we've also got a single VGA. You see the button? For two servers?" "No way. This button toggles?" "Yeah."

03:50

Okay, before we talk about that a little bit more, look at these power supplies. Each of these is 2200 watts, 80 PLUS Titanium, which sounds like a lot, but when you're potentially handling four 400-watt EPYC Genoa CPUs along with a bunch of RAM, up to 24 NVMe drives, and eight network cards, well, it seems downright reasonable, doesn't it? "Is it 24 drives? Can't be. Six..." "Yes: 6 × 4 is 24." And of course, that's just one of them; we've got two of those, and that means that in the event that one of these dies, the system should be able to continue to operate uninterrupted, which is a big part of the high-availability goal that we have for this deployment.

04:36

Speaking of high availability, let's move on to our network cards. Each of those PCIe Gen 5 x16 slots I showed you guys before terminates in one of these OCP 3.0 small-form-factor mezzanine slots, and what we're putting in them is these ConnectX-6 200-gigabit cards from Mellanox... excuse me, from Nvidia. These are the older Gen 4 ones, so they're going to be limited by the slot speed of around 250 gigabit per second, but if we had newer cards, each of these nodes could do 200, plus another 200... 400, up to 800 gigabit, which would of course be a complete waste for us: A, because our workload can't take advantage of it, and B, because our switch is only 100 gigabit. Of course, the two ports are still helpful; we do have redundant switches.

05:33

"Except there's kind of a problem here: that's still a single point of failure. In a perfect world we would have two single-port NICs, so if a NIC were to die it would still be okay." "But because we have so many nodes, we're not really worried about an individual node. You know, they could have one boot drive and it dies, or one NIC and it dies; we still have an extra backup." "How many nines do you want?" "I mean, I don't know, like, one would be good. Nine percent!" Which, jokes aside, is a really good point. If we were architecting this properly, there are so many more considerations that we would need to make. The power coming into the rack would have to come from two independent, backed-up sources. The connectivity to our clients would have to be redundant as well. The connectivity between all of the systems would have to be architected in such a way that no matter what fails, everything stays up. Realistically, for us, we're not going to get that deep into it, because our goal is "better than we had before," which was a single machine with its own built-in redundancies, but other than that, nothing. Now, at least, we should be able to lose a full machine out of these eight; we can restart one of our core switches, totally fine; and with two machines out of these eight down, we can still be limping along. I mean, limping is a bit of a stretch; it's going to be very fast.

06:53

Now, normally if you buy a Supermicro machine, they're going to pre-build it for you, they're going to validate it for you; you can even have them pre-build an entire rack, or racks, of these things and then validate your application on it before it ships to you. In fact, we've got a whole video that we did about that, sponsored by Supermicro, a little while back. Of course, this is LTT, my friends, so we will be assembling this one ourselves. "Do you like that spin of the screwdriver above the server?" "Don't worry, I won't miss. I'll never miss. See, I could do this a hundred times and I would never miss." "Why?!" "No, it's fine, it's good, it's okay. We have seven more."

07:29

Anywho, for our CPU we've gone with an EPYC Genoa 9534. This is a 64-core, 128-thread monster of a CPU. It'll do 3.7 GHz max boost, it has a quarter gigabyte of level 3 cache and a 300-watt TDP, it supports DDR5 memory in up to 12 channels, and it supports a whopping 128 lanes of PCIe Gen 5. Originally we were intending to go with 32-core chips, but they were out of stock, so: free upgrade, lucky us. Compared to previous-generation AMD EPYC CPUs, Genoa is a big step up in terms of I/O performance, which makes it perfect for this application. And in the long term, I mean, if we've got all the extra CPU cores and a whole bunch of RAM anyway, why run Weka on the bare metal when we could install Proxmox and then use the other cores for, I don't know... "A high-availability Plex server." "Yeah, Linux ISOs. More realistically, it would be something like Active Directory." "Yeah, which we don't really want to do right now, because if you run Active Directory on one server and it goes down, you're going to have a really, really bad time. But if you run it on a bunch of servers..." "Yeah, it's good. Great."

08:48

So, normally server CPU coolers would come with their own thermal paste pre-applied, but since we're doing this ourselves, and, if you look carefully, it's not the first time that it's been installed, we are going to be using (okay, thank you for that) a piece of Honeywell PTM7950. This stuff is freaking awesome. It has great thermal transfer properties, and it can handle varying temperatures; not even just varying, but a lot of huge cycles, for a very, very long time. Now available at lttstore.com! "Is that big enough? Does that cover all of the CCDs and CCXs?" "Oh, is there a second piece of... am I stupid? Is there a second piece of plastic?" "No, there isn't. Should I put one in the fridge?" "No, no, no, it's totally fine. I've done this a bunch of times. Oh, she's in. Look at that. See? All right, easy." "I would recommend putting it in the fridge before you use it."

09:38

All right. To ensure we're making the absolute most of our CPU, especially in this high-throughput storage workload, we're going to be populating all 12 of our memory channels with 32-gig DIMMs of DDR5 ECC running at 4,800 megatransfers per second. That's a total of 384... "Three terabytes of memory?!" "What?" "Across all eight." "Oh."

10:05

Each of the cables Jake is removing right now is a PCIe x8 cable that feeds two of the drive bays in the front, but the reason he's taking them out is so that we can install our boot drives. These are consumer grade: each system is getting two Sabrent 512-gig Gen 3 Rocket drives, and it's not because they're particularly special in any meaningful way. They're not even that fast by modern standards. But what they are, from our experience, is reliable enough, and they are fast enough for what we're going to be doing, which is just booting our operating system off of them. Movie magic! All of the other nodes are already built. "So what do you mean, movie magic? Supermicro built them." "Oh, I thought you built them." "Supermicro builds them for you. I took it apart." "Okay, fine, I took that one apart." "No secrets left anymore." "Yep. No intrigue, no mystery."

10:55

"You know what is still mysterious? Inside of here. I've actually never opened this before." "Oh, okay, let's have a look. Woo! Holy... oh, that's power supplies." "Yeah. This is so cool. So the whole computer is cooled by four fans." "No way." "There's the two power supply fans, and then these fans in their, what do they call this, I/O module, I think is what they call it. Look at the blades on this thing. Counter-rotating." "You're serious? That's what you're looking at? Not this, the most delicate of spaghet?" "Oh my god, there's not even connectors. Every one of these wires is soldered directly to the back of the OCP 3.0." "What?" "Yeah."

11:28

For storage, we're installing two Kioxia CD6 Gen 4 NVMe drives in each node: one that is 7 terabytes and another that is 15 terabytes. They're kind of placeholders for now, and in the long term we're going to switch to something in the neighborhood of about four 15-terabyte drives per node, but the drives we want to use are currently occupied by... "Oh, that project." ...by a top-secret, pastry-related project, so that's going to have to wait. The good news is that when those drives become available, Weka supports live upgrading and downgrading, so we can just pull these drives, swap in the new ones, pull, swap, pull, swap, pull, swap, as long as we don't do it all at once.

12:10

Are we ready to fire these things up? "Okay, there's a lot going on here. What is that? Is that a switch?" "Yep. Hey, look, you can see the button now." "Oh, that's cool." What you're hearing so far is just the Nvidia SN3700 32-port 200-gig switch. "Oh my god, it even says Mellanox on the front." "I know. Maybe it's an old review sample demo unit. We got it with the $1 million PC, and I'm pretty sure that was already Nvidia at that point." "Can you hear that? You hear it getting louder?" "Yeah. Ooh, well, that one's just excited to see us."

12:42

This is the Weka dashboard. Maybe if I go over here, cluster servers, we can see all of our servers. We have two drives per node, and then, of course, this is a very interesting part of how Weka works. It's not like TrueNAS, let's say, where it just uses the whole CPU for whatever you're trying to do. They dedicate, and fence off, specific cores for specific tasks. For instance, each drive gets a core, so we've got two drive containers. "That means... a full core per drive?" "Yeah." "Damn." "Yeah. You also have compute cores, which do the parity calculation and inter-cluster communication, and then there's front end, which you don't necessarily always have. Front-end cores manage connecting to a file system, so if you just had drives and compute, you wouldn't be able to access the files on this machine. So you would have your back-end servers, right? Those would run drives and compute, which is the cluster. And then on your GPU box, say, you would run just the front end, and that would allow the GPU box to connect to the back-end cluster servers." "Oh, so the back-end cluster servers don't need to run a front end?" "Not unless you want to be able to access the files on that machine, or from that machine, which we want, because we're using SMB. We're using it as a file server, a stupid NAS, for our stupid Windows machines." "Yeah." "You can also have a dedicated front-end machine." "Yes, so if you had, like, a hundred back-end servers..." "But then that's adding a single point of failure, which is what we're trying to avoid." "You could have multiple of them." "Okay. They thought of that." "Yeah. I set it up so every single machine in the cluster, all eight of them, are part of our SMB cluster, which means it cannot go down, realistically."

14:26

There are a ton of other file systems out there that you could use for something like this. TrueNAS has their scale-out setup for clustered ZFS, which only requires three nodes and is something we'd be quite interested in trying out. Or, if you're looking for object storage, there's a million options, but the main open-source one, MinIO, requires only four nodes. Though, when we saw how nuts Weka was when we set up the million-dollar server cluster, I mean, we had to try it out for ourselves. And try it out we did.

14:57

"So this is each node..." "Holy—. Look, okay, the crazy thing is, look at the read latency." "Now, guys, look, hold on, hold on. At 70 gigabytes a second, we've seen numbers like this before, but we're talking, in some cases, double the number of drives and no file system. Without a file system, like, raw to each drive. This is with a file system." "With a file system, over a network." "And we're only using 100-gig ports. Usually with a Weka setup like this you'd probably use 200." "Yeah, cuz we... oh my god, we didn't even have networking as a factor last time. All the drives were in one box." "I know. This is networking too, and the crazy part is we're not using RDMA. This is some fancy, uh, what's it called... DPDK, I think, is the library." "This is wild." "Yeah, look at that. So, read latency: 131 microseconds. That's 4 million read IOPS with a latency of 1 millisecond average." "Are we able to keep using WekaFS? Like, this is a trial, okay? This software is quite expensive." "This is unreal. 4 million IOPS." "It is unreal. It's way more than we could possibly ever need, but it's cool." "It's so cool."

16:17

"Don't they support tiering and everything?" "Oh yeah. Here, I'll show you what that looks like. This is on Mother Vault, which I think right now has 400 tebibytes left, so let's say max capacity is 400 terabytes. Now, once we run out of the 100 terabytes of SSD capacity, which you can see here, it'll just... it'll tier. I mean, it automatically tiers anyways, and you do need to make sure that your object store is at least the same size as the flash, or bigger, because it's going to automatically tier everything to it." "That makes sense. So, in theory, we manually copy everything from Vault one time to Weka..." "One time, because it stores in 64-megabyte chunks, and then it just stays there forever." "Stays there forever. And then we just have one network share, and when something needs to get vaulted, you just move it..." "You allow it to decay. Yeah, you would probably move it from pending projects to, like, done, or something like that." "We make a folder for done?" "Yeah, sure, and then it will just do it automatically." "Wow." "Or if it's a video that somebody was working on, and then, you know, it's been on hold for three months and we shot a terabyte of footage, it will just... and then, when we're ready to work on it, it'll promote it back up." "Holy—, we could net boot off of this. Follow-up video?" "Yeah, I mean, why not? It's so fast. You literally could not... we couldn't saturate this."

17:36

Now, a lot of you at this point must be thinking, "Gosh, mister, that's an awful lot of computers for high availability. Couldn't you do this with two?" And you're not that far off. The old-school high-availability NetApp storage appliances, like that one we looked at recently, did have just two machines, but those were both connected to the same storage drives. If each system has its own drives, things can get out of sync. Let's say one machine has downtime: you can run into a situation where each system believes, with all the conviction in its heart, that it has the correct data. And then, if all you have is two, how will they decide who's right?

18:24

This is typically referred to as split brain, and that's why the majority of high-availability systems have, at bare minimum, three servers. This allows the third system to be a tiebreaker of sorts in the case of a disagreement. Now, in our case, Weka, that stupid ultra-fast file system that we're using, which, unlike anything we've used before, has been built specifically for NVMe drives, not hard drives, well, it requires a minimum of six nodes, with a recommendation of eight.

19:02

But running Weka can still be an advantage. Video editing with Adobe Premiere, like we use, is very latency sensitive, and even a small delay when going to access a clip can be enough to make the software crash, so any improvement there is huge. Not to mention that a pair of these Grand Twins specced out to the max, with 128-core EPYC Bergamo CPUs, would get you, in just four rack units, 1,000 CPU cores (actually a little more), 24 terabytes of DDR5, and up to 3 petabytes of NVMe storage. I mean, that makes our setup seem downright reasonable. Now, the average Weka customers are going to be a little more demanding than us: visual effects studios, AI developers, genomics labs, all the folks out there that need stupid-fast, low-latency storage. And Weka showed us screenshots of clusters that were reading in excess of 1 terabyte per second, consistently. Obviously, that was a bigger cluster, but it shows you what can be achieved with this kind of hardware running on, I mean, what used to be the crappier option: software RAID. Man, I feel bad even calling it that these days.

20:14

I had an interesting idea with the Supermicro folks. So, you know how we have, like, two petabytes, 13 years' worth, of footage? Thousands and thousands of hours of footage. It's really cool that we have it, but it's really hard to use unless you just happen to know what video the thing you were looking for is in. Well, what if you could just search for something? "Linus Sebastian. I want every clip with Linus Sebastian." "Wow. Bam, look at that." "And let's say there's this one that's detected that it's you throughout the entire clip. You're in a chair, so you could search for clips of Linus sitting down, with a keyboard..." "Yeah, like, we're going to be able to actually find stuff." "Yeah. Right now, there is a finite amount of objects that are trained. I mean, chihuahua... let me scroll through this, it's a lot. Eventually you'll be able to train it and tell it, hey, this is what a computer fan looks like, or this is what an SSD looks like." "Oh my god, that is so cool. So wait, is this running on these extra CPU cores, or..."

21:21

"Okay, no, not right now. Faces and logos are running on CPU." "Yeah." "Objects, OCR, and scenes run on GPU." "Got it." "But they're not running on any of those machines. They're running on a GPU workstation that Supermicro sent that's sitting at my desk. Um, it was heavy. Anyways, what is happening on that new server is proxies, because if we were to analyze the original clips... oh, file formatting is a huge problem. When you go into an AI model, it might not necessarily support the codec that you're filming in." "Sure." "But also, clips are, like, hundreds of megabytes a second, potentially. That would take forever. So instead, it generates proxies of everything first, which we're dumping to that new server, and then we can take advantage of the lightning-fast storage." "Yeah, and we have massive compute, and we can basically create a proxy map of what everything is in the main archive." "Right. That is so cool." "So far I've generated 2.6 terabytes of proxies, which might not sound like a lot, but they're only 5 megabit, so it's actually, like, a lot."

22:30

"This is going to be a flipping game-changer. News, sports... can you imagine? You're CNN. You want that person, wearing a red tie..." "Yeah. But right now we've done 25,000. So 2.6 terabytes is 25,000 proxies." "Okay, well, let's try and find something." "Oh, hold on. Once you've generated a proxy, you have to then analyze it." "Right. Ah, so the analysis is not done?" "No, not even close. I've analyzed 22 clips." "Okay, everything with Elijah." "Elijah. And this is every clip that Elijah's in. And you can even see, this is so cool, this is the actual MAM, as they call it, media asset manager. The Axle AI guys built this before it was, like, AI, as far as I'm aware; back when you would have had to make comments like this manually. Now it's just AI. So all of the data is in here now, and we can see, here's Adam and Elijah." "Oh, that's so cool." "Here's all the different objects: chair, flower pot, microphone."

23:27

"Oh, let me show you the scene understanding thing, cuz that is so cool. This is a brand-new thing, they barely even worked it in, but it basically takes a snapshot every few seconds. Two men are working on a project in a room. There is a speaker, stereo equipment, there's a faucet, there's a tripod, there's the tripod. Some of these are a little less accurate: two men are working on a robot in a room." "It kind of looks like a robot. I mean, yeah, sure." "Two men are in a workshop looking at a laptop computer... looking at a machine. There is person: Alex Clark. So this is just running right now, in real time; more stuff is getting processed. See here: processing logos... there it is, processing logos and faces." "It's going to take a while." "Yeah, it's going to take forever. They're still working on making it function on multiple GPUs, so once we can get it running on, say, four GPUs, one GPU doing face detection, one doing scene analysis, one doing object detection, or something like that, we'll be able to go a lot faster. But right now it's just one GPU." "Got it. But this is so cool."

24:27

All that's left is to deploy it. Linus had to run away to do some other stuff, so I've hired some backup cavalry: Sean, our infrastructure administrator. Except we've run into a bit of a problem. Linus and me, in our infinite wisdom, while we were making this rack so much better, ran a bunch of cables right where we need to put the server. "Do we just start unplugging?" "No." "How are we even going to do this? We have to, like, part the seas." "Exactly. I started trying to move some of the cables out of the way, but they're all twisted together, so hopefully the LTT cable management thing, which you can finally get at lttstore.com, will save us." Beautiful. Cable managed. We can slide a server in there now. "I hope you're in." "Yeah, it's on... ow, ow, ow, okay." "You're good, just go." "That wasn't so bad. Like it was made for it." "Next." "Hey, we're in. Now we just have to run a million cables."

25:24

"Uh oh. Do you notice anything different?" "Well, it's loud." "Most of that's actually just the vent; one of the air conditioners is broken again. But do you notice anything different?" "I mean, the sticker's here." "That sticker's been there for years. Seriously, you haven't noticed anything else?" "Well, you guys screwed something onto the... oh, did you put Sonopan behind it?" "Yeah." "But I thought this was supposed to be a vented door." "My original plan was to get rid of the vent that you put in, but that vent was there as a backup in case the HVAC ever failed. So that fan is the exhaust, and that's the intake. You see all the gaps?" "God, there's gaps. But do you notice the sound difference?" "Yeah, it's a big difference." "It's huge." "But that server is so loud, we basically ended up where we started." "Yeah, but that's okay. I was just trying to normalize it; I just mean I didn't make it worse." "Okay, look at that. Woo! Cute, right?" "God, that's a lot of metal." "If all goes to plan, we could get rid of this, and this, and just have these. So, no more additional rack taken up, which is nice."

[Music]

26:27

"Wow. It should sustain two entire servers dropping out without anyone even noticing." "Do you really want to test it right now? Like, I haven't tried that." "All right, here we go. What could go wrong?" "Uh, I mean, a lot. The fact that all the fans just turned down a bit is a little scary." Let's go see if anyone noticed. "Oh, hi Mark." "Hi." "I'm holding your file server. How's your edit going?" "Uh, what?" "Is it working?" "It's working. Is this on Wi-Fi?" "Hey, Emily. How's your edit going? I'm holding your server." "That's cool." "Is it working? Are you sure?" "Yeah." "Hoffman, what's up? How's your edit going? This is your server, right here. It's amazing. Look, feel it, it's still warm." "Wow, yeah, it's still warm." "How's it working?" "It's great. You know, I'm editing the video that we're shooting." "You are?" "Yeah." "Uh, we're going to pull another one. Wait, no, you forgot one. Here's another one of your servers. Is it working?" "It's great, though." For reference, you're not supposed to do this. You should power off the system first. But we're just trying to simulate it failing. "Yeah, a terrible, catastrophic failure." "I can't believe how smoothly it handled that." "See all the lights? They never stopped blinking." Big thanks to Supermicro for these awesome servers, thanks to Weka for making this crazy software, and thanks to Axle AI for the awesome AI detection. If you like this video, maybe check out the video series of us building our nearly three petabytes of archival storage, which we call the Mother Vault. That thing is awesome, and we showed it to you, and it's faster now. Oh, and thanks to you, for being an awesome viewer.
