This Server CANNOT Lose Data…

Linus Tech Tips
21 Mar 2024 · 27:57

TLDR: The video covers the development and deployment of a high-availability storage solution named 'Whonnock 10' for a video production team. Facing significant costs from server downtime, the team moves to a redundant system built on two 'Grand Twin A+' servers from Supermicro, each housing four independent compute nodes. The nodes are equipped with AMD EPYC Genoa processors and NVMe drives, and storage is managed by the WEKA file system. The system is designed to handle the team's extensive video projects with high throughput and low latency, and it opens the door to AI-assisted asset management and content analysis.

Takeaways

  • 🚀 The video production team requires fast and reliable storage due to the high volume of videos they produce.
  • 🌟 A new server, named 'Whonnock 10', has been developed to address the team's growing storage and redundancy needs.
  • 🔄 High availability (HA) in server systems means minimal to no downtime, which is crucial for the team's workflow and cost-effectiveness.
  • 💻 'Whonnock 10' is built from two Grand Twin A+ servers, each containing four independent compute nodes with powerful specifications.
  • 🔧 Each node has 384 GB of memory, a 64-core AMD EPYC Genoa processor, and multiple NVMe drives for high-speed data access.
  • 🔌 The server features redundant components, including power supplies and network cards, to ensure continuous operation even if one fails.
  • 📊 The team migrated their entire workflow to the new server without any issues, demonstrating its reliability and performance.
  • 🛠️ The software aspect of the project is critical for managing the vast storage and involves distributing data across multiple machines.
  • 🔍 The AI detection and analysis capabilities allow for efficient searching and organization of the extensive video footage archive.
  • 📈 The new server setup aims to improve the team's productivity by handling high-priority video projects with minimal latency.
  • 🎥 The video production team's new storage and server infrastructure is designed to handle their growing needs and ensure seamless operation.

Q & A

  • What is the main issue with the current storage and server setup described in the transcript?

    -The main issue is that the existing server, while reliable and of high quality, has become insufficient for the team's needs as they have grown. The problem is that all drives are in one single server, which poses a risk of significant downtime and financial loss in the event of server failure.

  • What is the significance of the term 'high availability' in the context of the script?

    -High availability refers to the design and redundancy of the server setup to ensure that it remains operational even in the event of hardware failures. The goal is to minimize downtime and maintain consistent access to data and services.

  • What are the key features of the Supermicro Grand Twin A+ server mentioned in the transcript?

    -The Supermicro Grand Twin A+ server features four independent computers within a 2U chassis, each with its own motherboard, 384GB of memory, an AMD EPYC Genoa processor with 64 cores, dual M.2 slots for redundant boot drives, and multiple PCIe Gen 5 slots for NVMe drives and network cards.

  • How does the new server setup aim to mitigate the risk of downtime?

    -The new server setup aims to mitigate downtime by incorporating redundancy at multiple levels. This includes having multiple servers, each with its own independent hardware, and using a high-availability file system, WEKA. Additionally, the setup includes redundant power supplies and network cards to ensure continued operation even if individual components fail.

  • What is the role of the ConnectX-6 200 Gbit cards from Nvidia in the new server setup?

    -The ConnectX-6 200 Gbit cards from Nvidia are used as network cards in the new server setup. They provide high-speed connectivity for data transfer, which is crucial for handling the large amounts of data involved in video editing and storage workloads.

  • What is the significance of the use of DDR5 ECC memory in the server configuration?

    -DDR5 ECC memory is used to provide higher performance and reliability. ECC (Error Correcting Code) memory can detect and correct bit errors during data transfer, which is essential for maintaining data integrity in a high-throughput storage environment.

  • What is the purpose of the WEKA (WekaIO) file system used in the new server setup?

    -WEKA (WekaIO) is a high-performance, scalable, and resilient file system designed for data-intensive workloads. It is used to handle the distribution of large amounts of data, such as video projects, Word documents, and Linux ISOs, across multiple machines in the server setup.

  • How does the new server setup handle the potential for 'split brain' scenarios in high availability systems?

    -The new server setup avoids split brain scenarios by running WEKA on at least six nodes, with eight recommended. With that many nodes, a disagreement or failure still leaves a majority that can act as a tiebreaker, maintaining data consistency and system availability.

  • What is the role of AI in the new server setup and how does it enhance the user experience?

    -AI is used in the new server setup for media asset management and scene understanding. It can analyze video footage to detect objects, logos, and scenes, making it easier for users to search for and access specific content. This greatly improves the efficiency and user experience of working with large volumes of video data.

  • How does the new server setup handle the transition of data from SSD to object storage?

    -The new server setup uses a technology called 'tiering', which automatically moves data from high-speed SSDs to object storage when the SSD capacity is full. Frequently accessed data remains on the SSDs, while less frequently accessed data is moved to object storage, optimizing performance and storage efficiency.

  • What is the significance of the cable management solution used in the new server setup?

    -The cable management solution is crucial for maintaining the efficiency and accessibility of the server setup. It helps to keep cables organized and reduces the risk of accidental disconnections or damage, which could lead to system downtime.
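As an aside on the split-brain answer above, the majority rule it describes can be sketched in a few lines. This is a minimal illustration only, not the file system's actual mechanism; the node names and the `has_quorum` and `partition_may_write` helpers are hypothetical.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """Return True if `reachable` nodes form a strict majority of the cluster."""
    return reachable > cluster_size // 2


def partition_may_write(partition: set, cluster: set) -> bool:
    """A network partition may accept writes only if it holds a strict majority."""
    return has_quorum(len(partition & cluster), len(cluster))


# Eight nodes, matching the two four-node chassis described in the summary.
cluster = {f"node{i}" for i in range(1, 9)}
side_a = {"node1", "node2", "node3", "node4", "node5"}
side_b = cluster - side_a

print(partition_may_write(side_a, cluster))  # 5 of 8 nodes
print(partition_may_write(side_b, cluster))  # 3 of 8 nodes
```

With an even split (four nodes on each side) neither half holds a strict majority, so neither would accept writes. That is the behaviour that prevents two halves from diverging.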

Outlines

00:00

🚀 Introduction to High Availability Storage Solution

The paragraph introduces the need for a fast and reliable storage solution due to the high volume of video production work. It discusses the limitations of the current server, 'Whonnock', which has been effective but is no longer sufficient as the team grows. The concept of high availability (HA) is introduced, emphasizing the importance of minimizing downtime. The solution involves adding redundancy and upgrading to a new system called 'Whonnock 10', which promises improved HA capabilities.

05:01

🛠️ Unveiling the New High Availability Server Configuration

This section delves into the technical specifics of the new server setup, highlighting its high availability features. It describes the transition to a more robust system with two 'Grand Twin A+' servers from Supermicro, each containing four independent compute nodes. The specs of each node are detailed, including 384GB of memory, a 64-core AMD EPYC Genoa processor, and multiple NVMe slots. The goal of the new setup is to ensure that even if a server fails, operations continue without disruption.

10:03

🔧 Customizing the Server for Optimal Performance

The paragraph focuses on the customization and assembly of the new servers, emphasizing the importance of selecting the right components for optimal performance. It discusses the decision-making process behind choosing the CPUs, the application of thermal paste, and the installation of memory. The paragraph also touches on the potential uses of the server beyond its primary function, such as running a high-availability Plex server or active directory services.

15:04

💡 Installing and Testing the New Storage System

This section describes the installation of boot drives and the internal structure of the new storage system. It explains the use of consumer-grade drives for booting the operating system and the plan for future storage expansion. The paragraph also covers the initial testing of the system, including the reaction of the team when a server is removed unexpectedly to demonstrate the high availability feature in action.

20:05

🌟 Impressive Performance and Future Plans

The paragraph highlights the impressive performance of the new storage system, particularly its ability to handle a high number of IOPS and maintain low latency. It discusses the potential use of the system for various types of workloads and the advantages of using the WEKA file system. The section also explores future plans for the storage system, including the use of AI for content analysis and the creation of a proxy map for efficient media access.

25:09

🔄 Handling Server Failure and Cable Management

The final paragraph discusses the practical aspects of managing server failure and cable organization. It describes a test where a server is removed to simulate a catastrophic failure and the system's seamless response. The paragraph concludes with a discussion on cable management solutions and the potential for further optimizing the server rack setup.

Keywords

💡High Availability

High Availability refers to the design and maintenance of systems to ensure they are accessible and operational as much as possible. In the context of the video, it is a critical requirement for the server setup, as any downtime can result in significant financial losses. The video describes a system that aims to achieve high availability by using redundant components and configurations, allowing for continued operation even if individual servers or components fail.
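To make the cost of downtime concrete, an availability percentage can be converted into expected hours of downtime per year. The sketch below is illustrative only: the availability values and the hourly cost are hypothetical inputs, not figures from the video.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for an availability between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def downtime_cost_per_year(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime, given a hypothetical cost per hour."""
    return downtime_hours_per_year(availability) * cost_per_hour


# Hypothetical availability levels and a hypothetical cost of 1000 per hour.
for availability in (0.99, 0.999, 0.9999):
    hours = downtime_hours_per_year(availability)
    cost = downtime_cost_per_year(availability, cost_per_hour=1000.0)
    print(f"{availability:.2%} available: {hours:.2f} h/year, cost {cost:.0f}")
```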

💡Redundancy

Redundancy in the context of this video refers to the duplication of critical components or systems to ensure their continued operation in the event of a failure. It is a key aspect of the server design, with the goal of mitigating the risk of downtime. The video describes having multiple drives, servers, and network cards, which provide backup and failover capabilities to maintain system performance.
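A rough way to see why duplication helps: if redundant components fail independently and any one of them is enough to keep the service running, their combined availability follows a simple formula. The sketch below assumes independent failures, which real hardware only approximates, and the 99% figure is a hypothetical input.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` independent components when any one suffices.

    The group is down only if every copy is down at the same time, so the
    combined availability is 1 - (1 - a) ** copies.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical example: a power supply that is available 99% of the time.
for copies in (1, 2, 3):
    print(copies, f"{combined_availability(0.99, copies):.6f}")
```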

💡Whonnock 10

Whonnock 10 in the video refers to the final form of the team's server setup, built around high availability. It represents a system designed to handle the demands of a growing team and the need for minimal downtime, sustaining operation even if parts of it are unplugged or fail.

💡Supermicro

Supermicro is the manufacturer of the servers used in the video. The company provided the Grand Twin boxes that form the backbone of the high availability setup. Known for its server components and solutions, Supermicro's products are used here to create a powerful and reliable server infrastructure.

💡NVMe

NVMe stands for Non-Volatile Memory Express, a protocol designed to take full advantage of the speed and performance capabilities of solid-state drives (SSDs). In the video, NVMe is used in the context of high-speed storage, which is crucial for the server's ability to handle large volumes of data quickly and efficiently.
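The Q&A above describes fast SSD storage being backed by slower object storage, with colder data moved off the SSDs when they fill up. A minimal sketch of that idea follows, using a least-recently-used policy. The `TieredStore` class and its policy are assumptions for illustration, not the file system's actual implementation.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast tier backed by a large slow tier."""

    def __init__(self, ssd_capacity: int) -> None:
        self.ssd_capacity = ssd_capacity   # fast-tier capacity in bytes
        self.ssd = OrderedDict()           # name -> data, least recently used first
        self.object_store = {}             # name -> data, slow tier

    def _used(self) -> int:
        return sum(len(data) for data in self.ssd.values())

    def _demote_until_fits(self, incoming: int) -> None:
        # Move the least recently used items to object storage until there is room.
        while self.ssd and self._used() + incoming > self.ssd_capacity:
            name, data = self.ssd.popitem(last=False)
            self.object_store[name] = data

    def write(self, name: str, data: bytes) -> None:
        if len(data) > self.ssd_capacity:
            raise ValueError("item is larger than the fast tier")
        self.ssd.pop(name, None)
        self.object_store.pop(name, None)
        self._demote_until_fits(len(data))
        self.ssd[name] = data

    def read(self, name: str) -> bytes:
        if name in self.ssd:
            self.ssd.move_to_end(name)     # mark as recently used
            return self.ssd[name]
        data = self.object_store.pop(name)  # raises KeyError for unknown names
        self._demote_until_fits(len(data))
        self.ssd[name] = data               # promote back to the fast tier
        return data
```

Reading an item that has been demoted promotes it back to the fast tier, which may in turn push something colder out.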

💡WEKA

WEKA is the software file system used in the video to manage and distribute data across the server cluster. It is specifically built for NVMe drives and is known for its ability to handle large amounts of data with low latency, making it ideal for high-performance computing environments.
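The general idea of spreading data across nodes so that a lost node can be rebuilt can be illustrated with simple XOR parity. This is a deliberately simplified stand-in: single parity survives only one lost chunk, whereas the setup described here is said to tolerate two servers dropping out, which requires a stronger erasure code. The function names are hypothetical, and the summary does not describe the file system's actual protection scheme.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_with_parity(data: bytes, data_nodes: int) -> List[bytes]:
    """Split data into `data_nodes` equal chunks and append one XOR parity chunk."""
    chunk_len = -(-len(data) // data_nodes)              # ceiling division
    padded = data.ljust(chunk_len * data_nodes, b"\0")   # zero-pad the last chunk
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(data_nodes)]
    return chunks + [reduce(xor_bytes, chunks)]


def rebuild_missing(chunks: List[Optional[bytes]]) -> List[bytes]:
    """Recover exactly one missing chunk (marked None) by XOR-ing the survivors."""
    missing = [i for i, chunk in enumerate(chunks) if chunk is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing chunk")
    survivors = [chunk for chunk in chunks if chunk is not None]
    restored = list(chunks)
    restored[missing[0]] = reduce(xor_bytes, survivors)
    return restored


# Eight nodes: seven data chunks plus one parity chunk.
stripe = stripe_with_parity(b"video project data", data_nodes=7)
damaged = list(stripe)
damaged[3] = None                          # simulate one node dropping out
print(rebuild_missing(damaged) == stripe)
```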

💡AMD EPYC

AMD EPYC is a series of high-performance processors from AMD designed for server and data center environments. The video mentions the use of AMD EPYC Genoa processors, which are known for their large core count and high memory capacity, making them suitable for demanding tasks such as video editing and data-intensive workloads.

💡DDR5

DDR5 is the latest generation of Double Data Rate (DDR) Synchronous Dynamic Random-Access Memory (SDRAM), offering higher bandwidth and improved power efficiency compared to its predecessors. In the video, DDR5 is used to equip the servers with fast and efficient memory, which is critical for handling large data sets and ensuring smooth operation.
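The memory in these servers is ECC memory, which can detect and correct bit errors. The principle can be shown with a toy Hamming(7,4) code, which protects four data bits with three parity bits and corrects any single flipped bit. Real ECC DIMMs protect much wider words with different codes, so this is only a sketch of the idea.

```python
def hamming74_encode(data_bits: list) -> list:
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: list) -> tuple:
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 1-based index of the flipped bit
    if error_position:
        c[error_position - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                             # simulate a single bit flip in memory
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```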

💡OCP

OCP (Open Compute Project) is an industry initiative that publishes open hardware design specifications, aimed at improving the efficiency and scalability of data center infrastructure. In the video, OCP is referenced in relation to the OCP 3.0 small form factor mezzanine slots, which are used to connect high-speed network cards in the server setup.

💡AI Detection

AI Detection refers to the use of artificial intelligence algorithms to identify and classify objects, faces, or scenes within digital content. In the video, AI detection is employed to analyze and catalog vast amounts of footage, making it easier to search and retrieve specific clips based on content.
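Once footage has been labeled, searching it comes down to a lookup from label to clips. A minimal sketch of such an index follows, assuming the detection step has already produced a list of labels per clip. The clip names, labels, and helper functions are hypothetical.

```python
from collections import defaultdict


def build_label_index(detections: dict) -> dict:
    """Map each lowercased label to the set of clips in which it was detected."""
    index = defaultdict(set)
    for clip, labels in detections.items():
        for label in labels:
            index[label.lower()].add(clip)
    return index


def search(index: dict, *labels: str) -> set:
    """Return the clips that contain every one of the requested labels."""
    matches = [index.get(label.lower(), set()) for label in labels]
    return set.intersection(*matches) if matches else set()


# Hypothetical output of a detection pass over three clips.
detections = {
    "clip_a.mov": ["laptop", "logo"],
    "clip_b.mov": ["Laptop"],
    "clip_c.mov": ["server"],
}
index = build_label_index(detections)
print(sorted(search(index, "laptop")))
print(sorted(search(index, "laptop", "logo")))
```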

💡Proxies

Proxies in the context of the video are lower-resolution copies of the original high-resolution video files that are generated for the purpose of faster analysis and editing. They allow for efficient processing by AI algorithms and reduce the computational load on the system.
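One common way to generate proxies is with ffmpeg. The sketch below only builds the command line and leaves running it (for example with `subprocess.run`) to the caller. The 720p target, the codec settings, and the paths are assumptions for illustration; the video does not state which tool or settings the team uses.

```python
from pathlib import Path


def proxy_command(source: Path, proxy_dir: Path, height: int = 720) -> list:
    """Build an ffmpeg command that writes a lower-resolution H.264 proxy."""
    output = proxy_dir / f"{source.stem}_proxy.mp4"
    return [
        "ffmpeg",
        "-i", str(source),
        "-vf", f"scale=-2:{height}",  # keep aspect ratio, force an even width
        "-c:v", "libx264", "-preset", "fast", "-crf", "28",
        "-c:a", "aac", "-b:a", "128k",
        str(output),
    ]


# Hypothetical paths, printed rather than executed.
command = proxy_command(Path("/footage/shoot01.mov"), Path("/proxies"))
print(" ".join(command))
```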

Highlights

The team's need for fast and reliable storage led to the creation of a new server called 'Whonnock 10'.

Whonnock 10 is designed for high availability, ensuring minimal downtime and maintaining productivity.

The existing server is built from high-quality components, including 20 drives with data striped across them with redundancy.

Supermicro sponsored the project and provided two Grand Twin A+ servers, each containing four independent computers.

Each node has 384GB of memory, a 64-core AMD EPYC Genoa processor, and dual M.2 slots for redundant boot drives.

The servers are set up with WEKA, a redundant NVMe-first file system that can sustain two entire servers dropping out without users noticing.

The team moved their entire operation onto the new server without any issues, demonstrating its reliability during a busy workday.

The server's design includes a modular backplane assembly and two 2200-watt, 80 Plus Titanium power supplies for high performance.

In the event of a server failure, the system should continue to operate uninterrupted, aligning with the high availability goal.

The server features OCP 3.0 small form factor mezzanine slots with ConnectX-6 200 GbE cards from Nvidia for networking.

The team is considering using the extra CPU cores for tasks like running a high-availability Plex server or Active Directory.

The server's CPU coolers use Honeywell PTM7950, a phase-change thermal interface material, for excellent heat transfer and reliability.

The server is populated with 32GB DIMMs of DDR5 ECC memory, totaling 384GB per node.

For storage, two speedy Kioxia CD6 Gen 4 NVMe drives are installed in each node, with plans to switch to larger drives in the future.

WEKA supports live upgrading and downgrading of drives, allowing for easy maintenance and expansion.

The WEKA dashboard provides a clear overview of the cluster's performance and health, with each drive getting a dedicated core for optimal performance.

The server demonstrated an impressive 4 million read IOPS at an average latency of 1 millisecond, showcasing its potential for high-speed data access.

The team is exploring the use of AI for efficient video editing and content management, leveraging the server's capabilities for proxy generation and scene analysis.

The server's design and configuration are expected to significantly improve the team's workflow and efficiency in handling large video projects.
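The IOPS and latency figures quoted in the highlights can be related through Little's law: the average number of requests in flight equals throughput multiplied by average latency. The sketch below applies it to the quoted numbers, assuming they are steady-state averages and that load is spread evenly over the eight nodes. Neither assumption is confirmed by the video.

```python
def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Average number of in-flight I/Os implied by Little's law (L = lambda * W)."""
    return iops * latency_ms / 1000.0


total_in_flight = outstanding_ios(iops=4_000_000, latency_ms=1.0)
per_node = total_in_flight / 8  # assumes an even spread across eight nodes

print(f"in flight across the cluster: {total_in_flight:.0f}")
print(f"in flight per node: {per_node:.0f}")
```

In other words, sustaining those numbers implies thousands of requests in flight at once, which is why benchmarks of this kind rely on many parallel clients or deep queues.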