Disaggregated Container Attached Storage – Yet Another Topology with What Purpose

Container native storage is designed for Kubernetes and managed by Kubernetes. Ideally, it’s based on disaggregated NVMe storage for scalability and performance. 

DataCore Chief Scientist Nick Connelly shares the basics of disaggregated storage and how it offers scalability advantages for Kubernetes. This talk was given at DoK Day EU 2022; you can access all the talks here.

 

Nick Connelly  00:00

My name is Nick Connelly, and I’m the Chief Scientist at DataCore Software. In this talk, we’ll run through a brief history of storage to provide some context. Then we’ll look at disaggregated storage and its relevance to container-native storage. OpenEBS, and in particular the Mayastor engine, will give us practical examples before we draw some conclusions. Long ago, disks were primarily installed inside servers and connected with SCSI, ATA, or SAS. For magnetic media, you’d get somewhere between 5-200 MB/sec, with a latency of around 10 msec. Data services were either provided by the operating system, perhaps in the file system layer, or by a hardware RAID controller. A hardware RAID controller offloads the host system and provides performance and protection for the data. The physical storage is aggregated and then striped to increase performance by keeping all the spindles busy. Data protection is provided by mirroring or by a parity-based scheme to reduce costs. Storage is then presented as logical disks to the host.
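
To make the striping arithmetic concrete, here is a minimal sketch, in Rust, of how a logical block address might be mapped round-robin across a set of spindles. The function and parameter names are illustrative assumptions, not any particular RAID controller’s implementation.

```rust
/// Minimal RAID-0 style address mapping: a logical block address (LBA)
/// is spread round-robin across `disks` in chunks of `stripe_blocks`.
/// This is an illustrative sketch, not a real controller's layout.
fn stripe_location(lba: u64, disks: u64, stripe_blocks: u64) -> (u64, u64) {
    let stripe_index = lba / stripe_blocks;     // which stripe the block falls in
    let offset_in_stripe = lba % stripe_blocks; // position inside that stripe
    let disk = stripe_index % disks;            // round-robin over the spindles
    let disk_lba = (stripe_index / disks) * stripe_blocks + offset_in_stripe;
    (disk, disk_lba)
}

fn main() {
    // With 4 disks and 128-block stripes, logical block 1000 lands on disk 3.
    let (disk, disk_lba) = stripe_location(1000, 4, 128);
    println!("logical LBA 1000 -> disk {disk}, disk LBA {disk_lba}");
}
```

Mirroring would instead write the same block to two spindles, while parity schemes add a computed block per stripe so any single failed spindle can be reconstructed.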

There are a limited number of disk slots inside a server, so an expansion chassis called a Just a Bunch of Disks (JBOD) is often used. As the name suggests, it’s simply a bunch of disks, and there’s not usually a RAID controller in the chassis. The JBOD can be dual-ported, which gives you two connections to the storage, but this is primarily for redundancy in the event of a path or controller failure. It doesn’t automatically give you shared storage; if that’s the goal, use a storage area network instead. The development of fast networks means that storage can be shared over FibreChannel or iSCSI with minimal impact on latency and throughput. Centralized pooling provides easier management, better capacity utilization, and the potential for data services implemented in storage arrays.

A storage array connects multiple disks to the network as a single unit. Arrays are designed to be highly available and reliable: they include redundant paths and controllers and protect the data with some level of RAID. They’re managed independently of the servers they connect to and have their own control plane. Typically, data services such as caching, thin provisioning, HSM, deduplication, compression, encryption, and snapshots will be provided. Servers are now fast enough that the custom hardware in an array can be replaced with a general-purpose server and a stack implementing software-defined storage. This means that storage is now a software problem; it’s just another application. Software-defined storage stacks can run on a variety of platforms, whether physical, virtual, or containerized. Hyper-converged infrastructure combines software-defined storage with virtualization, software-defined networking, and a management plane. It’s the integration of storage into the virtual environment. It’s often composed of identical nodes where the storage is included locally for easy scalability, but the one-size-fits-all approach can limit flexibility when expanding the installation.
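
To illustrate one of those data services, the sketch below shows the idea behind thin provisioning: physical extents are only taken from the pool the first time a virtual extent is written, so a large logical volume consumes space lazily. The types and the extent scheme here are assumptions made for the example, not how any particular array implements it.

```rust
use std::collections::HashMap;

/// A thinly provisioned volume: virtual extents map to physical extents
/// only after they have been written at least once. Purely illustrative.
struct ThinVolume {
    extent_map: HashMap<u64, u64>, // virtual extent -> physical extent
    next_free: u64,                // next unallocated physical extent in the pool
}

impl ThinVolume {
    fn new() -> Self {
        Self { extent_map: HashMap::new(), next_free: 0 }
    }

    /// Resolve a write: allocate a physical extent on first touch,
    /// reuse the existing mapping on every write after that.
    fn write_extent(&mut self, virtual_extent: u64) -> u64 {
        if let Some(&phys) = self.extent_map.get(&virtual_extent) {
            return phys;
        }
        let phys = self.next_free;
        self.next_free += 1;
        self.extent_map.insert(virtual_extent, phys);
        phys
    }
}

fn main() {
    let mut vol = ThinVolume::new();
    // A sparse write pattern only consumes two physical extents,
    // however large the logical address space is.
    vol.write_extent(0);
    vol.write_extent(1_000_000);
    println!("physical extents in use: {}", vol.extent_map.len());
}
```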

NVMe is a real paradigm shift. It’s a faster protocol designed for high-speed flash, with simpler storage commands and memory-based command and response queues. The protocol supports up to 64,000 queues for lock-free access on systems with a high core count that would otherwise suffer from bottlenecks due to heavy contention. Performance, especially with Optane drives, is outrageous: throughput of 3-7 GB/sec and latency of about 10 usec. There’s just no comparison with magnetic media; it’s orders of magnitude faster. Network latency is now significant compared with disk performance, and this drives the trend back towards local storage. But the same issues remain: siloed capacity reducing overall utilization, and a lack of centralized data services. NVMe over Fabrics extends NVMe across the network with a design goal of adding no more than 10 usec of additional latency. It supports several transports, including FibreChannel, RDMA, InfiniBand, and TCP, and allows access to remote NVMe-based disks as if they were local. Storage arrays are starting to support NVMe over Fabrics.
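
To see why that many queues matter, here is a deliberately simplified sketch of the queue-pair-per-core idea: each core owns its own submission/completion pair, so the I/O path never contends on a shared lock. The QueuePair type below is a fake stand-in for illustration; real NVMe queues are memory-mapped rings shared with the device.

```rust
use std::collections::VecDeque;
use std::thread;

/// A highly simplified stand-in for an NVMe submission/completion queue pair.
/// The point is the ownership model: one pair per core, no cross-core locking.
struct QueuePair {
    submission: VecDeque<u64>, // command identifiers waiting on the "device"
    completed: u64,            // completions reaped so far
}

impl QueuePair {
    fn new() -> Self {
        Self { submission: VecDeque::new(), completed: 0 }
    }

    fn submit(&mut self, command_id: u64) {
        self.submission.push_back(command_id);
    }

    /// Pretend the device has finished everything that was submitted.
    fn poll_completions(&mut self) {
        let reaped = self.submission.drain(..).count() as u64;
        self.completed += reaped;
    }
}

fn main() {
    let cores: u64 = 4;
    let mut handles = Vec::new();
    for core in 0..cores {
        // Each "core" owns its queue pair outright, so submission and
        // completion never take a lock shared with another core.
        handles.push(thread::spawn(move || {
            let mut qp = QueuePair::new();
            for i in 0..1000u64 {
                qp.submit(core * 1000 + i);
            }
            qp.poll_completions();
            qp.completed
        }));
    }
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("completions across all queue pairs: {total}");
}
```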

A new storage model called disaggregated storage is emerging, based on the near-local performance of NVMe over Fabrics. Individual disks are dynamically mapped across the network to remote hosts, avoiding the I/O blender effect and enabling per-workload design and scalability. Disaggregated storage offers the capacity utilization and manageability of non-local storage. Data services either have to be implemented at the host, or they require full virtualization of the storage with a high-speed storage stack. One thing to watch out for here is that the storage is now so fast that one or two disks can saturate the network. New hardware designs allow the disaggregated storage stack to be offloaded to an IPU or DPU in a similar way to a RAID controller. These PCIe boards contain an array of ARM cores, onboard RAM, and high-speed networking, and have the ability to present an NVMe disk to the local host. Data services can be implemented on the board. IPUs and DPUs are currently being targeted at cloud service providers, but we may get to see them used more widely in the future. Whether it’s running on an IPU, a DPU, or on the host system, how do you build a disaggregated storage stack? A good starting point is the Storage Performance Development Kit (SPDK), a well-regarded storage stack that can deliver the full potential of Optane. It consists of a set of tools and libraries for writing high-performance, scalable, user-mode storage applications. It’s cutting edge, leverages the latest NVMe features, and uses polling with a lockless, thread-per-core design. It’s production-ready, portable, flexible, and open source with a permissive license. What’s not to love about it?
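
The sketch below shows the polling, thread-per-core pattern described above in a generic form. It deliberately avoids the real SPDK APIs; the Poller trait and the reactor loop are assumptions made purely for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// A stand-in for whatever a poller would actually check: an NVMe completion
/// queue, a socket, a timer. This does not use the real SPDK interfaces.
trait Poller: Send {
    /// Process any pending work; return how many events were handled.
    fn poll(&mut self) -> usize;
}

struct FakeCompletionQueue {
    pending: usize,
}

impl Poller for FakeCompletionQueue {
    fn poll(&mut self) -> usize {
        let done = self.pending.min(32); // reap at most one batch per pass
        self.pending -= done;
        done
    }
}

/// One reactor per core: spin over the registered pollers instead of
/// sleeping on interrupts, so the hot path never blocks or takes a lock.
fn run_reactor(mut pollers: Vec<Box<dyn Poller>>, shutdown: Arc<AtomicBool>) -> usize {
    let mut events = 0;
    while !shutdown.load(Ordering::Relaxed) {
        for poller in pollers.iter_mut() {
            events += poller.poll();
        }
    }
    events
}

fn main() {
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = shutdown.clone();
    let pollers: Vec<Box<dyn Poller>> =
        vec![Box::new(FakeCompletionQueue { pending: 10_000 })];
    let reactor = thread::spawn(move || run_reactor(pollers, flag));
    thread::sleep(Duration::from_millis(10));
    shutdown.store(true, Ordering::Relaxed);
    println!("events processed: {}", reactor.join().unwrap());
}
```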

Now, this is great, but what does it have to do with Kubernetes? We need to find a storage model that’s a good fit for a scalable containerized architecture with a declarative control plane. Let’s call it container-attached or container-native storage. It’s the integration of software-defined storage into Kubernetes, in much the same way as hyper-converged infrastructure integrates storage for virtualization. Container-native storage is designed for Kubernetes and managed by Kubernetes. Ideally, it’s based on disaggregated NVMe storage for scalability and performance. Replicas are essential to provide persistent data on ephemeral infrastructure. Let’s look at an example of this in practice. OpenEBS is the leading open source container-native storage solution. It’s a CNCF sandbox project with a vibrant community that has recently released Mayastor 1.0, a production-ready NVMe-based storage engine. Mayastor includes a high-performance NVMe and NVMe over Fabrics stack built using SPDK, and a control plane that is designed for Kubernetes. It delivers disaggregated N-way synchronous replicas with automated replacement and a controller per volume. It’s storage agnostic, supporting basic pooling from heterogeneous storage, and is designed for running at scale. It’s portable, with no kernel components, and it’s implemented in Rust for memory safety. In this example, the persistent volume presented by the Mayastor engine is mirrored to two other nodes using NVMe over Fabrics, creating three synchronous replicas of the data.
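
As a rough sketch of what synchronous replication means on the data path, and not Mayastor’s actual code, the example below only acknowledges a write after every replica has confirmed it. The Replica trait and the in-memory target are illustrative assumptions, with the NVMe over Fabrics transport abstracted away.

```rust
/// A replica target that a write can be fanned out to. In a real system the
/// transport would be NVMe over Fabrics; here it is abstracted away.
trait Replica {
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<(), String>;
}

struct InMemoryReplica {
    storage: Vec<u8>,
}

impl Replica for InMemoryReplica {
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<(), String> {
        let start = offset as usize;
        let end = start + data.len();
        if end > self.storage.len() {
            return Err("write past end of replica".to_string());
        }
        self.storage[start..end].copy_from_slice(data);
        Ok(())
    }
}

/// N-way synchronous replication: the volume only acknowledges the write
/// after every replica has acknowledged it, so all copies stay identical.
fn replicated_write(
    replicas: &mut [Box<dyn Replica>],
    offset: u64,
    data: &[u8],
) -> Result<(), String> {
    for replica in replicas.iter_mut() {
        replica.write(offset, data)?; // any single failure fails the whole write
    }
    Ok(())
}

fn main() {
    // Three synchronous replicas, as in the example from the talk.
    let mut replicas: Vec<Box<dyn Replica>> = (0..3)
        .map(|_| Box::new(InMemoryReplica { storage: vec![0; 4096] }) as Box<dyn Replica>)
        .collect();
    replicated_write(&mut replicas, 512, b"hello").unwrap();
    println!("write acknowledged by all {} replicas", replicas.len());
}
```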

What conclusions can we draw about the disaggregated storage model? Combined with NVMe over Fabrics in a container native storage solution, it can offer significant scalability and performance advantages for Kubernetes. If you want to get started with this technology, I’d recommend experimenting with the OpenEBS Mayastor engine. Thank you so much for listening.