Kubernetes provides a way for containers in pods to consume block or file storage. There is also a COSI sub-project in SIG Storage working to add object storage support to Kubernetes. In this session, Xing talks about features that SIG Storage is currently designing and developing, and highlights a few items that the Data Protection Working Group has been working on.

This talk was given by Xing Yang, co-chair of CNCF TAG Storage and Kubernetes SIG Storage and Tech Lead for the Cloud Native Storage team at VMware, as part of DoK Day at KubeCon NA 2021; watch it below. You can access the other talks here.


Bart Farrell  00:00

I’d like to bring on Xing now, who’s going to be telling us what we can expect in the storage landscape. What’s coming? What’s new? What’s coming up? Xing, very nice to have you with us. 

Xing Yang  00:25

Hello, everyone. My name is Xing Yang. I work at VMware on the cloud native storage team, and I am a co-chair of CNCF TAG Storage and of Kubernetes SIG Storage. Today I will talk about what’s coming in Kubernetes storage. Here’s today’s agenda: I will go over the SIG Storage features that went GA in the 1.22 release and the features we are working on for the 1.23 release. I will also talk about what we are doing in the Data Protection Working Group in Kubernetes, as well as some cross-SIG projects. 

There are two GA features in 1.22. The first feature I’m going to talk about is CSI Windows support. It is challenging for CSI drivers to work on Windows because Windows containers cannot be privileged yet, while CSI drivers need to perform privileged operations such as mounting or formatting disks. The solution is a privileged proxy, named CSI proxy, that runs as a native Windows process on the Windows host. CSI proxy exposes gRPC APIs for executing privileged storage-related operations on Windows hosts on behalf of Windows containers such as CSI node plugins. Supported protocols include NTFS, SMB, and iSCSI. CSI node plugins need to use these APIs to support Windows. Another thing I want to mention is that the CSI proxy approach was designed based on the assumption that Windows containers can’t be privileged. However, there is now also an alpha version of privileged Windows containers, and the CSI Windows team is designing a solution to transition from CSI proxy to privileged containers. The goal is to make sure minimal changes are required for CSI drivers that already support CSI proxy to move to privileged containers. 

The next feature I’m going to talk about is the CSI service account token. This feature moved to GA in the 1.22 release. It allows CSI drivers to impersonate the pods that they mount volumes for. This improves the security posture of the mounting process without handing out unnecessary permissions to the CSI driver’s service account. This feature is especially important for secret-handling CSI drivers, such as the Secrets Store CSI Driver. Because these tokens can be rotated and short-lived, this feature also provides a way for CSI drivers to receive NodePublishVolume RPC calls periodically with a new token. 
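As a sketch of how a driver opts in to this feature, a CSIDriver object can request tokens via the `tokenRequests` field and ask the kubelet to periodically re-publish volumes with a fresh token via `requiresRepublish`. The driver name and audience below are illustrative, not from the talk:

```yaml
# Sketch: CSIDriver object opting in to service account tokens.
# Driver name and audience are examples only.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: secrets-store.csi.k8s.io
spec:
  tokenRequests:
    - audience: "vault"        # audience the token is issued for
      expirationSeconds: 600   # keep the token short-lived
  requiresRepublish: true      # kubelet periodically calls NodePublishVolume
                               # again so the driver receives refreshed tokens
```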

Next I want to talk about a few features that we are targeting for GA in the 1.23 release. The first two features are related to fsGroup. The first one is the skip volume ownership change feature. If there are lots of files on a volume, Kubernetes will do a recursive ownership change of all files on the volume every time the pod starts, and that can slow down pod startup. So we introduced a skip volume ownership change option: a user can opt out by setting a flag in the pod security context. This feature is targeting GA in 1.23. The second one is the CSI driver fsGroupPolicy. Sometimes fsGroup cannot be applied to a volume at all, for example if the volume is an NFS share with the root_squash option, so we have this fsGroupPolicy field for CSI drivers to opt in. This is also targeting GA in 1.23. The third one targeting GA in 1.23 is generic ephemeral volumes. This feature allows any existing driver that supports dynamic provisioning to be used as an ephemeral volume, with the volume’s lifecycle bound to the pod: a PVC is created as an ephemeral volume along with the pod. This PVC can be used to provide scratch storage that is separate from the root disk, and it is deleted when the pod is deleted. There is no need to write a special driver; all storage class parameters for dynamic provisioning are supported, and all features supported with PVCs are also supported, including storage capacity tracking, snapshotting, cloning, and volume resizing.
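A single pod manifest can illustrate both the ownership-change opt-out and a generic ephemeral volume. This is a minimal sketch; the pod name, image, and storage class are illustrative, not from the talk:

```yaml
# Sketch: fsGroupChangePolicy plus a generic ephemeral volume.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive ownership
                                           # change when the volume root
                                           # already has the right fsGroup
  containers:
    - name: app
      image: k8s.gcr.io/pause:3.5
      volumeMounts:
        - mountPath: /scratch
          name: scratch-volume
  volumes:
    - name: scratch-volume
      ephemeral:
        volumeClaimTemplate:           # a PVC is created with the pod
          spec:                        # and deleted when the pod is deleted
            accessModes: ["ReadWriteOnce"]
            storageClassName: "fast"   # hypothetical storage class
            resources:
              requests:
                storage: 1Gi
```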

Now I’m going to talk about a few features that are in beta in the 1.23 release. Some storage systems can apply fsGroup more efficiently by applying it to the whole volume, for example as a mount option. So we introduced a feature in 1.22 to delegate applying fsGroup to the CSI driver instead of the kubelet, and this is targeting beta in the 1.23 release. The second feature targeting beta in 1.23 is the volume populator. This feature allows you to create a PVC from any external data source, not just another PVC or a volume snapshot. For example, this can be used when restoring a volume from a backup. The third feature I want to mention here is CSI migration. There is an ongoing effort to migrate in-tree cloud provider volume plugins to CSI drivers, and this is already beta. In-tree cloud provider plugins are deprecated and targeted for removal in the 1.24 release. As shown here, we have AWS EBS, Azure Disk and File, OpenStack Cinder, GCE PD, and vSphere volume; these are all deprecated and targeted for removal in 1.24. The last feature on the slide is volume expansion. This feature has been in beta for quite some time now, and we have been working on fixing issues before moving it to GA. 
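The volume populator feature surfaces in the PVC API as the `dataSourceRef` field, which, unlike the older `dataSource` field, can reference an arbitrary custom resource. In this sketch the `BackupRepository` kind and its API group are hypothetical; a populator controller for that API would have to be installed in the cluster:

```yaml
# Sketch: restoring a PVC from a custom data source via a volume populator.
# Requires the AnyVolumeDataSource feature gate and a matching populator
# controller; the kind and apiGroup below are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSourceRef:
    apiGroup: backup.example.com   # hypothetical populator API group
    kind: BackupRepository         # hypothetical custom resource
    name: nightly-backup
```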

Now I want to talk about a few features targeting alpha in the 1.23 release. The first one is recovering from resize failure. This is one issue we’re trying to fix for the volume expansion feature: if the storage system can’t expand the volume to the requested size due to capacity constraints, there is currently no way to recover. There is a design proposal to address this by allowing the user to retry with a smaller value. This is targeting alpha in 1.23. The second one is a new access mode, the ReadWriteOncePod PV access mode. The CSI spec supports several access modes. Single node writer means that the volume can only be published once as read-write on a single node at any given time. But there was a problem with how this access mode is used during NodePublishVolume, which is when volume mounting happens. The CSI spec says that when NodePublishVolume is called a second time for a volume with the single node writer access mode and with a different target path, the plugin should return a failed precondition. For CSI plugins that strictly adhere to the spec, this guarantees a volume can only be mounted to a single target path, which means single node writer restricts access to a single pod on a single node. This behavior conflicts with the original definition, and because of it we did not have an access mode that represents multiple writers or multiple pods on the same node. To solve this problem, in CSI spec 1.5 we introduced two new access modes: single node single writer, which corresponds to the new PV access mode ReadWriteOncePod, and single node multi writer, which corresponds to the existing PV access mode ReadWriteOnce. If a CSI driver does not support these new access modes, the behavior does not change. The third one we’re going to talk about here is an alpha feature: volume health monitoring. 
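On the Kubernetes side, the new access mode is simply another value in the PVC’s `accessModes` list. A minimal sketch, assuming a CSI driver that supports the new CSI 1.5 access modes and the ReadWriteOncePod feature gate is enabled:

```yaml
# Sketch: a PVC that only a single pod, on a single node, may use.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-pod-pvc
spec:
  accessModes: ["ReadWriteOncePod"]  # vs. ReadWriteOnce, which allows
                                     # multiple pods on the same node
  resources:
    requests:
      storage: 5Gi
```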
Without volume health monitoring, Kubernetes has no knowledge of what happens to the underlying volumes on the storage system after PVCs are provisioned and used by a pod. With this feature, the CSI driver can communicate with the storage system, find out what happened to a volume, and report back to Kubernetes, so Kubernetes can report events on PVCs or pods if the volume condition becomes abnormal. In 1.22 we moved the volume health monitoring logic from the external node agent to the kubelet, and in 1.23 we are working on adding volume health status to metrics on the kubelet side. On the controller side, the external health monitoring controller has been alpha since the 1.19 release.  


There are a few more features targeting alpha in 1.23. The first one on the slide is COSI, the Container Object Storage Interface. This is a sub-project in Kubernetes SIG Storage trying to bring object storage support to Kubernetes. It is modeled after CSI but differs from it: CSI is for provisioning file and block storage, where a file or block volume is mounted in a pod so the pod can access the volume, while COSI is for provisioning object buckets, which cannot be mounted and are instead accessed over the network. Object storage also has finer-grained access control. COSI defines Kubernetes APIs to provision object buckets and grant access to them, and it provides gRPC interfaces so that a storage vendor can write a plugin to provision object buckets on their backend. The second feature listed here is preventing volumes from being leaked when deletion happens out of order. This means the PV reclaim policy should always be honored: if it is Delete, then the volume on the storage system should be deleted when the PV is deleted, regardless of whether the PVC or the PV is deleted first. This is targeting alpha in 1.23. 
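For context, the reclaim policy that the out-of-order-deletion fix is meant to honor lives on the PersistentVolume object. A sketch, with an illustrative driver name and volume handle:

```yaml
# Sketch: with reclaim policy Delete, the backing volume on the storage
# system should be removed whether the PVC or the PV is deleted first.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Delete  # vs. Retain, which keeps the volume
  csi:
    driver: csi.example.com   # illustrative driver name
    volumeHandle: vol-0123    # illustrative backend volume ID
```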


There are also a few other features that are still in the design or prototyping phase. I have included links for your reference, but I won’t go into details here. Next I want to talk about the Data Protection Working Group. This is a working group sponsored by both SIG Storage and SIG Apps. It was organized with the goal of providing a cross-SIG forum to discuss how to support data protection in Kubernetes, identify missing functionality, and work together to design the features needed to achieve that goal. I want to highlight two features that the working group has been working on. The first one is ContainerNotifier, which provides quiesce and unquiesce hooks. We need these hooks to quiesce an application before taking a snapshot and unquiesce it afterwards to ensure application consistency. We investigated how quiesce and unquiesce work for different types of workloads; they have different semantics, so we want to design a generic mechanism to run commands in containers. We currently have a proposal, the ContainerNotifier KEP, that has been submitted and is being reviewed. We are targeting alpha in the 1.24 release. 


The second one is changed block tracking (CBT). Without CBT, backup vendors have to do full backups all the time, which is not space-efficient, takes longer to complete, and needs more bandwidth. Another use case is snapshot-based replication, where you take snapshots periodically and replicate them to another site for disaster recovery purposes. Without CBT, we can either do full backups or call each storage vendor’s API individually to retrieve changed blocks, which is highly inefficient. We are currently working on the design for this feature. There are also other projects that SIG Storage co-owns with other SIGs; I have included a few links here for your reference.