Data on Kubernetes Day Europe 2024 talks are now available for streaming!


Kanister & Kopia – An Open-Source Data Protection Match Made in Heaven

During this talk, Pavan Navarathna demonstrates how two open-source tools, Kanister and Kopia, work together to optimize backup and recovery for Kubernetes applications. Kanister lets domain experts capture application-specific workflows in Blueprints that can be extended and shared. Kopia uses state-of-the-art encryption algorithms (AES-256 or ChaCha20) and compresses data to save bandwidth and storage.


Today I’ll be talking about two open-source projects, Kanister and Kopia, and how they work together to enable backup and recovery of stateful applications on Kubernetes. I’m Pavan Navarathna, an Engineering Manager here at Kasten by Veeam. My team and I are focused on finding creative solutions to data management problems on Kubernetes. I’m a maintainer of the Kanister project and a contributor to Kopia whenever an opportunity arises. In my previous DoK talk in LA, I covered the importance of data management and the different flavors available for protecting cloud-native applications. Let’s have a brief recap of these approaches.

As we all know, cloud-native applications comprise various components, including persistent storage volumes, databases, and their corresponding Kubernetes resources. Picture a representative example with Postgres, MySQL, and their corresponding storage volumes; the whole thing makes up a cloud-native app. The simplest way to protect these apps is to take storage-centric snapshots, provided by the underlying file or storage provider. In most cases, they are crash-consistent and the fastest option available for data protection. However, storage-centric snapshots don’t talk to the application at all: the database or application layer is unaware of the captured snapshots, so they are not application-consistent. To overcome this, we can use APIs provided by these databases or data services to freeze and flush the application during the snapshot process. Even then, the actual snapshots are still storage-centric and have the same capabilities as plain storage-centric snapshots.

The third type is the data service-centric approach, where the databases’ own tools, such as mysqldump for MySQL or pg_dump for PostgreSQL, are used to take logical backups of the database. This approach provides database-level consistency, and the data is out-of-band, which means you could potentially restore it on a different kind of storage. Generally, though, restores for these data service-centric snapshots are complex. Each of the approaches we discussed has its pros and cons in terms of speed, consistency, and cost. The optimal strategy for a given cloud-native application depends on the needs of the application and the infrastructure capabilities available to us. Each cloud-native application includes different domains or components. Each component has different owners, and they have their own requirements and concerns; none of them can be overlooked when protecting the application. There are different types of backups that these component owners may need. Some might use logical backups of the databases, volume snapshots, or even something provider-specific; if we are using RDS-hosted Postgres, they may need RDS snapshots. Some applications also need a way to scale up or down during the snapshot process. These are choices that we need to make.
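As a concrete sketch of the data service-centric approach, a logical PostgreSQL backup and restore might look like the following. The database name and file path here are illustrative assumptions, not from the talk:

```shell
# Logical, database-consistent backup with pg_dump (hypothetical names).
pg_dump --format=custom --file=app_db.dump app_db

# The dump is out-of-band, so it can be restored onto entirely different
# storage later -- but the restore is a separate, typically slower step.
pg_restore --dbname=app_db app_db.dump
```

These commands require a running PostgreSQL instance, which is part of why such workflows benefit from orchestration rather than being run by hand.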

Finally, there are also different types of targets for the backups. Once the backups are taken, you may find yourself asking, “Where do we store it?” or “How do we move it to a different or external storage?”. Given these complex workflows and different moving parts, a comprehensive data protection tool must meet a few main requirements. First, the tool has to be application-centric. The main focus should be on the business continuity of an application: keeping the application always running with its data intact. The tool has to provide a way to coordinate and implement all the complex workflows in an application-centric manner.

We know backups can be easy to take with any of the methods we discussed, but the tool should also make restores easy. Further, users may have compliance requirements, so the tool has to ensure that the required minimum number of snapshots is retained. For any tool, security and reliability are of utmost importance. The tool has to provide a way to authenticate and authorize who may modify or capture snapshots of applications on Kubernetes. And when data is stored on, or moved out of, the Kubernetes cluster to external storage, it must always be encrypted to guard against breaches.

When we store backup data, the tool has to use storage space efficiently and transfer that data efficiently to external storage if required. There are also the compute costs of running the tool itself, so the tool should be frugal with compute resources on the Kubernetes cluster. Finally, there is freedom of choice: the tool has to support different storage backends, Kubernetes platforms, and whatever else the user desires. Does such a tool exist? Absolutely. It comes in the form of a combination of two open-source projects, Kanister and Kopia.

Kanister is an open-source framework purpose-built for application-level data protection on Kubernetes. It is implemented mainly as a Kubernetes controller based on the operator pattern. We can define and execute database- or application-specific workflows using a set of Kubernetes CRDs: workflows are defined in Blueprints and executed using ActionSets, while Profiles define the storage targets used to store the backups. Further, Kanister is secured by Kubernetes RBAC, and the rules can be defined while installing the controller itself; it ships as a Helm chart, so it’s easy to install and to define these rules. The workflows in Blueprints are composed of Kanister functions, which are easily extensible and shareable. Once we define a Blueprint for a particular workload, StatefulSet, or Deployment, we can reuse it across multiple workloads.
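To make this concrete, here is a minimal sketch of what a Blueprint can look like. The phase name, image, and command are illustrative assumptions, not taken from the talk; tested Blueprints for common databases live in the Kanister repository.

```yaml
apiVersion: cr.kanister.io/v1alpha1
kind: Blueprint
metadata:
  name: postgres-blueprint
actions:
  backup:                        # invoked by an ActionSet with action "backup"
    phases:
      - func: KubeTask           # Kanister function that runs a task pod
        name: pgDump
        args:
          image: postgres:14     # hypothetical tool image
          command:
            - bash
            - -c
            - pg_dump mydb > /backup/mydb.sql   # illustrative backup step
```

In a real Blueprint, a phase like this would stream the dump to the target defined by the Profile rather than writing to a local file.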

There are also a number of qualified Blueprints covering widely used databases in Kanister. While Kanister handles all the application-level data operations, we can use Kopia to manage the backup data efficiently. Kopia is an open-source tool that manages file system snapshots in remote storage, which it calls a repository. It is secure and reliable: all data, whether transferred or stored locally, is always encrypted. The backup data written to a repository is immutable whenever the storage backend supports it, meaning the data cannot be accidentally or maliciously modified or deleted. Further, Kopia saves storage space and bandwidth during data transfers by supporting content-based data deduplication: backup data uploaded once is never re-uploaded, so if the same content appears in later backups, it is not transferred again. It also supports data verification and configurable compression. For freedom of choice, it supports a variety of storage backends where the backups can be stored: Amazon S3, GCS, Azure, and many others.
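As a rough sketch, the Kopia workflow from the CLI looks like this; the bucket name and local path are hypothetical, and the commands need real credentials to run:

```shell
# Create an encrypted repository in S3; Kopia prompts for a
# repository password that protects the encryption keys.
kopia repository create s3 --bucket=my-backups

# Enable compression for snapshots via policy.
kopia policy set --global --compression=zstd

# Take a snapshot; identical content is deduplicated and never re-uploaded.
kopia snapshot create /data/app

# List the snapshots recorded for that path.
kopia snapshot list /data/app
```

This is the same machinery a Kanister function can drive programmatically instead of interactively.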

Now, let’s take a look at how Kanister and Kopia work together. Assume a database is deployed on the same cluster as the Kanister controller, and that we have installed a Blueprint specific to that database. First, we create an ActionSet CRD, which defines which action should be performed. On its creation, the Kanister controller reacts, discovers the Blueprint that the ActionSet points to, picks up the backup workflow, and interacts with the database workload using a Kanister function. That Kanister function uses Kopia during its execution: as the function runs, it takes a database backup and uses Kopia to upload it to the object store (or other target) defined by the Profile referenced in the ActionSet. A Kopia backup repository is created at that location. Once all the stages and phases run, the Kanister controller saves the Kopia snapshot information in the ActionSet status, and it can later use this information to restore the data from the Kopia repository.
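An ActionSet that triggers the backup action of such a Blueprint might look roughly like this; all names are illustrative, and the namespaces and Profile would match your installation:

```yaml
apiVersion: cr.kanister.io/v1alpha1
kind: ActionSet
metadata:
  generateName: backup-postgres-   # controller appends a random suffix
  namespace: kanister
spec:
  actions:
    - name: backup                 # action defined in the Blueprint
      blueprint: postgres-blueprint
      object:                      # the workload being protected
        kind: StatefulSet
        name: postgres
        namespace: default
      profile:                     # storage target for the backup data
        name: s3-profile
        namespace: kanister
```

The kanctl CLI that ships with Kanister can generate ActionSets like this, including restore ActionSets that reference a completed backup.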

I mentioned earlier that we have example Blueprints in Kanister. These widely used databases have Blueprints, and wherever you see a small Kopia logo, those are Kopia-based Blueprints that we have tested. The automation for creating all the required Kopia setup is a work in progress and should be available soon. Regarding backup targets, we can create Profiles in Kanister for Amazon S3, GCS, and Azure. There are also operator Blueprints that we have written for Kafka, the Crunchy Data PostgreSQL Operator, and Cassandra; these act as wrappers around whatever backup operations those operators support.
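For reference, a Profile pointing at an S3-compatible target can be sketched roughly as follows. The bucket, secret, and namespace names are assumptions; check the Kanister documentation for the exact schema:

```yaml
apiVersion: cr.kanister.io/v1alpha1
kind: Profile
metadata:
  name: s3-profile
  namespace: kanister
location:
  type: s3Compliant          # GCS and Azure targets use their own types
  bucket: my-backups         # hypothetical bucket
  region: us-east-1
credential:
  type: keyPair
  keyPair:
    idField: aws_access_key_id
    secretField: aws_secret_access_key
    secret:                  # Kubernetes Secret holding the credentials
      apiVersion: v1
      kind: Secret
      name: s3-creds
      namespace: kanister
```

An ActionSet references a Profile like this by name, which is how the backup data ends up in the chosen external store.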

I couldn’t do an actual demo, but feel free to visit the Kanister GitHub page; you can find the demos there, including the one from the previous DoK Day. Try both Kanister and Kopia, and let us know your feedback! Contributions are always welcome. If you are at KubeCon in Valencia, please do visit our booth: you’ll meet some of the maintainers and get some Kanister swag. Thank you for having me!