Principles for building Kubernetes Operators

Nov 17, 2021 by melissa

The automation of data services on Kubernetes is enjoying increasing popularity. Often engineers are surprised by the complexity when writing their first operator. In this talk the distilled experience collected from years of automating several data services is presented as a collection of principles. These principles will be presented with vivid examples using PostgreSQL as an example. Learn from the PostgreSQL operator example which is being developed in conjunction with Kubernetes end-users. See how principles guide the automation while it is the end-user’s requirements leading to the prioritization of use cases to be automated next.

At the end of the talk, you will have a better understanding of how to avoid common pitfalls in data service automation, allowing you to write better operators from both the technical and the methodological point of view.

This talk was given by Anynines CEO Julian Fischer as part of DoK Day at KubeCon NA 2021. Watch it below. You can access the other talks here.

Bart Farrell 00:00

Our speaker today unfortunately had to go through a little bit of difficulty, but we’re hoping that it’s going to be a happy little accident. We can bring him on now. His name is Julian, and he is going to be taking us a little bit deeper into this whole topic of operators. But first of all, Julian, how are you?

Julian Fischer 00:16

I’m perfectly well, after nearly electrocuting myself. Some severe burns on my finger; it looks a bit funny now. But I’m back. They didn’t have to do any surgery. So I’m fine.

Bart Farrell 00:30

All right, Julian, were you trying to run a stateful workload on yourself?

Julian Fischer 00:36

Well, kind of, um, but you know, it’s a longer story, maybe I’ll share it on Twitter. Sometimes that’s okay, that was a photo of my finger, look.

Bart Farrell 00:43

This does invite a perfect opportunity to have a great story. But anyway, just for folks that may not know, Julian is the CEO of Anynines, also one of our amazing sponsors, so we’re very, very lucky to have them on board with us. Julian, we’re gonna be talking about operators today, and I think you’re the one who’s best qualified to do that, despite some injury, you know, interesting adaptations, modifications you’ve got on your fingers. We’ll figure that all out. We’ve got a great, great team here with us helping us out with the streaming. So I will happily turn it over to you. Folks, as usual, keep the questions going on YouTube. And if you want to directly interact with Julian, you can definitely do so afterwards on Slack. Julian, it’s all yours.

Julian Fischer 01:21

Well, thank you very much. Today, we’re talking about principles for building operators. I’ve already been introduced, so I don’t want to bore you with that, except maybe noting that we’ve built automation for many data services over nearly the last decade. So the things I’d like to share with you are a lot about general data service automation, and then we will look at the Kubernetes context here and there. So it’s a bit of bouncing back and forth between the general topic of data service automation and Kubernetes in particular. It’s usually a one-hour talk, at least, but I’m trying to get through it a bit quicker today. So in general, if you talk about data service automation, one of the first things you have to do is scope: what do you actually mean by data service automation? There’s a mission statement for us at anynines, which is about fully automating the entire lifecycle of a wide range of data services to run on cloud-native platforms across infrastructures at scale. This is not some marketing claim here; it’s an example of how data service automation needs to be scoped. With the intention, for example, to automate multiple data services, you’ll also see certain sharing effects, things that you can put in a data service automation framework beyond the Operator SDK, for example. And thus, the context of your mission has a lot of impact. So think about, for example, a simple Kubernetes cluster, let’s say a small organization that primarily runs its applications using a Postgres database (Postgres is always my favorite example). You know, one Kubernetes cluster, one operator, one service instance; applications will connect to that one database. And there you go. That’s a different story from the story we’d like to talk about here today.
You can imagine that with on-demand provisioning of dedicated service instances, where a service instance, let’s say a Postgres database, is represented as a stateful set, and the operator allows you to create many of them, there’s more complexity, because you have more data service instances to take care of. If you then introduce more data services, for example, you add RabbitMQ, MongoDB, or any other database to the set of your operators, the challenge becomes even greater.

Now, the organizations that we usually work with sometimes have hundreds of thousands of employees, with thousands or even ten thousand developers. It’s unbelievable, the number of engineers they have, and thus there will be many Kubernetes clusters. We think that dozens or hundreds of Kubernetes clusters will be encountered. In the environments that we already experience, for example in virtual machine based data service automation, they often have thousands of virtual machines running thousands of service instances. Depending on whether they are clustered, you can assume that there’s a ratio of one service instance to three pods, for example, if a small clustered instance is running. Now with that scale, the requirements towards automation change a lot, and scale matters. If you solve a simple task like making sausages and handing them out, you can imagine that just by the sheer scale, by the number of people you want to serve, for example, the technical solution has to be adapted as well. And pretty much the same happens for data service automation. So if we think about those large environments, where there are a lot of those service instances sitting around, you should never forget that each data service instance matters to someone. This service instance matters a lot. And therefore, the automation needs to live up to a certain standard. If that standard isn’t lived up to, the automation will be refused by organizations, and technical adoption will not occur.

Alright, so if we now think about data services with Kubernetes, a few topics come to mind. First, well, how do you implement an operator? I think the community knows how this can be done. The most straightforward way would be to use Kubernetes CRDs, custom resource definitions, which allow you to teach Kubernetes new data structures, for example describing your Postgres instance (or Postgres instances, plural, because we are provisioning dedicated instances on demand), as well as a controller that will take the specification of the object you’ve specified and turn it into something viable. So basically, what operators do is translate the specification of a primary object, such as a Postgres instance with Postgres version 12.2, into secondary resources. And the Operator SDK is, to my knowledge, the most popular way to build and generate CRDs, as well as to get boilerplate code for your controllers. So those are the two things we have in mind when we talk about data service automation with Kubernetes. At the same time, there’s KUDO.

If you are interested in what this is, there’s a talk I gave a few weeks earlier at the DoK community meetup, which is very interesting for data service automation prototyping but will not be covered here in greater depth today. Alright, stages of development. If you develop an operator, one of the challenges is how you approach this endeavor in a systematic way. There’s a simple model, we call it the operational model, which is divided into four levels. It helps somebody to approach data service automation when doing this for the first time, as a little construct to set your mind to the task. We propose that in the first level, for example when automating Postgres, the first thing you need to grasp is what a sysadmin or DBA would do. In particular, this perspective is influenced by what an application developer would want. What exactly is it that they desire? What does the average application developer expect from Postgres, for example? Do they need clustered instances with automatic failover? Do they prefer, in that case, asynchronous or synchronous replication? What kind of failover and cluster manager would you like to use? Do you prefer repmgr, or would you rather go with Patroni?

And that’s basically you figuring out the configuration files you want, the basic setup of Postgres. That’s all operational model level one: to understand how to configure the database, just assuming that you have a virtual machine and you can do whatever you want, you know, install packages, and so on. Once you’ve done that, once you know what the configuration files should look like, you can think about containerization. That could be picking existing container images and assembling them into Kubernetes specifications of stateful sets and services, and creating a template for creating secrets, which is the YAML part in operational model level two. So at the end of operational model level two, regardless of whether you’ve chosen existing container images or created them yourself, you also have Kubernetes specifications that you can use with kubectl to create your own service instances manually. Once you can create your Postgres instance, let’s say with three replicas and synchronous streaming replication, and you know how to do that manually, then you can turn it into an operator much more easily by thinking about the problem: how do you write the code that, for example, creates that particular stateful set, that particular headless service, that particular secret?
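As a rough illustration of what the level-two artifacts might look like (this is not from the talk), the sketch below renders a headless service, a secret, and a stateful set for one instance as plain data. The instance name, the `postgres:12.2` image tag, the storage size, and the helper name `render_manifests` are assumptions, and the skeleton deliberately leaves out replication and cluster manager setup.

```python
import json

def render_manifests(name, replicas, image, password):
    """Render level-two manifests for one Postgres instance as a v1 List."""
    labels = {"app": name}
    service = {
        "apiVersion": "v1", "kind": "Service",
        "metadata": {"name": name},
        # clusterIP "None" makes this a headless service for stable pod DNS names.
        "spec": {"clusterIP": "None", "selector": labels,
                 "ports": [{"name": "postgres", "port": 5432}]},
    }
    secret = {
        "apiVersion": "v1", "kind": "Secret",
        "metadata": {"name": f"{name}-credentials"},
        "type": "Opaque",
        "stringData": {"password": password},
    }
    container = {
        "name": "postgres", "image": image,
        "ports": [{"containerPort": 5432}],
        "env": [{"name": "POSTGRES_PASSWORD", "valueFrom": {
            "secretKeyRef": {"name": f"{name}-credentials", "key": "password"}}}],
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/postgresql/data"}],
    }
    stateful_set = {
        "apiVersion": "apps/v1", "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels},
                         "spec": {"containers": [container]}},
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {"accessModes": ["ReadWriteOnce"],
                         "resources": {"requests": {"storage": "1Gi"}}},
            }],
        },
    }
    return {"apiVersion": "v1", "kind": "List",
            "items": [service, secret, stateful_set]}

manifests = render_manifests("orders-db", 3, "postgres:12.2", "change-me")
print(json.dumps(manifests, indent=2))
```

The JSON output is the kind of document one would hand to kubectl manually at this level; an operator at the next level would generate the same objects from a custom resource instead of from function arguments.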

Now, if we remind ourselves that the environment we are talking about potentially contains thousands of data service instances, across multiple data services and across many Kubernetes clusters, we also need to accept that operator lifecycle management itself is an essential part of our toolchain. Therefore, we also need automation to manage the lifecycle of the operator itself. Whether this is the Operator Lifecycle Manager or some other technology doesn’t matter at this point. Most importantly, you need to think of this as a part of your overall data service automation challenge. Now, if you think about Kubernetes operators, be reminded that with custom resource definitions, basically a YAML structure like this describes a new data type that can be taught to the Kubernetes API, which will then provide an endpoint to you, as well as persistently store the specification in its etcd. It’s not prettily formatted here, but you can see what a custom resource of a particular custom resource definition will look like. Here, we taught Kubernetes how to create such an object. However, your CRD alone doesn’t do anything, because you need the controller, which has the code that observes events, for example that such an object has been created. The controller can then take care of finding out whether, for this particular service instance, the secondary resources already exist: a service, a secret, and the stateful set it wants to create. So Kubernetes controllers, as I said before, basically translate primary resources into a combination of secondary resources. In our example, to this point, these resources have been Kubernetes-internal resources, but this is not necessarily the case. We’ll come back to that later.
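The slide with the YAML is not part of this transcript, so here is a hedged stand-in: a hypothetical CRD written as a Python dict that mirrors the YAML manifest, one custom resource of that kind, and a toy controller that creates whichever secondary resources are missing. The group `example.com`, the kind `PostgresInstance`, the spec fields, and the in-memory `FakeCluster` are all assumptions made for illustration; a real controller would talk to the Kubernetes API.

```python
# Hypothetical CRD, mirroring the YAML manifest as a Python dict.
POSTGRES_CRD = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "postgresinstances.example.com"},  # <plural>.<group>
    "spec": {
        "group": "example.com",
        "scope": "Namespaced",
        "names": {"kind": "PostgresInstance",
                  "singular": "postgresinstance",
                  "plural": "postgresinstances"},
        "versions": [{
            "name": "v1alpha1", "served": True, "storage": True,
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {"spec": {
                    "type": "object",
                    "required": ["version", "replicas"],
                    "properties": {"version": {"type": "string"},
                                   "replicas": {"type": "integer", "minimum": 1}},
                }},
            }},
        }],
    },
}

def new_instance(name, version, replicas):
    """Build one primary object (custom resource) of the kind defined above."""
    spec = POSTGRES_CRD["spec"]
    return {"apiVersion": f'{spec["group"]}/{spec["versions"][0]["name"]}',
            "kind": spec["names"]["kind"],
            "metadata": {"name": name},
            "spec": {"version": version, "replicas": replicas}}

class FakeCluster:
    """In-memory stand-in for the Kubernetes API: (kind, name) -> object."""
    def __init__(self):
        self.objects = {}
    def get(self, kind, name):
        return self.objects.get((kind, name))
    def create(self, kind, name, obj):
        self.objects[(kind, name)] = obj

def desired_secondary(instance):
    """Translate the primary object into the secondary resources it needs."""
    name = instance["metadata"]["name"]
    return {("Service", name): {"clusterIP": "None"},
            ("Secret", f"{name}-credentials"): {"type": "Opaque"},
            ("StatefulSet", name): {"replicas": instance["spec"]["replicas"]}}

def reconcile(cluster, instance):
    """Create every secondary resource that does not exist yet."""
    created = []
    for (kind, name), obj in desired_secondary(instance).items():
        if cluster.get(kind, name) is None:
            cluster.create(kind, name, obj)
            created.append(kind)
    return created

cluster = FakeCluster()
orders_db = new_instance("orders-db", "12.2", 3)
print(reconcile(cluster, orders_db))  # ['Service', 'Secret', 'StatefulSet']
print(reconcile(cluster, orders_db))  # []
```

The second call creating nothing is the point: the controller compares desired and existing secondary resources on every pass rather than assuming it runs only once.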
If you also want to get started writing operators, the Operator SDK makes a proposal about operator maturity levels, where the operator is classified into five different classes. I’m not really sure whether I would agree with the assignment of all those abilities to these classes. But if you get started, it’s definitely a good place to start and to ask the right questions, which are also in the documentation. So just as a hint, this will bring a few of your thoughts into a sequence, and that’s very good. I think that if you really build operators, you need to get some, let’s say, core functionality together, for example patch-level updates as well as backup plans and backup and restore functionality. Usually, these are must-have criteria, and users are likely to refuse a solution if it doesn’t have that. But at some point you’ll start with your implementation, and that plan will help you a bit. So keep it in mind. Common pitfalls. Well, there are very many common pitfalls. Let’s exclude those problems that arise from programming distributed systems in general, or, for example, problems with organizing Git or anything like that. I would say from my experience, probably the biggest problem with data service automation in general is that people underestimate the complexity and effort required to do it. This has many manifestations, including insufficient coverage of essential lifecycle operations, as well as other qualities such as robustness and observability being insufficient. Now, at this point, it makes sense to ask what the barrier of entry actually is: what does the automation need to do in order to be accepted by the target audience?
While this is heavily dependent on the target audience itself, there are a few things I can share that I’ve learned with our organization and that are important to many of our larger clients. We won’t go through them all, because it’s a bit time-consuming for the little time we have. But, you know, accepting configuration updates is important to the degree that the application developer is able to express with the automation the things they’ve learned about the database and their application. Often, if the application is nontrivial, you need to tweak the database a bit so that it really utilizes the resources as well as possible. So you need to interview the target audience and find out whether these configuration options are already in the automation or not. And you need to be good at adapting your automation to particular needs as you gain more developers within your organization. Obviously, all the cloud-native requirements are there, like being observable and being infrastructure agnostic. Well, with Kubernetes, to some degree you already get that. But in the context of backups, when you need to store the backups somewhere, you often have to write them to an object store. And this is where people make assumptions about the existence of an S3 API, for example, where you should rather go with some abstraction library that hides the underlying object store.
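To make the object store point concrete, here is a minimal sketch of such an abstraction, assuming a hypothetical `BackupStore` interface with two interchangeable backends. None of these names come from the talk or from a real library; an S3-backed class would be one more implementation of the same interface, and the backup code would not change.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile

class BackupStore(ABC):
    """Hypothetical interface the backup code depends on, instead of an S3 client."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BackupStore):
    """Backend for tests."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

class DirectoryStore(BackupStore):
    """Backend that writes to a directory, e.g. a mounted volume."""
    def __init__(self, root):
        self._root = Path(root)
    def put(self, key, data):
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    def get(self, key):
        return (self._root / key).read_bytes()

def store_backup(store: BackupStore, instance: str, timestamp: str, dump: bytes) -> str:
    """Write one backup and return its key; works with any BackupStore."""
    key = f"{instance}/{timestamp}.dump"
    store.put(key, dump)
    return key

memory = InMemoryStore()
key = store_backup(memory, "orders-db", "20211117T0200", b"fake dump bytes")
print(key, memory.get(key))

with tempfile.TemporaryDirectory() as root:
    disk = DirectoryStore(root)
    key = store_backup(disk, "orders-db", "20211117T0200", b"fake dump bytes")
    print(key, disk.get(key))
```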

Horizontal scalability of service instances: if you think about a service instance, you could think about a single Postgres with just a single pod, or a clustered Postgres with asynchronous streaming replication. Once you want to scale that out horizontally from one to three replicas, you introduce a lot of complexity into the automation, because, let’s say, Postgres isn’t a simple service to automate, which makes it my favorite example. So you will need to add a cluster manager for failure detection, and you need to have leader election and leader promotion logic. Also, if you happen to be in a data center with different availability zones, you want to make sure that you use them, distributing your pods so that they won’t end up on a single Kubernetes node. And if there are availability zones and your Kubernetes cluster is aware of them, making use of them is something that should absolutely be done if possible. In general, reconstructing a stateful set through the lifecycle will happen many times, whether this is because of planned switchovers, doing upgrades, or vertical scaling, making smaller pods larger, for example. These things are to be incorporated. Backup and restore, we’ll come to that again. It’s very important, obviously, because often this is the last resort for application developers to recover their application without waiting for manual intervention by a platform operator. So it’s all about on-demand self-service, insofar as the application developer can take care of themselves and create service instances, modify them, and reconfigure them. And if they happen to have, let’s say, failed, or the data has been deleted accidentally, they need to be able to recover the data within the requirements of the application, in particular its tolerance for potential data loss.
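One way to express the pod distribution described here is with `topologySpreadConstraints` in the pod template. The sketch below patches a stateful set manifest (as a Python dict) so that pods are spread across zones and, as a softer preference, across nodes. The helper names and the sample manifest are assumptions; the zone and hostname keys are the well-known Kubernetes node labels, which only have an effect if the cluster’s nodes actually carry them.

```python
import copy

def zone_spread_constraints(pod_labels, max_skew=1):
    """Constraints that spread matching pods across zones and nodes."""
    selector = {"matchLabels": dict(pod_labels)}
    return [
        # Hard requirement: zones may differ by at most max_skew pods.
        {"maxSkew": max_skew,
         "topologyKey": "topology.kubernetes.io/zone",
         "whenUnsatisfiable": "DoNotSchedule",
         "labelSelector": selector},
        # Soft preference: avoid stacking pods on one node where possible.
        {"maxSkew": max_skew,
         "topologyKey": "kubernetes.io/hostname",
         "whenUnsatisfiable": "ScheduleAnyway",
         "labelSelector": selector},
    ]

def with_spreading(stateful_set):
    """Return a copy of a StatefulSet manifest with spread constraints added."""
    patched = copy.deepcopy(stateful_set)
    template = patched["spec"]["template"]
    template["spec"]["topologySpreadConstraints"] = zone_spread_constraints(
        template["metadata"]["labels"])
    return patched

# Minimal sample manifest (hypothetical instance name and image).
sample = {
    "apiVersion": "apps/v1", "kind": "StatefulSet",
    "metadata": {"name": "orders-db"},
    "spec": {"replicas": 3,
             "template": {"metadata": {"labels": {"app": "orders-db"}},
                          "spec": {"containers": [{"name": "postgres",
                                                   "image": "postgres:12.2"}]}}},
}

patched = with_spreading(sample)
for constraint in patched["spec"]["template"]["spec"]["topologySpreadConstraints"]:
    print(constraint["topologyKey"], constraint["whenUnsatisfiable"])
```

Whether the zone constraint should be hard or soft is a policy choice: a hard constraint can leave pods unschedulable when a zone is down, which may or may not be acceptable for a given audience.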
One thing that’s also not very obvious sometimes is which data service versions to provide. Providing, let’s say, the newest Postgres version is good; more progressive users will love that. But in an organization there could also be applications that have been stable for a while. They are usually in maintenance mode, they don’t evolve that much, and application developers therefore need to be able to choose which data service version they want. Managing the operator, with the number of versions to support and for how long, with sunrise and sundown phases for all the supported versions of your automation, is an essential policy that you will have to make, because providing too many versions will also use up a lot of your team’s capacity. Well, documentation, that helps you to reduce support effort. But security is also important: encryption at rest and encryption in transit are often demanded, where you want the data residing on the disk to be encrypted, for example, as well as the data being sent from the client to the data service instance and between the pods in the stateful set. That should all be encrypted.
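A version policy with sunrise and sundown phases can be as simple as a table the automation consults before creating or upgrading an instance. The following is only a sketch of that idea; the version numbers, dates, and phase names are made up for illustration.

```python
from datetime import date

# Hypothetical policy table; the dates are invented for this example.
VERSION_POLICY = {
    "12": {"sunrise": date(2020, 1, 1), "deprecated": date(2023, 1, 1),
           "sundown": date(2024, 1, 1)},
    "13": {"sunrise": date(2021, 1, 1), "deprecated": date(2024, 1, 1),
           "sundown": date(2025, 1, 1)},
}

def phase(version, today):
    """Classify a version as upcoming, supported, deprecated, retired or unknown."""
    entry = VERSION_POLICY.get(version)
    if entry is None:
        return "unknown"
    if today < entry["sunrise"]:
        return "upcoming"
    if today >= entry["sundown"]:
        return "retired"
    if today >= entry["deprecated"]:
        return "deprecated"
    return "supported"

def can_create(version, today):
    """New instances are only offered for fully supported versions."""
    return phase(version, today) == "supported"

def can_keep_running(version, today):
    """Existing instances may stay on a deprecated version until sundown."""
    return phase(version, today) in ("supported", "deprecated")

today = date(2023, 6, 1)
for version in ("12", "13", "14"):
    print(version, phase(version, today), can_create(version, today))
```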

Alright, so be aware that these service instances they are not, they are not something that will go away quickly. This may be the case, but for some instances that live a long life with applications, the service instance may live years. And if you think about the lifecycle use cases and the things that happened to the service instances, you’ll get a long list, the list is much longer than this one. But this gives you your first impression about the things that may happen, like scale out and scale up. Going through various version upgrades, binding applications to service instances, which I call service bindings here. But also dealing with network partitioning, and fluctuations in network bandwidth and delay.

All these things you have to take into consideration. The service bindings represent the connection of the application to a data service instance. If you think, for example, about two microservices, two different applications so to say, connecting to the same service instance, it would be desirable that each of those applications has access to the data service instance, let’s say Postgres, with a dedicated Postgres user, so that the secret is unique.

So there are two things you have to do in order to make such a service binding happen: you have to create a secret where the credentials are stored, and then you have to create the actual data service user.
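As a small sketch of the first half (the secret), the function below generates unique credentials per binding and wraps them in a Secret manifest. The naming scheme and the function name `new_binding` are assumptions; creating the matching database user is the second step and is not shown here.

```python
import base64
import secrets

def _b64(value: str) -> str:
    """Secret `data` values must be base64-encoded strings."""
    return base64.b64encode(value.encode()).decode()

def new_binding(instance: str, app: str):
    """Generate a dedicated username and password plus the Secret holding them."""
    username = f"{app}_{secrets.token_hex(4)}".replace("-", "_")
    password = secrets.token_urlsafe(24)
    secret = {
        "apiVersion": "v1", "kind": "Secret",
        "metadata": {"name": f"{instance}-{app}-binding"},
        "type": "Opaque",
        "data": {"username": _b64(username), "password": _b64(password)},
    }
    return username, secret

# Two applications bound to the same instance get separate credentials.
user_a, secret_a = new_binding("orders-db", "checkout")
user_b, secret_b = new_binding("orders-db", "reporting")
print(secret_a["metadata"]["name"], secret_b["metadata"]["name"])
```

Separate users per binding also mean that deleting one binding can revoke one application’s access without touching the other.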

And that has some complexity, which we’ll come back to later. A similar aspect of data service automation that, in my opinion, should be represented idiomatically with CRDs in Kubernetes is backups and backup plans. So if you want to create a backup for a particular service instance, describe it with a CRD. And similar to a job and a cron job, a backup plan describes how to create those backups regularly.
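The Job and CronJob analogy could look roughly like this: a `BackupPlan` object carries a schedule, and each run stamps out one `Backup` object. The kinds, the group `example.com`, and the spec fields below are hypothetical and only illustrate the relationship between the two resources.

```python
from datetime import datetime

GROUP_VERSION = "example.com/v1alpha1"  # hypothetical API group and version

def backup_plan(instance, schedule, retain):
    """A BackupPlan is to a Backup what a CronJob is to a Job."""
    return {
        "apiVersion": GROUP_VERSION, "kind": "BackupPlan",
        "metadata": {"name": f"{instance}-plan"},
        "spec": {"instanceRef": instance, "schedule": schedule, "retain": retain},
    }

def backup_for(plan, when):
    """Create the one-off Backup object that a plan produces at time `when`."""
    instance = plan["spec"]["instanceRef"]
    return {
        "apiVersion": GROUP_VERSION, "kind": "Backup",
        "metadata": {"name": f"{instance}-{when:%Y%m%d%H%M}",
                     "labels": {"backup-plan": plan["metadata"]["name"]}},
        "spec": {"instanceRef": instance},
    }

plan = backup_plan("orders-db", "0 2 * * *", retain=7)  # cron syntax: daily at 02:00
backup = backup_for(plan, datetime(2021, 11, 17, 2, 0))
print(backup["metadata"]["name"], backup["metadata"]["labels"])
```

A developer who wants a one-off backup would create a `Backup` object directly, without any plan, just as one can create a Job without a CronJob.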

Now, from the methodological perspective, there are certain principles in data service automation that you may benefit from if you stick to them. So before we come back to the technological takeaways, the principles you may want to think about are, first of all, as we’ve seen earlier, “know your audience”. The requirements and desired qualities are essential. You need to understand whether you do something for a team that is highly dependent on a particular data service. For example, I’ve seen companies where the entire company revolves around a single MongoDB instance, or a few of them. Building an operator for that case will be different from the context I presented earlier. So being aware of that context is one of the other core ingredients of good data service automation. Choosing data services wisely is also a good thing. As I mentioned earlier, when automating Postgres, for example, you have to go a long way. With other data services, you may run into license issues, because they tend to change licenses sometimes, some even moving away from open source licenses, and you have to take care of that as well. With data services that are backed by a single vendor and have an open source license, that may happen. In general, on the idea of what to design for: whenever you use stateful sets, you’re already using on-demand provisioning, and whenever you change the stateful set, it’s very likely that the pods will be recreated. So that’s what is addressed by “rebuild failed instances instead of fixing them”.
You can use that rebuilding from a known state as a tool that helps you to fix problems in some cases. “Operational model first” is what I explained earlier: just understand the data service first before you head into automation. “Be a backup and restore hero” is necessary and important because that’s the means of last resort before the platform operator’s telephone rings, and that should be avoided. If you happen to automate multiple data services, there’s a lot of synergy between them, and that should be incorporated into a framework. This framework may have a code base to share between controllers, but also maybe scripts that you will have in container images, or anything like that. Testing is very important because with automation, you basically write some code, that code will be distributed to many environments, and in those environments you will create many service instances. So having good test coverage, both for the code and for the resulting service instances, guiding them through use cases with integration tests, whatever you call this, is very important. My advice is also to have test cases for those scenarios where you know that your automation still has weaknesses, to share those test cases with your clients, and to allow them to run these tests in their local environment, because that creates trust. And this also gives them the opportunity to get a better feeling for the circumstances that they can maybe monitor to avoid running into problems.

Yeah, “don’t touch upstream source code” is a principle that we use because we have so many different data services. We do forks and pull requests; hot fixes, temporary hot fixes, are allowed. “Master release management” means that the delay from the day a Postgres version is released to the day your automation is released should be short. And once you have released the automation, you want to deliver that release into target environments fast, because only then can the application developers upgrade their instances.

Technology. A few words about technology, not too much because we are running out of time. When writing a controller, when you do that the first time, the reconciliation of external resources is sometimes a bit tricky. So in the example of a service binding, you need to create a secret but also a Postgres database user.

Now, there are several ways you can do that. But in general, the challenge will be: how do you ensure consistency, that both of these resources are there? Well, I think the most straightforward answer would be to have a two-level approach. You could, for example, represent the Postgres user as a custom resource as well, and have a controller for it, so that this controller has a single purpose, which is to reconcile those Postgres users. That reduces the complexity in the service binding controller, which creates the secret, a secondary resource already known to Kubernetes, as well as the Postgres user, which, if you have it as a CRD, is also a resource known to your Kubernetes cluster. So one of the takeaways is that there is no atomicity guarantee; there are no transactions here. Having a declarative approach is more idiomatic in Kubernetes. You can live with such an inconsistent state because you have eventual consistency with Kubernetes. So the controller should be able to be notified if the Postgres user couldn’t be created, and it will just try to reconcile again.
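A toy version of the two-level approach, under the assumption that the cluster is a plain dict and the database is a fake that can be unreachable: the binding controller only declares Kubernetes objects, and a separate user controller is the only code that touches the database and simply asks to be requeued when that fails. All names here are hypothetical.

```python
class FakeDatabase:
    """Stand-in for Postgres that may be temporarily unreachable."""
    def __init__(self):
        self.reachable = False
        self.users = set()
    def ensure_user(self, username):
        if not self.reachable:
            raise ConnectionError("database not reachable")
        self.users.add(username)

def reconcile_binding(cluster, binding_name, username):
    """Level 1: declare the Secret and the PostgresUser object, nothing else."""
    cluster.setdefault(("Secret", binding_name), {"username": username})
    cluster.setdefault(("PostgresUser", binding_name),
                       {"username": username, "ready": False})

def reconcile_user(cluster, database, binding_name):
    """Level 2: single purpose, make the database match the PostgresUser object."""
    user = cluster[("PostgresUser", binding_name)]
    try:
        database.ensure_user(user["username"])
    except ConnectionError:
        user["ready"] = False
        return "requeue"  # no transaction to roll back; just try again later
    user["ready"] = True
    return "done"

cluster, database = {}, FakeDatabase()
reconcile_binding(cluster, "orders-app", "orders_app")
first = reconcile_user(cluster, database, "orders-app")   # database is down
database.reachable = True
second = reconcile_user(cluster, database, "orders-app")  # converges on retry
print(first, second)  # requeue done
```

Between the two calls the system is inconsistent (the secret exists, the user does not), and the `ready` flag is what lets other components see that state instead of assuming atomicity.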

So design your actions and make them idempotent, so that it is possible for the loop to reconcile the same spec again, and you don’t get stuck and you don’t get into an error state here.

So instead of “create user”, do “create user if not exists”, for example. Just a simple example, food for thought.
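The difference between the two styles can be shown with a fake database (an assumption for this sketch; no real Postgres is involved): the plain create fails on the second reconcile pass, while the “ensure” variant can run any number of times.

```python
class DuplicateUser(Exception):
    pass

class FakePostgres:
    """In-memory stand-in that tracks roles and their passwords."""
    def __init__(self):
        self.roles = {}

    def create_user(self, name, password):
        """Not idempotent: a second call with the same name raises."""
        if name in self.roles:
            raise DuplicateUser(name)
        self.roles[name] = password

    def ensure_user(self, name, password):
        """Idempotent: create if missing, otherwise converge the password.
        Returns True only when something actually changed."""
        if self.roles.get(name) == password:
            return False
        self.roles[name] = password
        return True

db = FakePostgres()
print(db.ensure_user("orders_app", "s3cret"))  # True, user created
print(db.ensure_user("orders_app", "s3cret"))  # False, nothing to do

db.create_user("reporting", "pw")
try:
    db.create_user("reporting", "pw")  # a repeated reconcile pass
except DuplicateUser as error:
    print("second create failed:", error)
```

As far as I know, PostgreSQL’s `CREATE ROLE` has no `IF NOT EXISTS` clause, so in practice the “if not exists” part usually means checking the `pg_roles` catalog first or tolerating the duplicate-object error.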

One last thing to share: caching. If you modify a resource in your controller, you’re basically modifying it using the Kubernetes API. Your local cache may not directly and immediately reflect the change, and this can sometimes lead to strange behavior. So be aware that there might be a lag between the update of objects in the Kubernetes API and the update of your local cache, especially if you’re dealing with controllers that somehow have a hierarchical relationship.
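The effect can be simulated without a cluster. In this sketch (all classes are hypothetical stand-ins), the controller reads from a cache that lags behind the API. To avoid acting twice on stale data, it remembers which `resourceVersion` it read when it last wrote and requeues for as long as the cache still shows exactly that version. This assumes writes are guarded by optimistic concurrency, so any different version in the cache already includes the controller’s own write.

```python
class FakeApi:
    """Stand-in for the API server; every write bumps resourceVersion."""
    def __init__(self):
        self.obj = {"resourceVersion": "1", "replicas": 1}
    def update(self, replicas):
        version = str(int(self.obj["resourceVersion"]) + 1)  # simulation only
        self.obj = {"resourceVersion": version, "replicas": replicas}
        return dict(self.obj)

class LaggingCache:
    """Local cache that only catches up when sync() is called."""
    def __init__(self, api):
        self.api = api
        self.snapshot = dict(api.obj)
    def sync(self):
        self.snapshot = dict(self.api.obj)
    def get(self):
        return dict(self.snapshot)

class Controller:
    def __init__(self, api, cache):
        self.api, self.cache = api, cache
        self.stale_version = None  # version we read when we last wrote

    def reconcile(self, desired_replicas):
        cached = self.cache.get()
        if cached["resourceVersion"] == self.stale_version:
            return "requeue"  # cache has not seen our own write yet
        self.stale_version = None
        if cached["replicas"] != desired_replicas:
            self.api.update(desired_replicas)
            self.stale_version = cached["resourceVersion"]
            return "updated"
        return "in sync"

api = FakeApi()
cache = LaggingCache(api)
controller = Controller(api, cache)
results = [controller.reconcile(3), controller.reconcile(3)]
cache.sync()
results.append(controller.reconcile(3))
print(results)  # ['updated', 'requeue', 'in sync']
```

Without the `stale_version` check, the second call would have issued the same update again, which is harmless here but can cause duplicate side effects when the write creates child objects.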

So, summing it up. The technology part in this talk was a bit short. We had a lot of discussion around the general challenges with data service automation, how important it is to understand the target audience, and how the context affects your mission in writing your particular operator or operators. And I presented a few ideas on the things that are commonly requested. Maybe that’s a good place to start: if you write your own operator, ask yourself, is that what our application developers need? And if not, talk to them, try to get to their idea. And then try to make your first steps. If you come back to this talk, the slides will be shared; maybe there’s something you want to revisit later that is of some benefit to you.

You can find me on Twitter and ask me anything. So feel free to do so, I’m happy to chat about this topic. It’s one of my passions, besides nearly electrocuting myself while trying to build a battery. So thank you very much for your attention.
