For years, Apache Kafka relied on Apache ZooKeeper for maintaining its metadata and coordination. But that is coming to an end. After a lot of work in the Apache Kafka community, ZooKeeper is going away from Apache Kafka and it will be replaced with its own Raft-inspired implementation called KRaft. This is a major architecture change for all Kafka users, including those running Kafka on Kubernetes. And it also affects projects such as Strimzi that provide tooling for running Apache Kafka on Kubernetes. So, how does it work? What are the advantages? What does this change mean for the existing ZooKeeper-based Kafka clusters? What are the main challenges and limitations when using Kraft on Kubernetes? What are the changes we had to make in the Strimzi project to make it ready for KRaft? All of this was answered in this talk including a short demo of what Strimzi support for KRaft looks like.
Speakers:
- Jakub Scholz – Senior Principal Software Engineer, Red Hat
Watch the Replay
Read the Transcript
Speaker 1: 00:00 Hello everyone and welcome to CubeCon and yeah, my name is AK Schultz. I work at Red Hat and mainly I’m a maintainer of the project called Remsey, which is all about running Apache Kaf Con Kubernetes. And today I will be talking about the work we have been doing there, which is around removing the zookeeper dependency. But before we get there, little project announcement because recently as a stream we moved from sandbox to incubation. So rated it’s a big achievement for us.
00:40 Thanks. Unfortunately, it doesn’t really allow us to stop working on the zookeeper removal and just enjoy our rest, so we have to finish that. But if you ever run Apache Kafka, you probably know that for a very long time. It always depended on Apache zookeeper. You had to first start the zookeeper cluster, you had to get it running, you had to know how to deploy it, how to run it, how to operate it, and only then you can actually run Kafka because it was using zookeeper for storing metadata, for bootstrapping the cluster, for electing the Kafka controller and all these things. But that’s now changing and the zookeeper is going away, and so to say the Kafka is now released from the cage and it’s free to fly. The way it is done is that the zookeeper is replaced by something what’s called craft, which is stands for Kafka Raft Protocol, which basically is Kafka’s own take on the raft algorithm, which will replace zookeeper.
01:42 Maybe I can start by talking a bit about motivation, why this is happening. So one reason is to get rid of zookeeper. Simple as that. Zookeeper has some nice things and it does many things very well, but it has also some issues. For example, lately we faced a lot of issues around VNS resolutions on Kubernetes where a lot of our users face some problems. So yeah, these are kind of the things by, to be honest, I personally will not really cry when zookeeper is gone, but what’s also important is that removing zookeeper simplifies the deployment. You used to have the zookeeper cluster and the Kafka cluster. You had to know both of them had the knowledge for both of them, and now it’ll be just Kafka. So you don’t need to know zookeeper, you can forget everything you knew about it and it’ll be all Kafka. So that simplifies it a lot for all Kafka users, but obviously also for us in Rezi because the Rezi had to also manage and operate zookeeper clusters as well. Finally, there might be also some scalability and performance improvements, but I guess a lot of that might be visible when you are having very big Kafka clusters. When you have hundreds, thousands of partitions, then there might be more improvements than when you have a small cluster with just few hundred partitions.
03:09 Maybe some of you get the feeling that this is not really new, and why the hell is this guy talking about zookeeper removal here? And that’s not completely wrong because this is something that basically started in 2019 and today, if I checked correctly in the morning we have 2024, so it’s a very long time, but that kind of demonstrates how big change and how big for this is. A lot of that is obviously in the Apache Kafka project itself, not necessarily on the ZI side where I spend most of my time, but even on Zi side, that’s quite a lot of changes and a lot of work.
03:49 So how will Kafka look like when zookeeper is gone? The way it is implemented is that the Kafka node will now have two different roles. One of them will be the controller role. That’s basically what will replace the zookeeper clusters. That’s what will be used to manage the metadata. That’s what will be used to bootstrap the cluster. And then there will be the broker role, which will do exactly the same as the old brokers in the zookeeper clusters we’re doing. It’ll just be responsible for delivering the messages between clients, storing them, and so on. What’s interesting is that you can combine these roles in a single JVM or in a single process, which actually gives us quite a variety of different architectures where in the past we always had zookeeper and the brokers. Now you can, for example, start with a single container with a single pot, which has both the controller and the broker.
04:46 And for some development testing, this might be completely sufficient. If you would want some high availability, you can basically just scale it and run three of these containers each having both roles. And yeah, you get the high availability. It still runs in this combined mode without any zookeeper cluster or anything. If you would want to scale it a bit more, then you don’t really want to scale the controllers up and up because they will not scale forever. So you can simply decide that the next node you will add will be the broker role only, and that way you can have some mixed node and then you can have the broker node. And then finally, what we kind of expect that most production clusters will actually use, it’s quite similar to the zookeeper based architecture where you have kind of the dedicated controller notes, which do the metadata management and the bootstrapping, and then you have the dedicated broker notes for handling the messages.
05:47 Why we think that this will be what will be used in most production clusters. It’s because this way you can control how many resources will the controllers get, will the brokers get, and you get the best performance and scalability out of your cluster. But in general, each of these architectures might fit a bit different environment depends on your infrastructure, whether you run on the edge for example, or whether you run on bare metal where each node is beefy and has the same size or on VMs, and it’s quite easy for you to have smaller nodes for controllers and bigger for brokers. In Streamy, we really want to support all of these architectures because we have users using streamy for all of these different things and in these different environments, and we also want to allow the transitions between them, which is quite important if you want to start small, but maybe you want to grow your Kafka cluster. As your project grows, product grows or as your company grows,
06:50 There will be also a migration. So if you are already running Kafka cluster based on zookeeper, don’t worry, it won’t be left behind to rot away over the time you will be able to migrate it from zookeeper to the craft quorum. It’s kind of a multi-stage process where you first introduce the craft nodes, you let them syn the data from the zookeeper, then you kind of shift the traffic from the zookeeper nodes to the craft controllers, and then you definitely remove the zookeeper. So you can migrate like this, it’s not as easy to migrate the other way around. So once you run craft, then you run craft and you don’t really get to zookeeper. What’s important is that in Kafka 4.0 zookeeper will be completely removed from Kafka and you have to do this migration before you upgrade to Kafka 4.0 because you have to first within the same version migrate from zookeeper to craft, and only then you can within craft upgrade to Kafka 4.0. So that’s something what might be useful if you have the zookeeper based customer need to plan the migration.
08:03 So that was kind of the intro to how the craft stuff works and how the zuki plus Kafka look like. I wanted to talk about some challenges which we faced, which maybe might be useful also to people not just interested in Kafka, but also in other projects and tools. Some of these challenges were related to Apache Kafka itself. So as I said on the beginning, it’s something that’s going on for several years now. So obviously there have been plenty of bugs. There have been plenty of missing features which were added over time. I think that’s kind of expected. One of the other issue we faced was that a lot of the stuff which was done initially was not operationalized. So there were no APIs to manage things. There were no metrics to kind of observe the state of the system. And when you work on something like streams, which is all about automating things, then we had instructions like, yeah, read the lock, find if there are some line in the lock, and then do something that’s not really reliable and something you can automate on.
09:08 So that was one of the issues. Another issue was what I call a pre-mature production readiness, which Kafka announced K Craft as production ready quite early for my taste when it still had a lot of missing features and a lot of bug. But for a lot of users who didn’t really follow the Kafka project, this triggered a huge flow of questions on us like, Hey, this is now production ready in Kafka, why isn’t it supported in resi? And to be honest, we spent awfully a lot of time just answering these questions and explaining that maybe it’s not as production ready as it sounds in our opinion, and that it’ll take time to get there. But there’s other set of challenges which were quite unique, tore SY and to running Kafka and Kubernetes. And one of them comes back to the different architectures, which I described on the beginning.
10:03 I said that the advantage of the removing zookeeper is that we’ll have now only Kaf connotes, and that’s great on paper, but when you run on Kubernetes, it can be quite challenging because you have only Kafka noes, but each of them has a little bit different configuration, and that’s not always something that fits well because it leads to what I call asymmetric node configuration or pot configuration where each of them is kind of slightly differently configured. And to try to demonstrate how this fits Kubernetes is this is how would your old school zookeeper cluster look like? You would have, for example, three zookeeper notes and three Kafka brokers. Each of them has its own kind of idea, which has to be unique, but it needs to have a unique ID for the zookeeper. No, and unique IDs for the Kafka node, it doesn’t matter that there’s Kafka broker with the ID one and the zookeeper node with the ID one also.
10:59 Each of them has a different configuration, but they are simply different application. So if you want to deploy that on Kubernetes, the kind of obvious thing is you create two different stateful sets. Each of them has its own configuration for the application, each own configuration for storage resources, scheduling and all these things. So that fits quite nicely into the Kubernetes workload APIs. But when we move to the ZOOKEEPER plus Kafka cluster, yeah, you have only the Kaf anos, but each of the kaf ANOS has a little bit different role. Maybe here you started with three brokers and some mixed node with both controller and broker roll. Then you scaled up, added some more brokers, then you decided that maybe it’s better to run the dedicated controller nodes instead of the mixed node. So you started adding the dedicated controllers and then you will remove later the mixed controller nodes.
11:59 So it can easily have the Kafka cluster in a state like this. Now, all of these nodes, they share the same sequence for the node IDs. So they start somewhere and end somewhere, but you can’t have their two Kafka notes with the ID one. It has to be unique. So that’s kind of another limitation. And each of them has a different configuration for the application itself, but also for the resources for storage and so on. So the brokers, they will have the broker configuration, they will have the CPU and quite a lot of memory. They will have definitely a lot of storage, the mixed nos, they might have the same hardware configuration or resources configuration, but we’ll have a different configuration for the application. Then you have again, some brokers which have the broker stuff, and then you have the controller only nodes, which will have only the controller configuration and we’ll possibly have less resources because they don’t need as much resources as brokers.
13:01 So you have kind of only Afghan nodes, but you can’t easily go and represent them with a state full set. So this is something what we had to deal with, and it’s funny because something similar was discussed in one of the previous talks about the Mongo operator. We actually took the decision to leave the state full sets behind and we moved to managing the pots ourself directly internally. We are actually using our own resource for it, which is called Stream support set, but it’s kind of just to better separate the code and better kind of design internally the code base. It’s not something the users will be using. But what we did is we introduced another custom resource, which is called Kafka node pools, which is used by users, and it’s used to define the configurations for different groups of the Kafka node. So you can create a node pool for the brokers, you can create a node pool for the controllers and so on, and you can of course change it as well.
14:03 And to make this a bit more challenging, we had to roll this out while maintaining some backwards compatibility, making sure that the existing streams users can easily move and upgrade into these new things and so on. So that alone was quite a big challenge for us, and it pretty much took the last two years of my life, so hopefully it’ll be worth it. And how does it look afterwards? Is that, yeah, we have now only Kafka noes, but each of the group of the Kafkas, it’s kind of treated separately and it’s kind of using the Kafka node pool to be configured. But because we don’t use stateful sets, we can then easily play with the different IDs, how they will be shared between the nodes and so on, which set would not really allow us. Okay, let me see if my demo works. I wanted to show how you would deploy the Zuki plus Kafka cluster with resi. The way you would do it is basically you would as before use the kind Kafka resource, which will be used to define kind of the Kafka cluster entity. That’s what contains the different shared configurations which apply to the whole Kafka cluster. So for example, the version of Kafka use is currently here, and it’s always shared between all the nodes. The listeners are kind of shared. The Kafka configuration is shared as well. So these are in the Kafka resource as they were before.
15:40 But then the user basically adds the Kafka node pool resource, which defines the different groups of nodes for the Kafka cluster. So in this case, I specify that I want to have these nodes with both the controller and the broker role so that they have both of them. I say that these nodes should have three replicas. I define how many resources should these nodes have, and I also in this case, define what the storage should be used by these nodes and when would I do cube kettle apply? Then if I write it correctly, I get something like this where you can see that I have the Kafka cluster named my cluster and it has three different nodes. And if I would dig into the pots and check the configuration, you would see that it has both of the roles for broker and controller and they can manage the cluster, but also send the messages. But what I can do is I can simply take another resource, so another Kafka old pool, and in this case I specify only that the role should be broker. And I again specify how many replicas do I want, what should be the resources, what should be the storage. In this case, it’s just a simple demo cluster. So the resources are actually the same, but you could of course have them different.
17:12 And then if you notice here I have here some special annotation to make it a bit easier to manage the different id. So I say here that I want these notes to be numbered 100 to 199 and not continue 3, 4, 5. And now when I do cube cu apply and watch the ports. Then what we can see is that we have these three new pods starting here. They are named my cluster broker 100, 101, 102, and these are the new brokers. I’m just adding to my cluster. And if you would already have the node pool, you can add third fork node pool, but you can of course also do just cube scale to add additional brokers to the pool as well. And maybe once these brokers are ready, I decide to migrate to this other architecture where I have the dedicated controller node and the dedicated broker notes. So what I can simply do now is I can do cube cuddle, edit, Kafka, Kafka, which is the Kafka, no pool resource with the mixed rows are created before, and I can simply go and remove the broker roll from the cluster. I can apply this, and when I watch the pots again, I can see that the old pots are now being rolled by the operator and they are being reconfigured from having both of the roles to having only the controller roll. So if you remember the slide with the different architectures, I basically just now moved from the architecture with the mixed nodes to the architecture, which I called as a most production ready with the dedicated controller nodes and with dedicated brokers.
19:20 So that’s the demo.
19:25 So actually I cheated a bit because of the time, I didn’t have any clients there, so there were no data. If there would be some data, then the operator would automatically block the change and you would have to reassign the data to the brokers. For example, we have a cruise control integration, so you could use cruise control to kind of move them to the new brokers, which I created, and then the operator would remove the broker role, but that would take several minutes to kind of execute. So that’s a bit tricky for a 25 minute talk, sorry.
Speaker 2: 20:09 Basically
Speaker 3: 20:16 You need to bring the cruise control, do all the needed, and only after that you can remove the roll and maybe it could be implemented with the operator, just you are removing the brokers. The data should be evacuated from there.
Speaker 1: 20:33 So I hope that one day we get to these things such as triggering the cruise controller balances automatically when these changes happen. To be honest, we don’t have that yet. It’s only something we are planning or talking about, and one of the reasons is that we spent the last few years working on craft. But yeah, it’s a good point that it’s something that we are thinking about, but already today, you don’t have to use the APIs of cruise control yourself with coral or something like that. We have custom resources, which you can use to kind of trigger the removal of the node. So it’s not completely automated, but if you would want, you can for example, write some simple shell script to do the automation yourself.
Speaker 4: 21:15 Thank you.
Speaker 1: 21:24 So the question was if the brokers are being run on the same node in this case, they are probably not because in the background is OpenShift cluster with multiple worker nodes. But
21:40 Yeah, I just simplified the Yammer because if I had all the affinity rules, topology spread constraints and so on, then the YAML would be super long if you want, in the autumn, in Chicago, in the date on Kubernetes day, I did a talk about making your Kafka cluster production ready, where I cover kind of all of these things you should consider for production like affinity, configuring, direct awareness, and so on. There’s YouTube recording, so that might be a good thing to check for all the things which I kind of omitted from the demo to make it a bit easier.
Speaker 5: 22:14 I asked this because they 100 brokers, they were running quite fast, so that means the notes still were already there or this is what
Speaker 1: 22:25 Yeah, the worker node already existed there.
Speaker 5: 22:27 Okay.
Speaker 1: 22:30 I wanted to touch a bit on the different risks, which I might see behind this implementation. One of them is around managing the pots ourself, because that means that you take over a lot of responsibility, which otherwise Kubernetes does for you. When you create a stateful set, then the pots are actually managed by Kubernetes, not by you. And that means if there is some disaster in your Kubernetes cluster, and if for example, an availability zone goes down or some nodes crash, then it’ll be Kubernetes who will be kind of responsible for risk spinning the pots. When you manage it yourself without a stateful set, then now it’s who’s responsible for it. And if something fails, then yeah, you might have a problem. And let’s be honest, the stateful set controller and Kubernetes is way more used and way more proven than remsey. Also, I think we have our share of users, so that’s kind of one risk.
23:26 Another risk, I’m a bit curious to see how much do people get used to the combination of the Kafka and Kafka pool resources? I don’t think we are the only operator using multiple customer resources like this, but it’s a big change from what we did before. So I’m a bit curious how do people get used to it. I also wanted to touch on where we are today, just last week, released from Z zero 40, which is a big release for us in terms of craft support because it enables it by default. So you can now just choose on your own whether you want to use craft or zookeeper based clusters. And it’s also the first version, which adds support for migrating the existing streams based, zookeeper based Kafka clusters to craft. So yeah, if you have some existing clusters, you can try that as well.
24:20 There are also some limitations, which I don’t think are necessarily production critical, but yeah, let’s be open about them and let’s let everyone decide if they want to run crafting production or not. For example, there are some limitations around scaling the controllers. There are some limitations around unclean leader election. These are actually in Kafka itself. On the ZI side, we for example, have some limitations around the jbo support for using multiple discs in the craft nodes. Hopefully this will be all addressed later this year. But yeah, I think in many cases you can use it in production even without it. And finally, going back to the timeline, which I shared on the beginning, and let’s talk a bit more about what’s next. Later this year, probably October or something like that, Kafka will release the version 4.0, or that’s current plan. Of course, plans change sometimes. So in the Autumn Kafka release, Kafka 4.0, we should remove the zookeeper for good, and then probably early 2025, we will kind of follow that in rims and remove support for zookeeper from STMs as well. I guess we will try to provide some bug fixing and CV fixing for the last version with support for zookeeper for some time. But yeah, this is for someone maybe going a bit too quick, but you can’t stop the progress. So sooner or later, every Kafka user will have to move to Kraft.
25:55 And that’s it. Thanks for your attention, and I will be around for the rest of coupon. So if you would’ve anything else around zi or Apache Kahan, Kubernetes, or even about this stock, then feel free to stop me whenever you see me or ping me on the CNCF slack and so on. Thanks.
Data on Kubernetes Community resources
- Check out our Meetup page to catch an upcoming event
- Engage with us on Slack
- Find DoK resources
- Read DoK reports
- Become a community sponsor