
Adding Zonal Resiliency to Etsy’s Kafka Cluster

Kafka is an important part of Etsy’s data ecosystem, moving data that powers a number of things like analytics, A/B testing, and search indexing. Etsy runs its Kafka cluster on Kubernetes and in 2022 the data team made an effort to ensure that the cluster could withstand a GCP zonal outage. This created an interesting opportunity for the team to change the way they approached rolling out changes to the Kafka brokers. They were able to cut the time for these updates from 7 hours to a little over 2.

In this talk, Kamya discusses the changes made to add zonal resiliency to the Kafka cluster that enabled them to speed up their update process.

Speakers:

Kamya, Senior Software Engineer at Etsy

Watch the Replay

Read the Transcript

Speaker 1:                      00:00                Okay. All right. Thanks everybody for joining us. My name's Paul, I'm the community manager here at DoKC. I'm really excited — today we have Kamya, a senior software engineer at Etsy. She's going to be presenting some of the work she's been doing with Kafka over at Etsy. Before we get too far into that, I want to make a few announcements for the community. First up, I just want to thank our community sponsors, in particular our gold community sponsors, Google Cloud and Percona. You can see the full list of our sponsors on the DoK website at dok.community/sponsors, so check that out, and if you're interested in becoming a sponsor, you can reach out to me. Just a few housekeeping things: next month we will not be having a town hall, because we'll be at KubeCon.

                                        01:08                So hopefully we'll see you there. We'll be there for DoK Day on March 19th in Paris, and through the rest of the week, so look out for us there — we can connect and chat. Next, we have an event coming up: we'll be at the DoK track at SCaLE. You can use the QR code to get the link to the event; that's going to be on Friday, March 15th. So if you can't make it to Paris for KubeCon and you're on the West Coast, you should definitely check out DoK at SCaLE in Pasadena — that'll be on Friday, March 15th, starting at 10:00 AM. And then I want to announce we have a new local meetup in Tuscany, organized by Jonathan Battieto. If you're in that area, you should definitely check it out — this is the first one, and it's one of our newest meetup groups. Also, if you're interested in starting a meetup group, you should definitely reach out to me and we can help organize that for you.

                                          02:29                Oh, sorry — you can use the QR code there; that's the link to the meetup page. And then of course the big one is DoK Day at KubeCon in Paris on March 19th. You can still register; late registration rates are in effect now until March 16th, so don't wait too long to buy tickets if you want to go, because then the price will go up. All right, that is the end of my announcements, so I can hand it over to Kamya. Thank you again, really glad you could make it and present to us. Oh, actually one more announcement, sorry: if you have questions, you can put them in the chat and we'll ask Kamya at the end. So feel free to put those in the chat and we'll get those questions answered later. Okay, thanks.

Speaker 2:                      03:24                Cool. Can you see my screen and hear me?

Speaker 1:                      03:28                Yes.

Speaker 2:                      03:29                Okay, cool. Thank you all for coming. I'm Kamya, I'm a software engineer at Etsy, and today I'm going to tell you the story of how we ended up improving the speed of our deploys after making our Kafka cluster resilient to a zonal outage. If you're unfamiliar with Etsy, it's basically an online marketplace for creative goods. There are about a hundred million listings on the Etsy marketplace, and those hundred million listings translate to a lot of data that needs to be processed, analyzed, and used in some way. One of those listings is this Harry Potter-inspired spoon I bought for myself, and I found it on Etsy using Etsy search. Kafka is the system that schleps a lot of the data that feeds into our search models, and Kafka powers a lot of other things that Etsy does. My team at Etsy is responsible for the Kafka infrastructure, and we run Kafka on Kubernetes.

                                          04:26                I'm going to start off by giving you a brief overview of how data is stored in Kafka and how we have our Kafka cluster set up on Kubernetes. To start off with, what is Kafka? It's a distributed streaming platform designed to handle real-time data feeds. A Kafka cluster typically consists of multiple Kafka brokers, and a Kafka broker itself is basically a distributed system component that handles a portion of this data. An individual piece of data that is received is called a Kafka message, and these Kafka messages are logically grouped into topics. You can think of a Kafka topic like a table in a database, and a message as a row. The data in a topic can then be broken up into partitions; each partition holds an independent chunk of the topic's data. And to ensure resilience, each partition of a Kafka topic is replicated across multiple brokers. At Etsy, we have this replication factor set to three. So to recap: a Kafka topic can have multiple partitions, each partition contains a unique chunk of that topic's data, and a topic partition will have three replicas, each of which is a copy of that data.

                                          05:44                The diagram on this slide shows how these replicas and partitions are distributed across our brokers. It shows two topics, topic A and topic B. Topic A has two partitions, partition one and partition two. Each partition has three replicas and is distributed across three different brokers. Topic B has one partition, and that one partition also has three replicas that are distributed across three brokers. So now that we're all Kafka experts, let's talk about our Kafka setup on Kubernetes. Etsy uses Google as a cloud provider, so we leverage Google Kubernetes Engine to set up a Kubernetes cluster for us. We deploy Kafka as a Kubernetes StatefulSet in its own namespace in that cluster, and each Kafka broker runs in its own Kubernetes pod. The data for a Kafka broker is backed by GCP persistent disks. In the previous slide, we spoke about how data for a topic partition is spread across three brokers. To tie this back into Kubernetes, this means that data for a single topic partition is spread across three broker pods and their persistent disks.
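
The talk doesn't show the manifest itself, but a minimal sketch of a StatefulSet of this shape might look like the following; the names, image, storage class, and sizes are illustrative assumptions, not Etsy's actual configuration.

```yaml
# Hypothetical sketch of a Kafka StatefulSet: one broker per pod, each
# backed by its own GCP persistent disk via a volume claim template.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: kafka            # Kafka gets its own namespace
spec:
  serviceName: kafka-headless
  replicas: 6                 # six broker pods: kafka-0 .. kafka-5
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: example.registry/kafka:3.6   # illustrative image
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-rwo        # GKE's PD-backed storage class
        resources:
          requests:
            storage: 1Ti                       # illustrative size
```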

                                          07:03                Now let's talk about fault tolerance. In the ideal case, a Kafka topic partition will have all three replicas available and ready to serve data, which is to say that all three Kubernetes pods running those Kafka brokers would be online at all times. But that's not actually how things work, right? Kafka is a distributed system, and it comes with some amount of fault tolerance. We have configured our Kafka topics to need at least two out of three partition replicas to be online and ready in order to be able to write to a topic. If you think back to the distribution of data across Kafka brokers, it means that we can stand to lose one pod without impacting the availability of our cluster.
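
That two-out-of-three requirement typically corresponds to topics created with a replication factor of three and min.insync.replicas set to two, so producers using acks=all keep working with one replica offline. A minimal sketch using the kafka-python admin client; the topic name and bootstrap address are made up for illustration.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Illustrative bootstrap address; not Etsy's actual cluster.
admin = KafkaAdminClient(bootstrap_servers="kafka-headless:9092")

# Three replicas per partition, and at least two must be in sync to
# accept a write from acks=all producers, so the topic tolerates one
# broker/pod being down without losing write availability.
topic = NewTopic(
    name="example-topic",
    num_partitions=2,
    replication_factor=3,
    topic_configs={"min.insync.replicas": "2"},
)
admin.create_topics([topic])
```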

                                          07:49                So now let's talk about what happens during a Kafka update. When I say Kafka update, I could be referring to several things — basically any change we're making to the Kafka brokers. We could be taking on a new version of the open source Kafka framework, which could have new features or bug fixes. We could be updating a broker configuration parameter, which is basically something that changes the default retention settings or compression settings for the cluster. Or we could be changing something related to the Kubernetes objects that we've used to set up the cluster. Remember that our Kafka cluster is able to handle one pod being unavailable at a time, and we leverage this to perform live rollouts. Kubernetes has a built-in mechanism for rolling out changes to a StatefulSet, and it does exactly this: it takes down one pod at a time and applies changes to it.

                                          08:50                And now we're going to talk about that process in more detail. This process in Kubernetes is called a rolling update, and we're going to talk about how Kubernetes and Kafka behave during it. During the rolling update, Kubernetes will take down pods in reverse ordinal order. So if a StatefulSet in Kubernetes has six pods, the update will go pod five, then pod four, then pod three, then pod two, pod one, and finally pod zero. So assuming we have N plus one Kafka broker pods, Kubernetes will start by sending a SIGTERM signal to broker pod N first. Remember, pods are zero-indexed, which is why with N plus one pods the highest ordinal is N.

                                          09:33                Once the pod receives the SIGTERM signal, it'll propagate that to the Kafka process. The Kafka process will then shut down cleanly, and once that is complete, Kubernetes will delete the broker pod. Kubernetes will now start a new broker pod with any changes that we want to make applied to it. After that happens, the Kafka process will start on the new broker pod. Before being ready to receive traffic, the Kafka process has to recover. Recovery entails rebuilding any indexes it has and catching up on replication lag; of these two things, catching up on replication lag takes longer, so I'm going to explain that a little bit more. A Kafka topic partition has three replicas for resilience, and only two out of three of them need to be online for the Kafka brokers to continue accepting writes for that topic partition. So when the newly restarted broker comes online, it needs to catch up on all the messages that it missed, and it does so by reading them from the other brokers that hold the other replicas. We do need to wait until replication lag has caught up before continuing on with the upgrade, to ensure that at least two out of three replicas for each partition stay available. We use a Kubernetes readiness probe to make sure that this replication lag has caught up. Once broker pod N has recovered, Kubernetes will start taking down the next pod, and this will continue until all our brokers have been updated.
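
The probe itself isn't shown in the talk, but one way to express "ready only once replication lag has caught up" is an exec readiness probe that fails while the broker still reports under-replicated partitions. A hypothetical sketch, assuming the standard kafka-topics.sh tool is on the container's PATH and a plaintext listener on localhost:9092:

```yaml
# Hypothetical readiness probe on the Kafka container (not Etsy's actual
# probe). The pod only becomes Ready once the broker reports no
# under-replicated partitions, i.e. the restarted replica has rejoined
# the ISR and its replication lag has cleared.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - >
        [ -z "$(kafka-topics.sh --bootstrap-server localhost:9092
        --describe --under-replicated-partitions)" ]
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 25
```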

                                          11:10                Kafka is used by many systems at Etsy, so the brokers have to stay fully available during upgrades, which is why we had to do this one pod at a time. And the process of taking down one broker at a time, waiting for it to catch up on replication lag before taking down the next one, used to take us about seven hours. Or, to summarize, my day looked like that xkcd meme. The seven-hour rollout was unavoidable for us for a while, until we added zonal resiliency to our Kafka cluster. Let's talk through what it means for our Kafka cluster to be resilient to a zonal outage. To start off with, we moved our brokers across three GCP availability zones. We did this by using a Kubernetes feature called pod topology spread constraints, which allows you to distribute pods across the three GCP availability zones.
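
A pod topology spread constraint for this might look roughly like the fragment below; the label names and values are illustrative, not Etsy's actual manifest.

```yaml
# Hypothetical fragment of the StatefulSet's pod template spec: spread
# broker pods evenly across the cluster's GCP availability zones.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # zone label set by GKE
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: kafka
```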

                                          12:06                So if there was a GCP zonal outage, we would have some Kafka brokers available and ready to handle requests. But to ensure Kafka is truly resilient to a zonal failure, we also have to make sure that the data for each topic is available. That means that we needed to distribute the replicas for each topic partition across these availability zones. So if one replica for a topic partition in our cluster is on a broker in zone A, we need the next replica to be on a broker in zone C, and the one after that to be on a broker in zone F. This way, during a zonal outage, only one replica for a topic partition will be unavailable, and the other two will be online and ready to serve data.

                                          12:51                This brings us to this kind of cool world where we have a predictable distribution of topic partitions across brokers for the first time. Before multi-zone Kafka, every replica for a topic partition would have been on a broker in zone A. But after multi-zone Kafka, we know that if one replica for a topic partition is on a broker in zone A, the second one will be on a broker in zone C, and the third one will be on a broker in zone F. This makes our cluster resilient to a zonal outage. It also means that all the brokers in a zone could be unavailable and our cluster would continue to handle requests. Now think back to the one-broker-at-a-time requirement we had for rollouts. With this predictable distribution of topics, we know that taking down multiple broker pods in a single zone will not affect topic partition availability. This is great for zonal resilience, but it also provides us with an opportunity to restart multiple brokers at a time whenever we upgrade our Kafka brokers.

                                          13:57                So we devised a way to do this, and we called it zone-aware broker upgrades, because naming is very hard. Kubernetes actually natively provides a way to take down multiple pods in a StatefulSet at once during upgrades. This is called a partitioned rolling update, and it's part of the rolling update strategy. But the caveat is that it requires pods to come down sequentially by pod number, and our multi-zone architecture placed every third broker in the same zone. We could have reassigned partitions across brokers to get to the distribution we needed, but that would have been a very lengthy and manual process. So we opted to just write some custom logic to control these updates. And you might be wondering at this point: why didn't we just design this differently? When we were rolling out changes to make our cluster resilient to a zonal outage, we were focused on the zonal resiliency work, and the opportunity to speed up upgrades came way after.

                                          15:10                So the first thing we had to do was change the rollout policy for the StatefulSet. Kubernetes has two ways to control rollouts: the first one is RollingUpdate, and the second one is OnDelete. The RollingUpdate policy is the one we've been talking about so far — Kubernetes will handle rolling out changes for you, and it'll do so one pod at a time. We can configure this RollingUpdate strategy to take down multiple pods at a time using the partitioned update strategy, but we've already ruled that out. The second policy, OnDelete, gives us more control over which pods we restart and when changes are applied. With the OnDelete policy, changes will only be applied once a pod is restarted by the user. So we switched the Kafka StatefulSet to use the OnDelete policy, because we wanted to control how changes were rolled out.
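
The corresponding StatefulSet change is small; a sketch of the relevant fragment:

```yaml
# Hypothetical fragment: stop Kubernetes from rolling pods itself.
# With OnDelete, a pod only picks up spec changes when something
# (our rollout script, in this case) deletes it and the StatefulSet
# controller recreates it.
spec:
  updateStrategy:
    type: OnDelete
```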

                                          16:04                Once we did this, we wrote a simple Python script to control the rollout. On the screen, you'll find a diagram that shows the main loop of our program. The script takes in the number of brokers we want to restart at once. Then the main loop of the program essentially finds some pods in a zone that haven't been updated and deletes them. When a pod in a StatefulSet is deleted, Kubernetes automatically brings it back up. The script then waits for the new pods to become ready and moves on to the next batch. It'll do this until all the pods in one zone have been updated before moving on to the pods in the next zone, and then the next, until all the pods in all the zones have been updated. We obviously didn't want to have to run something like this from our local machines, so we Dockerized the little Python script and ran it as a Kubernetes batch job. The job is actually triggered from a deploy pipeline (mostly just GitHub now), and we added some logic to make sure that we can't deploy multiple versions of the same job, because that would just cause all sorts of chaos on our cluster.
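
Etsy's actual script isn't shown in the talk, but a minimal sketch of the loop described here, written against the official Kubernetes Python client, could look like this. The namespace, labels, batch size, and polling interval are all assumptions for illustration.

```python
"""Hypothetical sketch of the zone-aware rollout loop described in the talk."""
import time

from kubernetes import client, config

NAMESPACE = "kafka"
STATEFULSET = "kafka"
BATCH_SIZE = 3                                  # brokers restarted at once
ZONE_LABEL = "topology.kubernetes.io/zone"

config.load_incluster_config()                  # use load_kube_config() locally
core = client.CoreV1Api()
apps = client.AppsV1Api()


def pod_zone(pod):
    """Zone of the node the broker pod is scheduled on."""
    node = core.read_node(pod.spec.node_name)
    return node.metadata.labels[ZONE_LABEL]


def is_ready(pod):
    """True once the pod's readiness probe passes (replication lag caught up)."""
    for cond in pod.status.conditions or []:
        if cond.type == "Ready" and cond.status == "True":
            return True
    return False


def broker_pods():
    return core.list_namespaced_pod(NAMESPACE, label_selector="app=kafka").items


def outdated_pods_by_zone(target_revision):
    """Broker pods still on the old StatefulSet revision, grouped by zone."""
    by_zone = {}
    for pod in broker_pods():
        if pod.metadata.labels.get("controller-revision-hash") != target_revision:
            by_zone.setdefault(pod_zone(pod), []).append(pod.metadata.name)
    return by_zone


def wait_until_updated_and_ready(names, target_revision):
    """Block until every named pod is recreated on the new revision and Ready."""
    while True:
        pods = {p.metadata.name: p for p in broker_pods()}
        if all(
            n in pods
            and pods[n].metadata.labels.get("controller-revision-hash") == target_revision
            and is_ready(pods[n])
            for n in names
        ):
            return
        time.sleep(30)


# Main loop: finish one zone completely before moving on to the next,
# deleting BATCH_SIZE pods per batch. With the OnDelete strategy, deleting
# a pod is what causes Kubernetes to recreate it with the new spec.
while True:
    target = apps.read_namespaced_stateful_set(STATEFULSET, NAMESPACE).status.update_revision
    remaining = outdated_pods_by_zone(target)
    if not remaining:
        break
    zone = sorted(remaining)[0]
    batch = remaining[zone][:BATCH_SIZE]
    for name in batch:
        core.delete_namespaced_pod(name, NAMESPACE)
    wait_until_updated_and_ready(batch, target)
```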

                                          17:14                And that's it. We had a new zone-aware rolling upgrade strategy, and it could restart multiple brokers at the same time. So what does this actually do? What's the impact? To start off with, it improved our rollout speed. We run this script in production with a parallelism of three, which is to say that we take down three brokers at once. For us, this brought the time it takes to roll out changes down from seven hours in the past to two hours now. And you might be thinking, why not speed this up even further — restart all the brokers in a zone at once and bring this time down even more? When a broker is restarted, it catches up on the data it missed, which is the replication lag, and it does so by reading it from the brokers that stayed online. This adds additional load to the online brokers, which is acceptable during a zonal outage because that's a rare event, but it's kind of risky, and doing this during upgrades seemed unsafe. We stopped testing at a parallelism of three because we were satisfied with the improvements in speed and we were confident that it wasn't affecting availability. It's entirely possible that we could push this limit in the future and cut the time down even more.

                                          18:37                The second impact is fewer client disruptions. We have numerous client applications that connect to our Kafka cluster — some of them are managed by our team, and some of them are managed by other teams at Etsy. When a Kafka broker is taken down during an upgrade, client applications will temporarily lose their connection to Kafka. These applications should generally just connect to a different broker, or to the same broker when it's back up, and recover seamlessly. That's the case in theory, but in practice we've sometimes seen upgrades disrupt these client applications, especially when a client application loses connections and has to reconnect to brokers several times as brokers are taken down and come back up during an upgrade. The apps tend to get into a weird state, which is easily fixable but does require some sort of intervention. Because of this, some teams prefer to stay vigilant when we roll out changes.

                                          19:35                So shortening the time window for disruptions was beneficial both for the client applications themselves and for the teams that manage them, and it reduced toil. Upgrades are necessary and important — we all understand that — and sometimes they come with a lot of new features that you have to learn about. But for the most part, all the excitement and discovery comes from testing these changes in our dev environment, which is how it should be, and we're very happy about that. As a result, prod upgrades are mostly routine and something that we consider toil. So my personal favorite way to measure the success of this project has been centered around reducing toil: did we make our jobs easier or more pleasant in any way? And I think we did. Instead of spending a day just watching graphs and making sure these upgrades are successful, we can now get them over with by lunchtime and do other things. Thank you — that's the end of my talk. If you're interested in learning more about this, this talk is based on two Code as Craft articles that my coworker Andrea and I wrote on the Etsy engineering blog, and you can find them at the links on screen. Thank you.

Speaker 1:                      20:58                Alright, thank you so much. That was great. Yeah, if people have any questions, please add them to the chat. I have a few questions. So you had mentioned fewer client disruptions. Do you have any sort of metrics around that or even anecdotally what the difference was?

Speaker 2:                      21:21                Yeah, we don't have metrics, but the way the process works for us is that we tend to serve a lot of other teams. There are other platform teams that aren't directly working on client-facing features — we're the infrastructure platform team, and then there are the client-facing teams. What we typically did was notify them in advance, like a week in advance, and tell them, hey, the cluster is going to be going through an upgrade for a period of time, maybe a day — and now we can say, hey, this is going to be more of a half-day affair. As for the disruptions themselves, it's kind of an interesting issue that we have sometimes, where Kafka clients get into this weird state: they don't refresh their metadata fast enough, so they don't receive fresh broker IPs. Or sometimes, when they keep losing connections, the liveness probes don't kick in on time and they get into this backoff state. A simple restart generally fixes it, but you do have to restart the app.

Speaker 1:                      22:30                And what was the impetus for the change in the first place for doing this strategy?

Speaker 2:                      22:37                Yeah, I think I was assigned to perform an upgrade and I was just sitting there — well, not sitting there, I was doing other things, but I was also looking at graphs — and I was like, why are we doing this? Why does this take so long? I was kind of just thinking about things while staring at graphs, and I messaged my coworker Andrea at the time and was like, hey, I have this idea, should we do this? And he was on board, so we decided to make this into a mini project.

Speaker 1:                      23:09                Oh, awesome. Okay. I have a question here from the chat: how ephemeral is the data for each topic, and has this increased your cost by two x based on replicating the data?

Speaker 2:                      23:26                Good question. So data in Kafka topics typically has retention settings. Kafka is generally not the final location for this data; it's generally sent on to some sort of more cost-efficient storage — GCS or BigQuery in our case, because we're kind of a Google shop here. So data lives in Kafka for maybe a day, sometimes five days, and then it's sent over to an external system. Has this increased our cost by two times? Replication in Kafka is a built-in thing regardless of these upgrades. We replicate data across three brokers, and that's just how Kafka is built — we need fault tolerance, and we want to make sure that topics are available when a single pod is down. And that's completely normal in Kubernetes: it's not very frequent, but sometimes pods will go down because a node goes away, or just something weird happens and the pod crashes. So we're prepared for that.

Speaker 1:                      24:36                Okay, awesome. Yeah. Are there any other tools that you're using to make this perform better — sorry, to make Kafka run better? Or, I guess I should say, is there anything specific about Kubernetes that makes it a good fit for running stateful workloads like Kafka?

Speaker 2:                      25:03                Yeah, so Kafka backs up its data with persistent disks, and running it as a Kubernetes StatefulSet basically makes sure that if a Kubernetes pod is restarted, we still have that data and we're not rebuilding or re-reading data from different brokers. So Kafka on Kubernetes works out really well. The move to the cloud predates when I joined the team — that was a decision made before my time, so I don't have the history on it, but I inherited it and I think it works well.

Speaker 1:                      25:47                Okay. Awesome. All right. Does anybody else have any other questions they want to ask? Wow, I think we've run out of questions, but thank you so much, Kamya. I really appreciate you taking the time to talk with us. This was really interesting, and I'm glad that you could make it.

Speaker 2:                      26:12                Thank you.

Speaker 1:                      26:14                All right, well, you can watch this later on YouTube if you want to re-watch it, and hopefully we'll see some of you at KubeCon in March. All right, everybody have a nice day, and thank you so much for joining. Bye.

 
