Published March 27, 2022
This talk introduces the Druid operator and shows how Kubernetes and the Operator framework can be used to write an operator that enables provisioning, management, and scaling of a complex Apache Druid cluster to thousands of nodes.
It covers why Kubernetes and the operator framework are a perfect fit for managing a complex stateful application, the learnings and pitfalls faced while writing the Druid operator, and the special considerations to take into account when managing and scaling a stateful application.
Bart Farrell 00:00
This is always fun. All right, we’re going live, we’re very much live, we are super live. We are live and we are here direct with everybody. Very happy to be in our meetup. I believe this is officially announced as meetup number 91. If I’m mistaken and it’s meetup number 92, I apologize; when you get to so many meetups, it just gets a little bit difficult to keep track. A couple of things before we get started. The Data on Kubernetes Community will be celebrating DoK Day very soon in Los Angeles. We were talking about that earlier, the different ways of pronouncing that word. If you have not signed up already, it’s totally free and very easy to sign up. We already have 2,000 people signed up. Don’t worry, we have capacity for everybody; no one’s going to be left out. We’ve got a bunch of talks planned for that, and we’ve still got a couple of panels that we’re going to be telling folks about. We’ve got a big announcement coming out tomorrow. No spoilers! Not going to tell anybody. We also have, on October 6, next week, DoK Students Day. If you’re more on the beginner side, just getting started with data on Kubernetes, we will be having about 30 talks, 25 of which will be given by students, most of whom have never given a talk before. So that’s also a very exciting day for us. It’s the first time we’ll be doing a DoK Students Day and definitely not the last; very excited about that too. Now, for today, we’re going to be talking about an operator, and it’s not the first time we’ve talked about operators. Today, we’re going to be talking about the Apache Druid operator and, more importantly, with an extremely friendly, intelligent person who works for Rill Data. His name is Adheip Singh. He was also telling me some other names that he has, but we can get into that later. Adheip, it’s very nice to have you with us today.
Can you just tell us a little bit about who you are, where you’re from and what we’re going to learn about today with you?
Adheip Singh 01:41
Thanks, Bart, for the nice intro. My name is Adheip Singh and I’m currently based in Bengaluru, India. I’ve been working on Kubernetes since 1.3 was released, so I’m pretty familiar with this ecosystem.
Bart Farrell 02:00
Really quickly, if folks don’t know, when was version 1.3?
Adheip Singh 02:06
I guess two and a half years back.
Bart Farrell 02:08
That’s pretty good. That’s pretty, like, OG, like original. You have a good amount of experience and that’s cool.
Adheip Singh 02:15
Yes, it was approximately 2018 when I started with that. I work for Rill Data, which is a managed Apache Druid platform, a SaaS platform for Druid. My day-to-day role is mostly around Kubernetes, development of the operator in Go, and running Big Data pipelines on Kubernetes. So, a lot of StatefulSets and data being run on Kubernetes.
Bart Farrell 02:44
Wow, that’s perfect for our community. That’s exactly what we want to hear; that is music to my ears. And talking about that, what are some of the challenges of working with getting Big Data pipelines on Kubernetes? Is it difficult?
Adheip Singh 02:56
Yes, it’s extremely challenging and difficult. That’s what operators are for: solving those problems. Otherwise, you could simply run Helm and StatefulSets. The main challenge of running Big Data systems is that they were not designed to run on Kubernetes and containers, so you need to build a bridge between what your application wants and what Kubernetes interprets. Kubernetes does not understand ‘Hey, I’m scaling a Druid pod or a Kafka pod’; it just knows it’s scaling a pod. But the ways of scaling and of handling certain edge cases are different for each application. You want to build that application logic into Kubernetes, and that’s where the operator comes in. It’s basically reconciling what you want with how it can be done on Kubernetes. So that’s where the challenge lies. Plus, StatefulSets always have their own challenges: data in different parts of the cloud environment, different zones and how you handle all of those, StatefulSet deletions, PersistentVolumeClaims (PVCs), and the different kinds of Storage Classes you use when provisioning PVCs. All of them have their own challenges when you want to run things at scale. That’s what I’ve been working on for the past year with Rill, so it’s been great and challenging altogether.
Bart Farrell 04:29
Very good. We’ll probably be unpacking a little of this as we go on. Like I said, operators are a pretty solid topic that has come up again and again. Looking at all the different operators that are out there and at the process of building an operator: do you have the staff that can dedicate the time that’s necessary to do that? How long is it going to take you? What is the value that is going to be provided? Seeing all this, the operator ecosystem is interesting. And also thinking about, if not operators in the future, what might there be? We can get to those questions later. If you want to share your screen right now, go ahead. Just as a reminder to folks in the audience, feel free to ask questions whenever you want in the chat and we’ll be happy to answer them accordingly.
Adheip Singh 05:19
I hope my screen is visible.
Bart Farrell 05:20
It looks great.
Adheip Singh 05:25
Apache Druid on Kubernetes. I’ll quickly go through the agenda I have for today. We’ll do an intro to what Apache Druid is, how and when to extend Kubernetes, and whether I should use Helm or an operator, which is one of the most common questions you get asked. Then, what is the difference between an operator and a controller, an introduction to the Druid operator, and how you scope the Druid operator: watching namespaces and all that stuff. Then, understanding the Druid CR and how it is scoped, and some of the operator’s features: how it handles orphaned PVCs, how you can scale Druid nodes vertically with volume expansion, what Kubernetes finalizers are, and rolling upgrades of Druid nodes. We also have a small kubectl-druid-plugin to help with managing the Druid CR. There’s also a Druid-k8s-extension, which removes the dependency on ZooKeeper, so a brief intro about that, and something about how we are running Apache Druid on Kubernetes at Rill.
What is Apache Druid? Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics on large data sets. It’s columnar, it’s OLAP, and it’s an Apache Software Foundation project. It’s been quite popular in the OLAP space. The most common use cases are clickstream analytics, risk/fraud analysis and server metrics storage. Druid is designed around five or six kinds of nodes, and to run a Druid cluster you require a minimum of five nodes. It has various ways of ingesting data and a lot of connectors: you can ingest data into Druid using Kafka, HTTP or S3. So, it supports both real-time and batch ingestion.
I’ll go through the Druid architecture a bit. This architecture is what the operator deploys on Kubernetes. At a high level, Druid is divided into three kinds of servers: master, query and data servers. It also has external dependencies, which are the Metadata Store, ZooKeeper and Deep Storage. Deep Storage is an object store where all your data is kept safe. ZooKeeper is generally used for leader election and service discovery, whereas the Metadata Store keeps the record of your metadata, so it can be MySQL or Postgres. Middle Managers and historicals are the data nodes. Whenever you submit a spec to a Druid cluster to ingest data, the Middle Managers are responsible for indexing it. They run those tasks in real time. As soon as a task’s spec is submitted and the ingestion starts, you can query the data. That’s what Druid is popular for: real-time. Once that data goes to Deep Storage, it’s a segment and it’s immutable in nature; that segment is pulled onto historical nodes and kept on disk for querying. You can segregate your historical data into hot, cold and further tiers. Routers route your queries to the brokers, and brokers in turn query either the Middle Managers or the historicals, depending upon where your data is. Master servers are basically coordinators and overlords. The coordinator handles all the coordination between the Druid components. You can also run the coordinator and overlord inside a single pod, and that’s what we’ve been running. So, this is the Druid architecture.
Now we want to run Druid on Kubernetes, and the question comes up: when and how can I extend Kubernetes, and why should I? Druid, like any Big Data system, is complex and was not designed to run on Kubernetes. So, you want to make sure that Kubernetes understands application-specific knowledge. Kubernetes does not know I’m running a Druid node or a Druid pod; it just sees everything as a pod, a StatefulSet or a Deployment. So, you want to build a bridge between what your application understands and what Kubernetes understands. The operator is basically a reconcile loop, a state manager, which asks ‘Hey, what is your current state and what is your desired state?’ and makes the transition between those two.
Kubernetes gives us a lot of options to extend it. You can extend it by using a custom API server, which is the API aggregation layer, where you’ll have your own API server and your own etcd. It’s a very big thing to build out. In terms of Druid, we have a custom resource definition, a CRD with its own GroupVersionKind; you can use those. There are custom schedulers, if you want to have your own scheduling algorithm for scheduling pods. Then there are admission webhooks. Whenever you send a request with kubectl, it goes through authentication, authorization and admission control. First your request is authenticated, for example through the X.509 client certs in your kubeconfig, then it gets authorized through RBAC, and then there are the admission controllers. There are two types of admission webhooks, validating and mutating, so you can have your own validating and mutating webhooks accordingly. Then the request goes to etcd and gets persisted. You can also have your own cloud controller managers: Google Cloud has its own, AWS has its own. And there are, again, operators and controllers where I can just submit a manifest to Kubernetes and it will provision an S3 bucket for me. Custom controllers are basically shared informers hooked together, and they usually serve one of your custom use cases. If any image is running on your Kubernetes cluster and you want to validate it, you can scan that image; so something like that is a basic custom controller.
This is one of the questions which we get asked often, even in the community: should we use a Druid Helm chart or a Druid operator, and what’s the difference between the two? The answer is that you should leverage both of them. Helm is a configuration management tool for templating large manifests. It’s just a templating tool; what Helm does not do is maintain the state of the application. Once I’ve run helm upgrade --install, that’s it. After that it’s Kubernetes’ job to turn the manifests into its native objects. What Helm does have is a revision number for each release it deploys, and that is useful for rollbacks. Helm helps with version control of the charts; you can have your own application and your own charts bundled together. What the operator does is maintain the state of the application. The operator is a bridge between Kubernetes and the application. Ideally, in this scenario, the CR is templated out as a Helm chart and applied, and the operator reconciles that CR. If that CR is not appropriate or you want to roll back, you can use Helm to roll back and the operator will again reconcile the state. This is what we have been using at Rill and it is pretty helpful in managing our configurations. As I said, both can be leveraged, each for what it does best.
Operator versus controller. Operator is a word coined by CoreOS, if I remember correctly. Operators can have multiple controllers within them. The controller is the core component; you have the Deployment controller, the StatefulSet controller, the endpoint controller. Each Kubernetes object has a controller, and each Kubernetes object has its own definition, which is a group, a version and a kind. I’ll get to those later. So, controllers are a core component of Kubernetes. If you look at this for loop, what a controller or an operator does is get the desired state, which is your custom resource, get the current state, and then the reconcile loop makes sure the current state of all the deployed objects matches the desired state. Operators, more specifically, watch custom resources, whereas controllers can be generic: they can just talk to native Kubernetes objects without having a custom resource definition or needing you to apply a manifest. That’s the only differentiator; apart from that, operators and controllers are mostly the same, and they can be packaged together. And that’s what the Druid operator does.
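As a rough illustration of the reconcile loop described above, here is a minimal sketch in plain Go. It uses only the standard library; the `State` type and the `reconcile` function are hypothetical names, and the string actions stand in for the API calls a real controller would make.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical stand-in for cluster state:
// workload name -> replica count.
type State map[string]int

// reconcile compares desired and current state and returns the actions
// needed to converge. Once current matches desired it returns nothing,
// which is what makes repeated reconciliation safe.
func reconcile(desired, current State) []string {
	var actions []string
	for name, want := range desired {
		have, exists := current[name]
		switch {
		case !exists:
			actions = append(actions, fmt.Sprintf("create %s replicas=%d", name, want))
		case have != want:
			actions = append(actions, fmt.Sprintf("scale %s %d->%d", name, have, want))
		}
	}
	for name := range current {
		if _, wanted := desired[name]; !wanted {
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // map iteration order is random; sort for stable output
	return actions
}

func main() {
	desired := State{"historicals": 3, "brokers": 2}
	current := State{"historicals": 1, "routers": 1}
	for _, action := range reconcile(desired, current) {
		fmt.Println(action)
	}
}
```

A real operator would read the desired state from the custom resource and the current state from the cluster, then run this comparison every time an event arrives.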
Bart Farrell 15:23
One thing really quickly, one of our speakers one time said that CRD is his favorite feature on Kubernetes. Do you agree? Yes or no is fine. And if not, what is your favorite feature?
Adheip Singh 15:38
Yes. CRD is, of course, one of my favorite features. When you want to start developing on Kubernetes, it gives you various places to start introspecting the code base: you can develop CNI plugins, you can develop API servers, schedulers. I found that controllers are fun to start off with, and when you work on controllers, eventually you will, of course, work on CRDs. They are the easiest way to extend Kubernetes quickly. Plus, a lot of frameworks are already available, like Kubebuilder and the Operator SDK. So yes, definitely.
Bart Farrell 16:18
Adheip Singh 16:21
Intro to the Druid operator. What is the Druid operator? It’s in the druid-io repo and it’s been almost two years since the operator was open sourced. I’ve been contributing to this operator since it was two weeks old. Initially, it was using the Operator SDK framework, and later it moved to Kubebuilder. It’s built in Go. Go is the primary language when you need to develop for and extend Kubernetes, because Kubernetes itself is written in Go, so it helps in development. For the GroupVersionKind, the group is druid.apache.org, the version is v1alpha1 and the kind is Druid. The kind is what the reconcile loop, the operator, looks for. It says ‘Hey, I have a manifest with kind Druid. I’ll now reconcile the state.’ So, it reads through the manifest and creates all the objects. It’s the same way you have apps/v1 StatefulSets, where StatefulSet is the kind, apps is the group and v1 is the version. In the same way, you have a custom resource definition according to your use case. The Druid operator enables provisioning, management and scaling of complex Druid clusters up to n nodes. You can keep on adding Druid nodes and the operator will keep on creating them and scaling them out. How does the operator run? It just runs as a Deployment, and we have Helm support for that installation. To create Druid clusters, the operator needs a custom resource, which the Druid operator will reconcile. So it’s like CRD, then you create a CR out of … the CRD is the spec you submit to the apiextensions API, which is the primary API group for extending Kubernetes.
Scoping the Druid operator is something to decide when you do a deployment: do you want the operator to watch a single namespace or multiple namespaces? As of now, the operator supports watching a single namespace, and it also has another feature known as the Deny List. With it, I can say I want to watch all the namespaces except kube-system, default, or some ABCD namespace. The Deny List is basically a set of predicates which filter those namespaces’ events out before they are added to the WorkerQueue. I’ll go over that later with a diagram that shows how the predicates and the WorkerQueue work. The operator watching multiple specific namespaces is still in a pull request and should be merged soon. I have tested it out on my clusters and it looks good to me. The Deny List can be used if you want to exclude certain namespaces, as in ‘hey, you can watch everything but not these namespaces’. So, you have three options for scoping the Druid operator.
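To make the Deny List idea concrete, here is a minimal sketch of a namespace filter in plain Go. It is not the operator’s code and does not use controller-runtime; `Event`, `denyListPredicate` and `enqueue` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Event is a hypothetical, simplified stand-in for a watch event on a Druid CR.
type Event struct {
	Namespace string
	Name      string
}

// denyListPredicate returns a filter that reports whether an event may be
// enqueued. Events from denied namespaces are dropped.
func denyListPredicate(denied []string) func(Event) bool {
	set := make(map[string]struct{}, len(denied))
	for _, ns := range denied {
		set[ns] = struct{}{}
	}
	return func(e Event) bool {
		_, blocked := set[e.Namespace]
		return !blocked
	}
}

// enqueue applies the predicate and returns the namespace/name keys that
// would reach the work queue.
func enqueue(events []Event, allow func(Event) bool) []string {
	var queue []string
	for _, e := range events {
		if allow(e) {
			queue = append(queue, e.Namespace+"/"+e.Name)
		}
	}
	return queue
}

func main() {
	allow := denyListPredicate([]string{"kube-system", "default"})
	events := []Event{
		{Namespace: "default", Name: "test-cluster"},
		{Namespace: "druid", Name: "prod-cluster"},
	}
	// Only the CR outside the denied namespaces is queued for reconciliation.
	fmt.Println(enqueue(events, allow))
}
```

Filtering before the queue means a CR applied in a denied namespace never reaches the reconcile loop at all.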
The Druid custom resource is the CR. You can create multiple CRs and, depending upon how your operator is scoped, it will watch all of them and deploy them. The Druid operator supports Druid deployment only in a distributed mode; that means all your historicals, coordinators and overlords will be deployed as separate pods, not as one. Druid has a quickstart.sh and similar testing capabilities where you deploy everything in one VM, but the operator does not have any of those modes; it will deploy in a distributed mode. The way the CR is defined, you have key-value pairs that are common to all nodes and key-value pairs that are specific to individual nodes. Let’s say I want to define an image, say apache/druid:0.22.0. That image can be common to all your nodes, all the five or six nodes you’re deploying, or it can be specific to some of them, as in ‘Hey, this image should only go to historicals, and I want to have a different image for Middle Managers’. So, you have the option to scope them as and when you need. The spec is user friendly.
Druid nodes can be deployed as StatefulSets as well as Deployments. Initially, the operator only supported StatefulSets, so everything was deployed as a StatefulSet. But now you have the option of deploying StatefulSets as well as Deployments. What is stateful in nature? Historicals and Middle Managers are stateful in nature; they need persistence to store the segments and data. Brokers, coordinators and routers, on the other hand, don’t need much persistence and can just be run as Deployments. You can use either. You just need to mention the kind in your node spec. The kind can be StatefulSet or Deployment, but the Druid operator defaults to creating a StatefulSet, so you need to explicitly mention Deployment.
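The common-versus-node-specific scoping and the StatefulSet default can be sketched as follows. This is an illustration only: the struct and field names are hypothetical and are not the operator’s actual CR schema, and the image names are placeholders.

```go
package main

import "fmt"

// NodeSpec is a hypothetical, trimmed-down node entry in the CR.
type NodeSpec struct {
	NodeType string // e.g. "historical", "middleManager"
	Image    string // optional node-specific override
	Kind     string // "StatefulSet" or "Deployment"; empty means default
}

// ClusterSpec is a hypothetical, trimmed-down cluster spec.
type ClusterSpec struct {
	Image string              // common image for all nodes
	Nodes map[string]NodeSpec // arbitrary key -> node spec
}

// resolveImage prefers the node-specific image and falls back to the common one.
func resolveImage(c ClusterSpec, n NodeSpec) string {
	if n.Image != "" {
		return n.Image
	}
	return c.Image
}

// resolveKind defaults to StatefulSet unless the node spec says otherwise.
func resolveKind(n NodeSpec) string {
	if n.Kind == "" {
		return "StatefulSet"
	}
	return n.Kind
}

func main() {
	cluster := ClusterSpec{
		Image: "example/druid:common",
		Nodes: map[string]NodeSpec{
			"brokers":        {NodeType: "broker", Kind: "Deployment"},
			"historicals":    {NodeType: "historical"},
			"middlemanagers": {NodeType: "middleManager", Image: "example/druid:custom"},
		},
	}
	for _, key := range []string{"brokers", "historicals", "middlemanagers"} {
		n := cluster.Nodes[key]
		fmt.Printf("%s: image=%s kind=%s\n", key, resolveImage(cluster, n), resolveKind(n))
	}
}
```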
How does validation of the Druid CR happen? If I’m submitting a Druid CR and my image is not mentioned, what will the operator do without an image? You need to validate the CR which is coming in. The operator has both a schemaless CRD and a CRD with a schema. With the schemaless CRD, validation is internal, so the CR is validated in the reconcile loop. With the schema CRD, there is an OpenAPI schema at the Kubernetes API level which validates your CR to make sure it’s appropriate for the operator to reconcile.
Handling of orphaned PVCs. This is one of the needs we saw while running the operator at Rill, so we built this feature. A common example: you kick off ingestion in Druid; when you want to load data into Druid, you provision something like 50 or 100 Middle Manager nodes, scaling them out to N. Say you scale Middle Managers to 100 pods, so 100 replicas of one StatefulSet are running. When the ingestion is done, those pods are of no use, so you scale them back down to one. Whenever a scale-down happens, the PersistentVolumeClaims which were provisioned by the StatefulSet are orphaned and are of little or no use. They’re just adding storage costs. In those kinds of scenarios, you want them to get cleaned up. So, the operator will see that you have defined a Middle Manager in your CR and, if any of its PVCs are orphaned, it will clean them up. This feature is optional and the default is false. To enable it, you just need to set it to true; deleteOrphanPvc is the key and it’s a Boolean. The operator will then handle deletion of orphaned PVCs.
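One way to picture the orphan check after a scale-down is the sketch below. It relies on the StatefulSet convention that PVCs are named `<claim template>-<StatefulSet name>-<ordinal>`; the function name `orphanedPVCs` and the sample names are hypothetical, and the real operator may detect orphans differently (for example by checking which PVCs are still mounted by a pod).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// orphanedPVCs returns the PVC names that belong to the given
// "<claim template>-<StatefulSet name>" prefix but whose ordinal is no
// longer covered by the current replica count.
func orphanedPVCs(pvcNames []string, prefix string, replicas int) []string {
	var orphans []string
	for _, name := range pvcNames {
		if !strings.HasPrefix(name, prefix+"-") {
			continue // belongs to another StatefulSet
		}
		ordinal, err := strconv.Atoi(strings.TrimPrefix(name, prefix+"-"))
		if err != nil {
			continue // suffix is not an ordinal; leave it alone
		}
		if ordinal >= replicas {
			orphans = append(orphans, name)
		}
	}
	return orphans
}

func main() {
	pvcs := []string{
		"data-druid-middlemanagers-0",
		"data-druid-middlemanagers-1",
		"data-druid-middlemanagers-2",
		"data-druid-historicals-0",
	}
	// After scaling the Middle Managers down to one replica, ordinals 1 and 2
	// are no longer backed by a pod.
	for _, name := range orphanedPVCs(pvcs, "data-druid-middlemanagers", 1) {
		fmt.Println("orphaned:", name)
	}
}
```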
Another feature, which again came up while running Druid on Kubernetes, is scaling Druid nodes vertically. This mostly comes up with the historical nodes, which can be scaled horizontally as well as vertically. There are some Druid-specific scenarios where you don’t want to scale your nodes horizontally and would rather add more storage to them. For example, I have historical nodes with 30 Gi volumes and I want to add 20 Gi more storage to each. In this kind of scenario, the problem is that StatefulSets don’t allow changes to the volume claim templates. You cannot directly edit a StatefulSet and say ‘hey, increase the volume claim templates from 30 to 50’; the update will error out because Kubernetes does not allow this. Earlier, without this feature, it needed manual intervention and you had to do your own workaround. Now that it is built into the operator, the operator performs a non-cascading deletion of the StatefulSet and patches the size of the PVCs owned by the StatefulSet. In my CR, once I change the size from 30 Gi to, let’s say, 50 Gi, the operator will see that there is a difference between the desired and the current size. It will then delete the StatefulSet, but not its children: it performs a non-cascading (orphan) deletion, which means that when your StatefulSet is deleted, your PersistentVolumeClaims and your pods are just orphaned. It is not cleaning up your pods. At that particular point, when there is no StatefulSet, your pods are still serving requests; they can still serve traffic for your Druid cluster. On the next reconciliation, the operator creates a new StatefulSet. This is a work-in-progress PR; it’s almost done. We have been testing it internally at Rill and it just needs to be merged. You’ll see it soon in the main branch of the operator.
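The volume expansion flow can be summarized as a small planning function. This is a sketch under stated assumptions, not the operator’s implementation: `planExpansion` is a hypothetical name, sizes are plain Gi integers, and the returned strings only describe the steps instead of calling the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// planExpansion returns the ordered steps for growing the volumes of a
// StatefulSet, or an error when the change cannot be made.
func planExpansion(currentGi, desiredGi int, expandable bool) ([]string, error) {
	switch {
	case desiredGi == currentGi:
		return nil, nil // nothing to do
	case desiredGi < currentGi:
		return nil, errors.New("shrinking a PVC is not supported")
	case !expandable:
		return nil, errors.New("storage class does not allow volume expansion")
	}
	return []string{
		"delete StatefulSet, orphaning its pods and PVCs",
		fmt.Sprintf("patch each PVC request from %dGi to %dGi", currentGi, desiredGi),
		"recreate StatefulSet with the new volume claim template size",
	}, nil
}

func main() {
	steps, err := planExpansion(30, 50, true)
	if err != nil {
		fmt.Println("cannot expand:", err)
		return
	}
	for i, step := range steps {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

Because the pods are orphaned instead of deleted, they keep serving traffic between the first and the last step.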
The next thing is Druid operator finalizers. Finalizers are, again, a Kubernetes concept for when you want to run pre-deletion hooks: they alert the controller to clean up resources that the deleted object owned. A typical example of a finalizer: I’m running a pod with a PVC mounted on it, and that pod is owned by a StatefulSet. If I delete that PVC, you see the PVC go into a Terminating state but it does not get deleted. If you do a “-o yaml” of that particular PVC, you will see that in the metadata there is a deletion timestamp set and there is also a finalizer present on that PVC. That means that until you clean up your StatefulSet, the PVC will not be cleaned up, because it’s still being used by a pod. In the same way, our use case for the operator was deleting Druid clusters. A large Druid cluster can run lots and lots of nodes, StatefulSets and PVCs. Say 50 or 60 PVCs are running and you delete the Druid cluster; at that point, all your PVCs are orphaned and only PVCs are left. So, the operator makes sure that whenever you issue a delete on the CR, a deletion timestamp is set and the operator clears the StatefulSets and PVCs, so the whole Druid cluster is cleaned up without leaving any Kubernetes objects pending or orphaned in your namespace. This feature is on by default; you can turn it off by setting a flag on your CR. But by default, this is how the operator behaves.
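The finalizer pattern can be sketched in a few lines of plain Go. Everything here is illustrative: `Object` is a hypothetical stand-in for a resource’s metadata, `DeletionPending` stands in for the deletion timestamp being set, and the finalizer name is made up.

```go
package main

import "fmt"

// cleanupFinalizer is a hypothetical finalizer name.
const cleanupFinalizer = "example.com/druid-cleanup"

// Object is a simplified stand-in for a resource's metadata.
type Object struct {
	Finalizers      []string
	DeletionPending bool // true once a delete has been requested
}

func hasFinalizer(o *Object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(o *Object, name string) {
	kept := o.Finalizers[:0]
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// handleFinalizer adds the finalizer while the object is live. Once deletion
// is requested it runs cleanup and only then removes the finalizer. It
// reports whether the object is free to be removed.
func handleFinalizer(o *Object, cleanup func() error) (bool, error) {
	if !o.DeletionPending {
		if !hasFinalizer(o, cleanupFinalizer) {
			o.Finalizers = append(o.Finalizers, cleanupFinalizer)
		}
		return false, nil
	}
	if hasFinalizer(o, cleanupFinalizer) {
		if err := cleanup(); err != nil {
			return false, err // keep the finalizer so cleanup is retried
		}
		removeFinalizer(o, cleanupFinalizer)
	}
	return len(o.Finalizers) == 0, nil
}

func main() {
	cluster := &Object{}
	cleanup := func() error {
		fmt.Println("deleting StatefulSets and PVCs")
		return nil
	}
	removable, _ := handleFinalizer(cluster, cleanup)
	fmt.Println("removable before delete:", removable)

	cluster.DeletionPending = true
	removable, _ = handleFinalizer(cluster, cleanup)
	fmt.Println("removable after cleanup:", removable)
}
```

The important property is that the finalizer is removed only after cleanup succeeds, so a failed cleanup leaves the object in place for the next reconciliation.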
Druid operator rolling upgrades. This is again a very useful feature in the operator. A rolling upgrade, in typical terms, can be: I want to roll out a new image of Druid, or I want to change common.runtime.properties. The Druid operator gives you two options: either you upgrade your cluster in parallel, which means that all the nodes get upgraded at the same time, or you do a sequential rolling upgrade. The Druid documentation defines the order in which Druid needs to get upgraded: historical, overlord, Middle Manager, indexer, broker, coordinator, router. The operator makes an array of all these nodes and at each iteration it upgrades the next one. The operator also has a function that checks whether an object is fully deployed, which means it always waits for the object to be fully deployed before moving on to the next node. For example, I’m upgrading historicals and at a particular point, for some reason, the historical pod goes into a crashing state. At that point, I don’t want to move on to the overlord or Middle Managers, because that same faulty spec could put my whole cluster into a bad state. What the operator will do is halt at that particular point and wait until a user intervenes and fixes the error in the CR. So that’s how the operator does rolling upgrades. Pretty useful! We have been using it internally and haven’t seen an issue so far.
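The sequential upgrade with a halt on failure can be sketched like this. The order is the one listed in the talk; `rollingUpgrade`, `upgrade` and `fullyDeployed` are hypothetical names, and the real operator spreads this work across reconcile iterations instead of a single loop.

```go
package main

import "fmt"

// upgradeOrder is the node-type order mentioned in the talk.
var upgradeOrder = []string{
	"historical", "overlord", "middleManager", "indexer",
	"broker", "coordinator", "router",
}

// rollingUpgrade upgrades node types one at a time and stops at the first
// one that does not become fully deployed. It returns the node types that
// completed and the node type it halted at ("" if none).
func rollingUpgrade(order []string, upgrade func(string), fullyDeployed func(string) bool) (done []string, haltedAt string) {
	for _, nodeType := range order {
		upgrade(nodeType)
		if !fullyDeployed(nodeType) {
			return done, nodeType // wait for a user to fix the spec
		}
		done = append(done, nodeType)
	}
	return done, ""
}

func main() {
	upgrade := func(nodeType string) { fmt.Println("upgrading", nodeType) }
	// Simulate a faulty spec that makes the broker pods crash.
	healthy := func(nodeType string) bool { return nodeType != "broker" }

	done, haltedAt := rollingUpgrade(upgradeOrder, upgrade, healthy)
	fmt.Println("completed:", done)
	fmt.Println("halted at:", haltedAt)
}
```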
A lot has been said about the Druid operator and how things work. At this point, I’d like to show a diagram of how the Druid operator and the controller work. As you can see, the operator is also a controller. We have a Druid controller and we have a manager. The manager is basically controller-runtime, the library which the operator uses underneath. Among the projects the operator uses is Kubebuilder, which was used to scaffold the operator. Kubebuilder uses the controller-runtime library to build out all the reconciliation loops. We are not following all the semantics of Kubebuilder, because the Druid operator is one of the more complex operators. So, we have our own internal client, and the upgrade pattern is a bit different from the de facto one in the Kubebuilder docs.
I’ll quickly go through the Druid controller. The first thing is, whenever you create a custom resource, you have a group-version, you have a scheme builder and you do an AddToScheme. That CRD GroupVersionKind needs to follow a particular scheme of the Kubernetes API. So, you register that scheme, which is done in the manager. Informers are another concept in controllers: whenever an operator needs to make an API call, you don’t want it to always call the Kube API just to list pods, get pods, delete StatefulSets and any number of other things. The informer is a cache which the controller queries. The controller-runtime has its own client, which reads from the cache and writes directly to the API server. There is a watcher and there is predicate-based event filtering. Whenever an event comes in, it can go through a list of predicates; in the Druid operator specifically, we use a predicate to filter out namespaces. That’s what the Deny List was: if you’re watching all the namespaces and you have a Deny List containing kube-system and default, and you then go and apply a CR in the default namespace, you don’t want it to be reconciled. In that case, the operator will just filter it out before sending it into the WorkerQueue. All items in the WorkerQueue are basically a name and a namespace; the name is your CR and the namespace is whatever the operator is scoped to and watching. So, the events come to the reconcile loop, which is where I’ve defined what the Druid controller does. The first thing the operator does is validate the CR; then it creates a list of Druid nodes, each node having a node type. That means a historical is of node type historical and Middle Managers are of node type Middle Manager.
Basically, it is a map of a string to a node spec, which means your string can be anything. You can have Middle Manager 1, Middle Manager 2 and Middle Manager 3; that will be three StatefulSets, while each one’s node type is Middle Manager. Whatever decisions the operator makes are made on the basis of the node type. If you don’t have a node type mentioned in your CR, the operator will not reconcile; it will throw an error. Then the operator creates the config maps, services, deployments and all the other Kubernetes objects which are needed for the Druid cluster to run. All your common.runtime.properties and the runtime.properties which are specific to Druid nodes are created as config maps. These config maps are mounted onto the Druid pods, from where Druid reads all its configuration. The operator also has an internal client. If you go into the operator repo, it has an interface.go, where there is a reader and a writer interface. There is now also an emitter interface to emit metrics and events. The internal client interface has the SDKclient.client embedded inside it and works in sync with the SDK to do all the jobs. This is just a rough diagram which I had built.
The next thing is the kubectl-druid-plugin. This was built while working on Druid. A Druid CR can go up to 1,500 to 2,000 lines of YAML. Whenever you want to make a change, you don’t want to scroll through that YAML file to make it; scrolling through 2,000 lines of YAML can be a pain. For those kinds of scenarios, we built the kubectl-druid-plugin. If I want to scale Middle Managers to five, or scale them down or up, I can just use the kubectl-druid-plugin to do those operations. I don’t need to edit the CR and change replica counts. Basically, kubectl can be extended with custom plugins. The kubectl-druid-plugin simplifies kubectl operations on the Druid CR. You can do update as well as patch operations. You can update an image, you can patch your nodes, you can list all your nodes and CRs, and it has support for six or seven commands. So it’s pretty simple.
There is this extension, Druid-K8s-extension, which was authored by one of the Druid PMC members. Basically, Druid uses ZooKeeper for service discovery. Once this extension is enabled, you don’t need ZooKeeper. The Druid-K8s-extension uses the Kubernetes API server for your node discovery and leader election. What do you need to enable this extension? You need to add a few settings to common.runtime.properties, and your Kubernetes service account will need permissions on pods and config maps, so you might need to create a Role and RoleBinding. Personally, I haven’t tested this at scale, but I’ve tested this extension in my POC and stage environments, and it seems pretty useful. There is an example in the operator repo, which is still a pull request but will be merged soon. Those looking to test it can always leverage it.
Apache Druid and the Druid operator at Rill. At Rill, we are heavily invested in Kubernetes and Kubernetes operators. We use an extended version of the Druid operator, and it does pretty much all the heavy lifting of running Druid clusters. The operator is cluster scoped, meaning it watches all the namespaces, and I have a Deny List to exclude whatever namespaces I don’t want it to watch. Rill is a SaaS platform for Druid, so we have multiple clusters with petabytes of data. Helm is used to template out the CR, so all our custom resources are templated with Helm and, once applied, the operator is responsible for managing the state of the CR and reconciling it. If a configuration goes wrong, we can always use Helm to do a rollback and the operator will again manage the state. At Rill, we emit operator metrics to Datadog, but support for operator metrics for Prometheus is coming. There is a /metrics endpoint on the operator which gives you the default operator metrics, such as how the work queue is performing and all the Go GC stuff. That can be used for monitoring as of now.
So, I have just penned down these learnings from what I have experienced over the past few years. StatefulSet takeaways:
- On deletion of a StatefulSet, PVCs don’t get cleaned up. This is something which I’ve been talking about and the operator will make sure that those PVCs are cleaned up by using the deleteOrphanPvc flag.
- PVCs can be expanded directly by editing / patching the PVC object.
- If I want to scale my PVC from 30 to 50 Gi, it can be done; but there is a prerequisite to it, your Storage Class must support volume expansion. If your Storage Classes are not supporting volume expansion, PVCs can’t be expanded.
- If using local SSDs on Kubernetes, Cluster Autoscaler won’t support scaling of the nodes. We tried out local SSDs at certain points, but could not scale them. This is a pain point to note when anyone is trying to use local SSDs.
- Cascade deletion of a StatefulSet with orphaning (kubectl delete with --cascade=orphan) is something I use pretty often. Whenever you do it, the pods are orphaned and will continue to serve requests; you just need to recreate the StatefulSet.
- A PVC referenced by a pod will not be deleted if you delete it directly; the finalizers will come into play. I talked about this previously: whenever you try to delete such a PVC, the finalizers will block it. It will go into the Terminating state but it will not be deleted. The StatefulSet will need to be deleted first. So, that’s what the Druid operator finalizers do.
- StatefulSets do ordered pod termination. If I’m running 10 replicas of a StatefulSet, the pods will be named my-pod-name-0, my-pod-name-1, my-pod-name-2, my-pod-name-3 and so on, up to my-pod-name-9. When I’m doing a rolling upgrade, the pods will be rolled out from the highest ordinal down: 9, 8, 7, 6 and so on. So, it’s an ordered pod termination.
- Whenever you try to update certain fields of a StatefulSet, for example its selector labels, the Kubernetes API server will throw an error. Apart from a few fields such as replicas, the pod template, and the update strategy, the spec of an existing StatefulSet cannot be updated. So, you need to do a workaround to get that particular StatefulSet updated.
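Two of these takeaways lend themselves to a small sketch: pod ordinals and update order, and the precondition for expanding a PVC. With N replicas the ordinals run from 0 to N-1, and a rolling update proceeds from the highest ordinal down. The function names and the example StatefulSet name are hypothetical; `allowVolumeExpansion` and `spec.resources.requests.storage` are the standard StorageClass and PVC fields.

```python
def pod_names(statefulset_name: str, replicas: int) -> list:
    """StatefulSet pods are named <statefulset-name>-<ordinal>, ordinals 0..N-1."""
    return [f"{statefulset_name}-{ordinal}" for ordinal in range(replicas)]


def rolling_update_order(statefulset_name: str, replicas: int) -> list:
    """A rolling update replaces pods from the highest ordinal down to 0."""
    return list(reversed(pod_names(statefulset_name, replicas)))


def pvc_expansion_patch(storage_class: dict, current_gi: int, requested_gi: int) -> dict:
    """Build the PVC patch for a resize, enforcing the StorageClass prerequisite."""
    if not storage_class.get("allowVolumeExpansion", False):
        raise ValueError("StorageClass does not allow volume expansion")
    if requested_gi <= current_gi:
        raise ValueError("a PVC can only be grown, not shrunk")
    return {"spec": {"resources": {"requests": {"storage": f"{requested_gi}Gi"}}}}


update_order = rolling_update_order("druid-historicals", 4)
resize_patch = pvc_expansion_patch({"allowVolumeExpansion": True}, 30, 50)
print(update_order)
print(resize_patch)
```

The resize patch is only produced when the StorageClass allows expansion, which mirrors the 30 to 50 Gi example in the list above.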
So, these are the key takeaways from the StatefulSets. And this is my presentation. Over to you, Bart.
Bart Farrell 43:02
Very good, very complete. Considering the technical challenges, they’re very well laid out in different parts. Now, a couple of things that I want to know, just for starters, because we always like to do a little bit of comparing and contrasting. We look at the use of Apache Spark, for example, we see that being used in probably more cases than we see with Apache Druid. Why is that and what is it that Apache Druid perhaps offers in those more niche use cases that Spark doesn’t?
Adheip Singh 43:31
Yes. This is more of a Big Data and Data Engineering question. So I’ll try to answer it to the best of my knowledge.
Bart Farrell 43:41
Of course, just based on your experience.
Adheip Singh 43:44
Yes. Spark is batch processing, whereas Druid handles both real-time and batch. You can ingest data into Druid through Spark; that’s a kind of pipeline which can be built. Regarding the Druid operator, it was recently released, and people are still in the process of trying and adopting it. So, hopefully this will push more Druid deployments onto Kubernetes, if that was your question.
Bart Farrell 44:25
No, that’s great. It’s just like when we’re talking about Pulsar versus Kafka: you’ve got the Kafka people and the Pulsar people, and they start fighting each other and disagreeing. But I think a lot of it just comes down to what it is that you’re looking to do, what your objectives are, and based on that, what’s going to be the best way to line things up. You’ve been working with Kubernetes since, as you said, version 1.3. In terms of the evolution, development, and growth that you’ve seen, do you feel like Kubernetes is becoming friendlier to data, or is that still going to be a major challenge for the next few years?
Adheip Singh 45:09
That’s an interesting question. When we say ‘bringing Big Data to Kubernetes’, that’s the ideal sentence to portray. I’ll still say, Kubebuilder and controller-runtime have pretty much matured as a way to develop operators, with all the scaffolding coming in for the manager and adding schemes. Initially, the way controllers were written, you had to literally write down enqueue and dequeue functions on the work queue, then your informers and shared informers used to get hooked in, and then the event handlers. All that logic is now put inside controller-runtime, and it’s pretty easy to build operators and do all the custom logic that you want. Specifically for Big Data, I still feel that there are certain things which can be improved, like local SSDs not working with Cluster Autoscaler and some things to do with ephemeral volumes. But with the maturity of Kubebuilder and controller-runtime, it is becoming much easier to build operators, so that we can focus more on our business logic, which is specifically running Apache Druid on Kubernetes, rather than hooking in informers and work queues and handling all those Kubernetes-specific things.
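To make the older, hand-written controller pattern concrete, here is a deliberately simplified, single-threaded sketch: an event handler enqueues a namespace/name key, and a worker dequeues keys and calls a reconcile function. It is a toy model rather than client-go or controller-runtime code; the class and method names are hypothetical, and retries, rate limiting, and informer caches are left out.

```python
import queue


class MiniController:
    """Toy model of the enqueue/dequeue loop that controller-runtime now hides."""

    def __init__(self, reconcile):
        self._queue = queue.Queue()
        self._pending = set()  # keys currently waiting in the queue
        self._reconcile = reconcile

    def on_event(self, namespace: str, name: str) -> None:
        """Event handler: enqueue the object's key, coalescing duplicates."""
        key = f"{namespace}/{name}"
        if key not in self._pending:
            self._pending.add(key)
            self._queue.put(key)

    def run_until_empty(self) -> list:
        """Worker loop: dequeue keys and reconcile each until the queue drains."""
        processed = []
        while True:
            try:
                key = self._queue.get_nowait()
            except queue.Empty:
                return processed
            self._pending.discard(key)
            self._reconcile(key)
            processed.append(key)


reconciled_keys = []
controller = MiniController(reconciled_keys.append)
controller.on_event("druid", "cluster-a")
controller.on_event("druid", "cluster-a")  # duplicate event, coalesced while queued
controller.on_event("druid", "cluster-b")
processed = controller.run_until_empty()
print(processed)
```

Working with keys instead of full objects is what lets several events for the same resource collapse into a single reconcile, which is the behavior a work queue is meant to provide.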
Bart Farrell 46:45
Very good. We kind of touched on this in different areas, but there is one thing we try to ask all of our speakers, and particularly someone who has seen the full range of challenges. One question is what running data on Kubernetes is and how you do it. But the other question is ‘why do it?’. There are different reasons. One of them is to have one stack for everything, kind of like one stack to rule them all, so as not to have things in different places. Some will also argue for other reasons, such as the security of certain kinds of data, whether that’s financial, health care, or government; there are different ways of looking at it. But going back to the question: Is Kubernetes ready for running stateful workloads? And if not, what are the reasons that are stopping that? Is it a talent, cost, or technology question? What do you think is the primary thing that’s stopping more organizations from running stateful workloads on Kubernetes?
Adheip Singh 47:44
Of course, I will be a bit biased towards Kubernetes in answering this. Kubernetes has matured in a way that most of the security concerns can be taken care of, if you have a proper security context and all your network policies managed in a particular way. What I feel can be a pain is that, though we are building operators and controllers to simplify certain things, they also add a level of complexity. It’s like ‘I just learned Helm and now I also need to use an operator, and I need to package the operator with Helm’. It builds and builds on itself. But at the end of the day, what we gain from such things is huge. I believe that running Big Data pipelines on Kubernetes is a bit complex, but in the long run it is easier once you understand it. Scaling is literally one of the best things I’ve seen. You can just change the replicas from three to four, and the pods come up pretty fast and are able to serve requests. The same thing on VMs might take a bit of time. So, those are the things which I feel… And that’s why at Rill we are 100% on Kubernetes. We don’t want to have some stacks running on VMs and some on Kubernetes. Plus, yes, the talent and the costs definitely add to it. I will say it’s a niche domain. And it’s more of a depth than a breadth domain when it comes to writing controllers and operators.
Bart Farrell 49:48
Nothing goes wrong. Those are all very good points. And as you said, for some folks, it might be a little bit challenging right now. You’ve really got to be thinking about this in a long-term way. Consider, for example, what you mentioned about StatefulSets: StatefulSets are now about five years old in terms of when they arrived on the scene for Kubernetes. I think that was in 2015 – 2016. And there’s some debate as to whether or not they’ve achieved the impact that they were designed to have. I’m just thinking about the future. Do you imagine that there are going to be more managed data services specifically for Kubernetes? What kind of features do you think we might be able to anticipate based on the kind of things that are going on right now?
Adheip Singh 50:32
I haven’t looked into where the Kubernetes StatefulSets development or the key APIs are going, but something I’d definitely like to have is handling of PersistentVolumeClaims in a much easier way, rather than doing cascade deletions and hacking our way towards getting things upgraded. Of course, if StatefulSets are not allowing us to do certain things, that’s there for a reason; it’s not like it was just built that way. But the point is, for our use cases or some edge use cases, Druid needs to perform those tasks, and they don’t have any effect on the application. That’s where you want to make sure that things go in a certain fashion. So, that’s it from, I guess…
Bart Farrell 51:30
That’s good. Once again, we’re not gonna quote you on this and say ‘Well, Adheip told us this is what’s gonna happen’. No, we’re not here to do that. But the thing is, we just want to get these conversations out in the open. And as a practitioner, someone who, we can kind of say, has probably learned how to do this the hard way, what have been the things that have been helpful for you along the way in navigating the challenges of running Big Data pipelines on Kubernetes?
Adheip Singh 51:58
If anybody wants to start understanding this domain, I’ll say, go to the sample-controller repo on Kubernetes, which will help you get started on how controllers work. Don’t jump directly to the SDKs, because there are times when you really need to know how the informers and the queues work together. So yes, that’s pretty helpful. Running Big Data on Kubernetes is a very fun way to… it has its own charm. There are a lot of operators, so just start looking into the operators’ code bases. Spark has its own, Strimzi for Kafka is there, the Banzai Cloud operators are there. See the Druid operator and try to get into the community. You can easily get started with data on Kubernetes with a good grasp of the basics of StatefulSets, how Kubernetes creates pods and deployments, and how you can extend Kubernetes. All the things which I explained should be good to start with.
Bart Farrell 53:14
Follow Adheip’s advice! It’s very simple, the easiest. This is meetup number 91; I had to check because sometimes we cancel one, we have to jump ahead and we come back and we do another one. And I’ve never seen slides that are so clear regarding all the different steps. So I really think this is a wonderful gift for different folks. Because a lot of times it’s like – trial and error, but you really put things in order. And so, I really appreciate that because we have different kinds of practitioners in our community; some people have more experience than others, some people learn in different ways. But I think this is really, really well laid out. That being said, can I get you to stop sharing your screen really quickly, so I can share mine?
Adheip Singh 53:53
Bart Farrell 53:55
While we were talking, we have, as always, our partner in crime, who’s lurking in the background of the shadows. Let me know when you can see my screen.
Adheip Singh 54:04
Yes, I can.
Bart Farrell 54:05
He created this wonderful drawing while you were talking, trying to cover all the different sorts of things, and he even drew a Druid where it says ‘Druid Operator’ because we were already conversing in the background. So anyway, it’s just a nice depiction for us as a way to remember all the different things that you touched on. There were a lot of different things that were covered here. I hope if anybody from Apache is watching this, you definitely need to hire Adheip as a trainer because, seriously, it was very well explained. And particularly for a lot of concepts that are overwhelming for a lot of folks, I really appreciate the way in which you explain things. So, that being said, if people want to check you out, follow you on social media, or know more about the company where you’re working, where should they go in particular?
Adheip Singh 54:50
Yes, you can ping me on LinkedIn anytime.
Bart Farrell 54:53
Okay, keep that in mind. Hope to see you in person at some point, whether it’s in Spain, the United States, or in Bengaluru – not Bangalore, say it correctly like local people do. There are 8 million people who live there. I looked it up and I said, ‘What, 8 million people?’. Maybe it’s more or less, it’s just an estimate. But learn how to say things the right way; a little bit goes a long way. Adheip, Shukria – how do I say that in Punjabi? Just so I can get that too.
Adheip Singh 55:20
Bart Farrell 55:22
Oh, Dhanavād. So it’s completely different. That’s cool. So, I learned that today. We all learned something. We learned a ton about Apache Druid, but it’s nice to end with some cultural notes as well. Thank you very much for your time today. I hope we will see you in the future. And folks, if you haven’t signed up for DoK day at KubeCon and DoK Students day, check that out. Contact Adheip on LinkedIn, see all the cool stuff they’re doing at Rill Data. You have a great day. Take care.
Adheip Singh 55:47
Thank you, Bart.
Bart Farrell 55:49
Cheers. Bye, everybody.