Kubernetes is turning into a preferred platform for real-time analytic apps that crunches billions of events per day and returns insights in seconds. In this talk, Robert Hodges, the CEO of Altinity introduces the standard analytic app design pattern of fast event streams coupled with low-latency data warehouses, using open source projects. He walks us through deploying the pipeline on Kubernetes from ingest to end-user access. Touching on the use of operators, scaling, monitoring, upgrade, security, and approaches to adding custom components. He shares concrete lessons about how to stand up low-latency analytics quickly on Kubernetes.


 

Robert Hodges  00:00

In this talk, I’ll discuss how to set up analytic pipelines to data warehouses using Kubernetes. My name is Robert Hodges, and I’m presenting this talk for KubeCon Data on Kubernetes Day 2022. I’d like to shout out to the DoK Community for making this talk possible. I’ve been working on databases since 1983. I’m backed up by database geeks from Altinity Engineering who have centuries of experience with databases and their applications. They were particularly specialists in analytic applications —  applications designed to scan large amounts of data and give answers quickly. As a company, we’re focused on analytic applications built on ClickHouse. We offer a cloud version of ClickHouse called Altinity.Cloud. Further, we’re also authors of the open-source Altinity Kubernetes Operator for ClickHouse. As far as we know, it’s the first Kubernetes operator developed for the data warehouse. Let me frame this talk by just discussing why we want real-time analytics and what purpose they serve. For example, an e-commerce site is a website where we sell goods and deliver them to customers. The standard solution to answering common site questions is building a data pipeline. If we build it right, we can answer questions like “How can I boost the rate of return by posting special offers for products as users walk through the site?”

Another important question that we need to solve or at least provide an answer to as quickly as possible is whether the transaction you just processed was fraudulent or not. To solve this, we’re going to extract the e-commerce transactions. Hence, as people buy things, we’re also going to extract web server event logs and feed them into an event stream which will push them over to a low latency data warehouse. From the data warehouse, we can then feed applications that show operational dashboards (to let people see the current state of the business) and forward information fraud detection applications that can answer the question we just posed a minute ago. 

Now, let’s build pipelines in Kubernetes. If we were going to process the e-commerce transactions and feed them into an operational dashboard, we might create a data pipeline with the following components. We have MySQL as the database, which is used to record e-commerce transactions. In other words, whenever somebody puts something in their basket or buys something, it will be committed as a transaction in MySQL. Those committed transactions are then extracted using a project called Debezium, which can change data captured from the MySQL bin log. Debezium turns around as it extracts each transaction it puts into Redpanda, which is an efficient and fast Kafka-compatible event stream. From Redpanda, the transactions flow into ClickHouse. It is an open-source SQL data warehouse with very low latency and can process large amounts of information. The data in ClickHouse is used to populate operational dashboards that business analysts can look at to find out the current state of the business. Basically, within a few seconds of what’s happening on the e-commerce website. 

Now, let’s proceed with how to build this pipeline. You’re going to deploy it one piece at a time. Starting with a linchpin, which is the data warehouse. We will go to our Kubernetes cluster, create a namespace called ClickHouse, and then run the following procedure. ClickHouse requires ZooKeeper to control a cluster. ZooKeeper keeps track of information on what needs to be replicated between different server replicas. We will run a script that sets it up that posts the deployment into Kubernetes. Once that’s done, we will run another script that will install the ClickHouse operator. The ClickHouse operator defines a custom resource definition that can be used to define ClickHouse clusters and installs a container that can process those custom resource definitions or CRDs, once they appear in Kubernetes. Finally, we will create a CRD and apply it to Kubernetes using Kubectl. The content of the CRD will get passed to the operator and turned into, which will then implement as a data warehouse in Kubernetes itself. 

Let’s look at the custom resource definition in case you haven’t seen one. As presented, this is an example for ClickHouse. It’s a relatively simple file that defines things such as the topology of the cluster, number of shards, and replicas where Zookeeper is located. Further, even though we can’t see it in the slide, additional sections define things like pods, the version of the server that we are running, the memory/CPU it requires, and the storage configuration. This may seem complicated, but when the operator processes this, it will allocate 60 to 70 actual resources out in Kubernetes. It’s an enormous simplification of setting up this database (CRD for the data warehouse). 

 

In the full pipeline, we can see ClickHouse, which is set up; we also notice that we went through similar and varying sets of steps. There are three different ways that pieces of the pipeline get set up. We can run a script, as we did for the Zookeeper deployment and ClickHouse operator. We can also run a Helm chart as we did for the Zookeeper deployment, ClickHouse, and MySQL operator. Finally, for anything with an operator, we will create a custom resource definition to define that component (be it MySQL, Redpanda, or ClickHouse) that will get processed then the component will get implemented in the Kubernetes. 

 

When building these pipelines, one important detail is how to integrate the event stream, which can often be processing millions of events per second, and how to get that data efficiently into your data warehouse. Until now, common approaches for ClickHouse have been to build custom applications that will grab the data, do application-specific transformations and stick it to ClickHouse. Another way is to have ClickHouse itself directly read from the Kafka-compatible event stream. It uses a ClickHouse Kafka Engine, which can take a Kafka topic, make it look like a table, and then it intervals which pull data from that topic. Then, it populates an actual table inside the ClickHouse. Now one may ask, what are the other options? When we want to have a lot of producers and consumers, a powerful model we can use is Kafka Connect. It is a protocol backed by open source libraries implemented by Debezium, among others. When Debezium grabs something out of the bin log, it turns it into a standard format defined by the Kafka Connect protocol. The format is then put into Redpanda, and it can be read out by another connector, which can interpret it. For example, the data types and records apply to the ClickHouse. This powerful technique gives you interoperability with a wide range of producers and consumers. 

One of the things that we’ve been working on at Altinity is to build an actual sink connector, as we call it for ClickHouse, which can move things from the Kafka Compatible Event Stream into ClickHouse. One final note, the current practice in the example that we began with, we created the entire pipeline inside Kubernetes. That’s not how people do this. If you go out to production systems, you’ll find that there’s a mix of managed services and Kubernetes. For instance, we could run MySQL on Amazon RDS. We could have Confluent controlling the Kafka Compatible Event and ClickHouse itself running in the Altinity. Cloud managed service. Anything else that doesn’t fit in these three (MySQL, Kafka Confluent, and Altinity. Cloud managed ClickHouse) can just go into a Kubernetes cluster, for example, Debezium, Grafana, and any custom applications. 

Interestingly enough, Kubernetes itself is increasingly used as a managed application. In this case, we use Amazon EKS, which runs the cluster, so we don’t have to bother with administration. If you’d like more information, check out the Altinity Sink Connector for ClickHouse. This is a new project that we’re developing. You will also find the full documentation for what I’ve described here and the procedure for creating the entire pipeline with all its details inside Kubernetes. 

Thank you so much for listening to this talk! If you have further questions about analytic databases, data pipelines, and Kubernetes, feel free to contact Altinity. Cloud or send an email directly at [email protected]

 

Bart Farrell  10:35

We talked a lot about operators. In the case of Altinity and your process of building a ClickHouse operator, what are some of the learnings you got from that? Also, for other folks that might be out there thinking about building an operator, what advice can you give them?

 

Robert Hodges  10:58

I’ll give you two lessons. One is it works. I did a previous talk with Bart and explained how I had this huge argument when I got my current job, which I didn’t think Kubernetes would work for data. However, it turns out that the operating model is awesome because you have these complex topologies, and you can reduce them to a single file. Then, it gets you a well-thought-out deployment inside Kubernetes that can do upgrades and rescaling. It’s an important lesson because if you have a database, you want to be using an operator to use it. If you’re planning to maintain databases, you want to write one that does what you want. 

 

A concrete operator and lesson important for us are that stateful sets are inadequate for managing data warehouses. The stateful set has a cool idea where you will have persistent names and a number of replicas that will be automatically maintained for you by Kubernetes. However, we had a debate at the beginning; should we have a stateful set for the entire cluster or even individual shards of the large clusters? The answer is no. In Kubernetes and, more particularly, data warehouses, you often need to do things like having nodes and different configurations operate at different versions. Stateful sets can’t handle this. This was a core learning for us and put us on this path where we have a stateful set per node that allows us to make the name predictable and have persistent storage attached to it. We don’t lean on that part of the Kubernetes model; instead, we build on top of the persistent, coherent naming and storage we get from the stateful set. 

 

Bart Farrell  13:01

What do you expect to be happening, and we’ll be talking about, let’s say, a year from now when it comes to these paradigms and patterns, such as operators running data on Kubernetes?

 

Robert Hodges  13:26

I think we’re going to see a lot. One of the things that I’m seeing, and I think people will be talking about more, is how they can use Kubernetes to get to multi-cloud operations. The important thing about data is that people want the ability to run databases where they need them. Those can be everywhere, from running up in the cloud to on-prem. This portability story remains to be not fully played out. As Kubernetes becomes more ubiquitous, people are going to start thinking, “Hey, I have a database, and it’s got an operator; I’d like to run it where I need to run it.” I hope this is a conversation that we begin to have to stop thinking in terms of things being closed services, but instead having things like Postgres, Redis, Cassandra, or ClickHouse, I can stand up provision, and manage in the environment, or your specific environment that I need; be that controlled by a vendor done in their VPC or in a VPC that I control directly. I think there will be an interesting conversation, and we’re doing our best to start and get it going. It’s also something I hope I hear more about in the next year.

 

Bart Farrell  14:46

You may want to check out the live stream that Robert did with us recently. We talked about things regarding interoperability and portability. These are terms you will be hearing more about when we do our in-person meetup in San Francisco in July. Thank you so much, Robert! Enjoy the rest of the Percona.