In this talk, OpsVerse co-founder Arul Jegadish explains why they decided to switch Jaeger’s storage backend from Cassandra to ClickHouse, and why they chose to run it on Kubernetes. He also shares tips and tricks for migrating and running it without problems.
Bart Farrell 00:00
All right, the stage is yours, my friend, go for it.
Arul Jegadish Francis 00:06
Good morning, everyone. I am Arul Jegadish, one of the co-founders of OpsVerse. We are a startup that offers an end-to-end DevOps tools platform. Today I’m going to talk about a project we took up where we migrated some of our data workloads to ClickHouse. For context, one of the components in our tools platform is a full observability stack, and one of the modules in that stack is distributed tracing. For distributed tracing, we use Jaeger, and Jaeger has different storage options. To begin with, we used Cassandra as the storage option, and then later migrated to ClickHouse. So this talk is about why we did that and how we did that. First of all, a little bit about Jaeger, in case you haven’t heard of it: it’s an open-source tool for managing distributed traces. It came out of Uber, and it’s part of CNCF now. It’s one of the CNCF incubated projects, it’s very stable and widely adopted. With Jaeger, if you have a distributed system with several microservices, you can get a full, unified view of how a request goes through your system. So that’s about Jaeger.
A little bit more about the architecture of Jaeger: you begin by instrumenting your applications to emit traces. Nowadays that’s usually done with OpenTelemetry libraries. Then you collect those traces and send them to the backend, and Jaeger processes them and finally stores them in the Jaeger DB. The common storage options are Elasticsearch and Cassandra. In our case, we started off with Cassandra as the storage engine, because we didn’t want to run a bunch of Elasticsearch clusters. Cassandra served us really well, until we wanted to do more with the data: we wanted to do a lot of analytics on the trace data that came into our backend. So then we started looking at other options.

Cassandra has some advantages and disadvantages. With Cassandra, the ingestion was really fast. We were using Kafka, and the lag on Kafka was almost zero, so the ingestion was really good. On the disadvantages side, we couldn’t do much analytics on the data, and that was the main requirement for us. The other disadvantage is resource usage. With Cassandra, the storage, memory, and CPU usage was on the higher side, and we were running this at scale, so we were looking for a more optimized solution.

That’s when we looked at ClickHouse. With ClickHouse, the analytics capabilities were really good, and we were able to do a lot of analytics on the data that we collected with Jaeger. Some of the storage optimizations in ClickHouse were also really important for us. And some of the weaknesses of ClickHouse didn’t really affect us for this use case, because tracing data is pretty much immutable once you collect it; you’re going to delete it only when it is past the retention period. So that was pretty much okay for us.
Now, another aspect of using ClickHouse. Jaeger has a plugin mechanism, built on gRPC, to support different types of storage, so we basically used that. There is a plugin published by the community called jaeger-clickhouse that supports ClickHouse as a backend for Jaeger. So we were able to make use of that and make Jaeger write data to ClickHouse.

At a high level, these are some of the steps that we followed. The first thing is to run ClickHouse on Kubernetes, and for that we use the Altinity ClickHouse operator. The operator has a lot of community support, and it pretty much worked out of the box for us. The Altinity team has done a really good job in bringing out that operator. So if you are going to run ClickHouse on Kubernetes, that’s a great choice; we run a lot of production-grade workloads using that operator. The other aspect is that we use Argo CD to run our entire platform. So in this case, the steps are really about making changes to the Argo CD configs, swapping Cassandra out and bringing ClickHouse in, and bringing up ClickHouse instances through the ClickHouse operator’s custom resources. And then it’s just about changing the Jaeger configuration to use ClickHouse as the backend. We also had production data in Cassandra, so we had to migrate that data to ClickHouse. For that, we used our own custom script. How we accomplished that data migration is something I can talk about in detail offline.

Now some best practices and some things we noticed when we did this migration. First of all, the ClickHouse operator ends up creating a load balancer by default, which is something you might not want. Your ClickHouse is something you will access only from within your cluster most of the time. So that’s a setting you might want to change before you go to production.
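To make the load balancer point concrete, here is a minimal Python sketch that builds a ClickHouseInstallation manifest with a service template of type ClusterIP. The talk does not show the actual configuration, so the installation name, template name, cluster layout, and ports are assumptions for illustration, and the field layout should be checked against the operator documentation for the version in use.

```python
import json


def build_clickhouse_installation(name: str = "jaeger-traces") -> dict:
    """Return a ClickHouseInstallation manifest exposed only inside the cluster."""
    service_template = {
        "name": "internal-only",  # hypothetical template name
        "spec": {
            "type": "ClusterIP",  # cluster-internal service instead of a load balancer
            "ports": [
                {"name": "http", "port": 8123},  # ClickHouse HTTP interface
                {"name": "tcp", "port": 9000},   # ClickHouse native protocol
            ],
        },
    }
    return {
        "apiVersion": "clickhouse.altinity.com/v1",
        "kind": "ClickHouseInstallation",
        "metadata": {"name": name},
        "spec": {
            # Make the template above the default for this installation.
            "defaults": {"templates": {"serviceTemplate": "internal-only"}},
            "configuration": {
                "clusters": [
                    # Assumed single-shard, single-replica layout.
                    {"name": "default", "layout": {"shardsCount": 1, "replicasCount": 1}}
                ]
            },
            "templates": {"serviceTemplates": [service_template]},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_clickhouse_installation(), indent=2))
```

The printed JSON can be committed to Git as a manifest (Kubernetes accepts JSON as well as YAML) and synced like any other resource.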
The other thing is you may want to create a read-only user, because if you are going to visualize your ClickHouse data in some visualization tool like Superset or Grafana, having a read-only user helps. So you may want to change the configuration to create a read-only user. Similarly, there are some basic settings you may want to change, like the retention period, disk space, and resource requests and limits. The other important thing here is that we were using Argo CD, and Argo CD is also a controller. Now you have the ClickHouse operator, which is also controlling some resources. Often these two controllers get into a conflict where both try to manage the same resources, especially PVCs and PVs. So you may want to take care of that; otherwise, there could be some unpleasant surprises where Argo might try to delete your PVC. Before you go to production, this is something you really need to take care of.
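For the read-only user mentioned above, a minimal Python sketch of the settings that could go under `spec.configuration` of a ClickHouseInstallation is shown here. The user name, password, and network range are hypothetical, and the `user/key` path style is an assumption about how the operator maps settings into ClickHouse user configuration, so treat this as a starting point rather than the speaker’s actual setup.

```python
import hashlib


def readonly_user_settings(user: str, password: str, allowed_cidr: str = "10.0.0.0/8") -> dict:
    """Build profile and user settings for a dashboard-only ClickHouse account."""
    password_hash = hashlib.sha256(password.encode("utf-8")).hexdigest()
    return {
        # readonly = 1 restricts the profile to read queries.
        "profiles": {"readonly/readonly": 1},
        "users": {
            f"{user}/profile": "readonly",
            # Store a hash rather than the plain-text password.
            f"{user}/password_sha256_hex": password_hash,
            # Assumed in-cluster address range; adjust to the real pod network.
            f"{user}/networks/ip": [allowed_cidr],
        },
    }


if __name__ == "__main__":
    settings = readonly_user_settings("dashboards", "change-me")
    for section, values in settings.items():
        print(section, values)
```

The visualization tool would then connect with this account, so a mistyped query in a dashboard cannot modify or drop trace data.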
The next step is really about making sure you’re backing up your data and monitoring your instances. For backup, again, there is a great open-source tool, clickhouse-backup; currently, we run it as a Kubernetes cron job. Interestingly, this is also becoming a custom resource as part of the ClickHouse operator, which makes it even easier to create backups of ClickHouse instances. Secondly, you can have ClickHouse expose Prometheus metrics, and then you can scrape them. Once you scrape the metrics, you can create dashboards and alerts. We have this done, and you can also get it done fairly easily with these open-source tools.

Once you have all this set up, you can query your data and create really good dashboards. All you have to do is connect your ClickHouse to one of the visualization tools: Apache Superset, Redash, and Grafana are all great choices. All these tools have out-of-the-box connectors and data sources available to connect to ClickHouse. So you just have to connect that, start writing queries, and create really good dashboards. In our case, the Jaeger data model was pretty flat; it wasn’t really optimized for querying. So we used ClickHouse materialized columns to create a few more columns, which made it easier for us to query the data and also improved our query performance.

I’m going to skip the demo here. If you want to look at a demo, I’ll be happy to show that offline. Basically, what this allowed us to do is give our customers a really good APM solution on top of open-source Jaeger. Open-source Jaeger is really good: it collects traces, and you can view your traces. But there are no analytical capabilities. So we took the data to ClickHouse, and then we created an APM on top of that data. That’s what we were able to do by adopting ClickHouse.
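The materialized-column idea can be sketched as follows. This small Python helper only generates the ClickHouse DDL; it assumes the span is stored as a JSON string in a column, and the table name `jaeger_spans_local`, column `model`, and key `operationName` are illustrative assumptions, not the schema the speaker used.

```python
import re

# Only allow plain identifiers so the generated DDL cannot be injected into.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def add_materialized_column_sql(table: str, column: str, json_column: str, json_key: str) -> str:
    """Return DDL adding a String column computed from a key in a JSON column."""
    for ident in (table, column, json_column):
        if not _IDENTIFIER.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    if "'" in json_key or "\\" in json_key:
        raise ValueError(f"unsafe JSON key: {json_key!r}")
    return (
        f"ALTER TABLE {table} "
        f"ADD COLUMN IF NOT EXISTS {column} String "
        f"MATERIALIZED JSONExtractString({json_column}, '{json_key}')"
    )


if __name__ == "__main__":
    # Hypothetical table, column, and key names.
    print(add_materialized_column_sql(
        "jaeger_spans_local", "operation_name", "model", "operationName"
    ))
```

ClickHouse computes a materialized column at insert time, so dashboards can filter and group on `operation_name` directly instead of parsing JSON in every query.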
I mean, that’s what we have to share here today. So if you have any questions, connect with us, connect with me on Twitter, find me on LinkedIn, or find me here. I’ll be happy to chat more about this topic. I would like to thank the organizers for giving us an opportunity to present here. Thank you.
Bart Farrell 10:45
Awesome. We do have time for a question. Do we have any questions from the audience? Anybody got a question? Let’s see. Well, with that, then we do have a question. Awesome. Gives me an excuse to run; exercise is good. Thank you.
Thank you. So you kind of mentioned, you know, contention between the Argo CD operator and the ClickHouse Operator, can you talk a little bit about what exactly was going on?
Arul Jegadish Francis 11:09
Could you please repeat the question?
You mentioned a little bit about the Argo CD operator contending with the ClickHouse operator. Can you talk a little bit about what you saw?
Arul Jegadish Francis 11:20
So what happens is, with the ClickHouse operator, the Kubernetes resource that you create is a custom resource called ClickHouseInstallation; that’s the kind that you will create. Argo CD will manage that, meaning Argo CD will take the configuration from Git, apply it to Kubernetes, and manage it. Once that resource is applied to your Kubernetes cluster, the ClickHouse operator running on that cluster will look at that resource and do a few things: it’ll launch pods, it’ll create PVCs, and all that. Now, if you don’t set the right annotations and labels to be propagated down, Argo will start thinking that these are extraneous resources: “There is a PVC here that is not defined in Git, so I should go delete it.” If you have auto-pruning and self-heal set up, then Argo will actually try to delete your PVC. So if you don’t take careful action, that can happen. Before you go to production, make sure you have tested this and that Argo and this operator work well together. This happens even with other operators set up with Argo; you have to ensure the right annotations and labels are propagated.
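The speaker does not name the annotations involved. One commonly used option, sketched below in Python, is to mark operator-created resources with Argo CD’s `Prune=false` sync option and `IgnoreExtraneous` compare option. The PVC name is hypothetical, and whether these annotations should be set on the volume claim template so the operator propagates them is an assumption to verify in a test cluster.

```python
import copy

# Argo CD annotations that disable pruning and ignore resources not defined in Git.
ARGO_PROTECTION = {
    "argocd.argoproj.io/sync-options": "Prune=false",
    "argocd.argoproj.io/compare-options": "IgnoreExtraneous",
}


def protect_from_pruning(manifest: dict) -> dict:
    """Return a copy of the manifest carrying the Argo CD protection annotations."""
    protected = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    metadata = protected.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations.update(ARGO_PROTECTION)
    return protected


if __name__ == "__main__":
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "clickhouse-data-0"},  # hypothetical PVC name
    }
    print(protect_from_pruning(pvc)["metadata"]["annotations"])
```

With these annotations in place, a sync with pruning enabled should skip the annotated resource rather than delete it, which is the failure mode described above.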
Okay, so about the pros and cons that you listed: you identified some cons with Cassandra, and you went with ClickHouse. Did you explore any other options? Any other solutions that you considered other than ClickHouse?
Arul Jegadish Francis 13:04
A couple of others. Initially, there was even a time when we thought about using S3 as our backend, because we were thinking about saving costs and all that, but it was not a very good option. Jaeger supports S3 storage, but we didn’t go down that path; it’s not the right solution for analytics. Druid is also another option. We didn’t do a detailed POC there. ClickHouse worked really well for us, and the team also had prior experience running ClickHouse, so we went with it. Druid is another option that you can explore for analytics workloads.