Singapore-based property search engine startup 99.co provides renters, buyers, and agents with a fast and efficient property search experience. At the core of its platform, used by 20 million users a month across Southeast Asia, is an Elasticsearch cluster. The engineering team decided to move it to Kubernetes after growing frustrated with the lack of agility of their former infrastructure, which had caused outages. Learn about the why, the how, and the lessons learned from running a production Elasticsearch cluster on Kubernetes.

This talk was given by Bybit Senior Backend Engineer Gregorius Marco as part of DoK Day at KubeCon NA 2021; watch it below. You can access the other talks here.

Gregorius Marco: Hi, everyone, good afternoon to those in LA or the US. My name is Gregorius Marco. In this talk, I am going to share my experience of migrating and running Elasticsearch on Kubernetes. Let's start with who I am. I am Indonesian but based in Singapore. Currently, I am working as a senior backend engineer at Bybit, a cryptocurrency trading platform; if you do any derivative crypto trading, you may have heard of it. Previously, I was working as a senior DevOps / backend engineer at 99.co, a Singapore-based startup and currently the largest property portal in Singapore and Indonesia. This is where I did the migration of Elasticsearch to Kubernetes during my last days there.

Let me introduce 99.co a little more. This is the homepage of the 99.co website. It is like Zillow in the US: property buyers and renters can search for their homes. When you search, you can filter by the nearest MRT (subway) stations, and you can choose multiple stations, or search by street, area, town, school, or travel time from a certain location. After you have selected your filters, this is how the "search results" page looks. For example, if you search for listings around the Orchard area, which is the downtown area of Singapore, you see the list of listings on the left and their locations on the map on the right. It is pretty intuitive and very user-friendly. For 99.co this is the most critical page on the entire website; most users will land here. This feature is the bread and butter of 99.co. To support it we need a search service. We run on microservices, and the search service itself is backed by Elasticsearch, so we need a very reliable and stable infrastructure to run it. So, let me briefly introduce you to the search service, the application side.

On the right-hand side, you can see that there are two kinds of processes: the "search service web" and the "search service indexer". These two containers run in Google Kubernetes Engine. The search service web provides an API to query Elasticsearch so that users can search for their respective listings; during peak hours it can serve up to 300 QPS, whether from users or from other microservices. The search service indexer's main purpose is to index documents into Elasticsearch. Now, if you look closely, you will see that the search service is running in Kubernetes, but Elasticsearch itself was previously running in Docker containers on separate virtual machines in Google Compute Engine.
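To give a feel for what the search service web sends to Elasticsearch, here is a sketch of a listings query: a full-text match filtered to a radius around a location, much like the map-based MRT search described above. The index name, field names, and coordinates are illustrative assumptions, not 99.co's actual schema.

```shell
# Hypothetical listings query: match "condo" in the title, restricted
# to listings within 1 km of a point near Orchard, Singapore.
curl -s -X GET "localhost:9200/listings/_search" \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must":   { "match": { "title": "condo" } },
      "filter": {
        "geo_distance": {
          "distance": "1km",
          "location": { "lat": 1.3039, "lon": 103.8318 }
        }
      }
    }
  }
}'
```

The `geo_distance` filter requires the `location` field to be mapped as `geo_point` in the index mapping.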

So, what is the issue with this kind of architecture? There are a few, but the one I want to highlight is that performing a manual rolling upgrade was a risky operation. What is a rolling upgrade? For example, say you want to increase the RAM or CPU of the machines that host ES (Elasticsearch); you have to perform a rolling upgrade to minimize the downtime of your Elasticsearch cluster. Likewise, if you want to upgrade the ES version, you need a rolling upgrade. In theory, the big steps are: shut down your Elasticsearch node, which is the container, then shut down the machine that hosts it, then make your changes, such as increasing the CPU. After that, you start it back up, wait for the node to rejoin the cluster, and then everything is back to normal. Let me share a real story that happened to me, a personal experience of performing a rolling upgrade where a major problem occurred. At the time, I had to increase our ES machines' CPU, so I decided to do it at 3 a.m., because if anything bad were to happen, at least the impact would not be that high; after all, fewer users search for homes at 3 a.m. Back then we had three Elasticsearch nodes on three different VMs (virtual machines). I started by shutting down the ES container on machine A. What I should have done next was go to the GCP console and shut down machine A, but what ended up happening was that I shut down machine B. So now we had only one ES node running, which was a very bad situation. I got a lot of Slack alerts about the availability of the search service, and when I checked the Elasticsearch cluster health, it had turned red when it should have been yellow at worst. I was close to a panic attack, and the "search results" page was down.
I had to recover everything first: start up all the nodes and make sure everything was running properly. What seemed like a very simple operation, just increasing machine CPU, became a literal nightmare for me at 3 a.m. The outage lasted thirty or so minutes; then I had to do the rolling upgrade properly, being extra cautious this time. This is what motivated our move to Kubernetes. Before we jump into how we moved, let me outline how you do a manual rolling upgrade. These are the steps documented by Elastic themselves. First, you SSH into the ES machine. Next, you curl two endpoints on your ES cluster to disable shard allocation and perform a synced flush. Then you shut down the ES container, shut down the machine, perform the necessary upgrade, and start the machine and the ES container back up. Then you call the endpoint in step eight to roll back the settings from step two and re-enable shard allocation. Finally, you monitor your cluster health until it becomes green again, before repeating all of this for every other machine. And while watching all of it, you hope there is no disaster. I had one major disaster. This is when we started planning to move to Kubernetes.
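The per-node steps above can be sketched roughly as follows, assuming the ES HTTP API is reachable on localhost:9200 (the synced-flush endpoint applies to ES 7.x and earlier; on newer versions a plain flush is used instead):

```shell
# 1. Disable shard allocation so shards are not rebalanced off the node.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

# 2. Perform a synced flush to speed up shard recovery (ES 7.x and earlier).
curl -X POST "localhost:9200/_flush/synced"

# 3. Stop the ES container, stop/resize the VM, then start both back up.

# 4. Once the node has rejoined, re-enable shard allocation.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{ "persistent": { "cluster.routing.allocation.enable": null } }'

# 5. Wait for the cluster to go green before moving to the next node.
curl "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s"
```

The mistake described above, shutting down the wrong VM between steps 3 and 4, is exactly the kind of manual misstep this sequence leaves room for.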

So just like moving our microservices from containerized applications to Kubernetes before, we were motivated by a few reasons. First, it helps us meet the SLA, because the search feature is very important on our website. It also means less stress for the engineering team in managing the ES cluster, and it comes with automatic monitoring from Kubernetes and dynamic storage provisioning: all the good stuff of running on Kubernetes.

Let me introduce ECK, which is short for Elastic Cloud on Kubernetes. It is a Kubernetes operator, built on the operator pattern, written natively by Elastic. Out of the box it lets you do things like managing and monitoring multiple clusters, upgrading to new specs and versions with ease, scaling your cluster capacity, changing your cluster configuration dynamically, scaling local storage, and scheduling backups. All of this comes out of the box with ECK. As I mentioned, it uses the operator pattern, so let me tell you a little about the operator pattern and CRDs (CustomResourceDefinitions) and how they are used here. An operator is a pod that continuously monitors the current state of your underlying custom resources, and makes changes to reach your desired state; these custom resources are something you deploy from a CRD definition. The next slide will help you visualize what a custom resource and an operator are. As users, we deploy two things. The first is the controller, the operator pod itself: for ECK, Elastic provides the operator pod and image, so we just download the operator YAML file and the CRD files from their website and apply them once in our cluster. The second is a single YAML file, the custom resource, that describes our Elasticsearch cluster; in this case the custom resource is analogous to the Elasticsearch YAML file itself. Then, at number two on the slide, the controller keeps monitoring the custom resource: the Elastic operator continuously monitors the Elasticsearch custom resource.
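Installing the operator and CRDs is the "apply it once" step described above. A sketch, assuming a recent ECK release (the version number below is illustrative; check Elastic's docs for the current one):

```shell
# Install the ECK CustomResourceDefinitions and the operator itself.
kubectl create -f https://download.elastic.co/downloads/eck/2.4.0/crds.yaml
kubectl apply  -f https://download.elastic.co/downloads/eck/2.4.0/operator.yaml

# The operator pod runs in the elastic-system namespace; its logs show
# what it is reconciling.
kubectl -n elastic-system get pods
kubectl -n elastic-system logs -f statefulset/elastic-operator
```

After this, the cluster understands `kind: Elasticsearch` resources, and everything else is driven by the one custom resource file mentioned above.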
By monitoring it, the operator can manage the native Kubernetes resources: StatefulSets, ConfigMaps, Services, Pods, and so on.

It automatically translates the custom fields from the custom resource and creates the StatefulSets, Pods, ConfigMaps, Secrets, and everything else you would otherwise have to set up yourself without the operator pattern. This is very handy, as the end user does not need to write many YAML files just to get the application running. Now let's move on to the actual migration, which took around two months of work. This is the outline of the steps I followed. It started with research and planning, then I needed to do some application code upgrades to the search service code itself. Then we just needed to write one YAML file, Elasticsearch.yaml, which is the Elasticsearch manifest itself. This was the easiest part of the entire migration process, because all the heavy lifting is done by the operator pattern and the CRDs (CustomResourceDefinitions).
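That single Elasticsearch.yaml can be quite small. A minimal sketch of such a custom resource, with an illustrative cluster name, version, and storage size (not 99.co's actual values):

```shell
# Declare a three-node Elasticsearch cluster as one custom resource;
# the ECK operator creates the StatefulSet, Pods, Services, Secrets,
# and PVCs from it.
cat <<'EOF' | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: search
spec:
  version: 7.17.0
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
EOF
```

Re-applying this file with changed fields (more CPU, a newer version, a higher node count) is all it takes to trigger the operator's automated rolling upgrade.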

Next we had to index the new documents into the new Elasticsearch cluster on Kubernetes, then do data discrepancy checks between the old cluster and the new one, and finally test it with real traffic. I won't go into detail about how to run Elasticsearch itself on Kubernetes with ECK (Elastic Cloud on Kubernetes), because that is already well documented on the Elastic website and you can just follow it from there; it's very easy to run your own Elasticsearch cluster. This diagram shows the current search service architecture with ECK. At the bottom, we still have the search service web and indexer processes, and now they call the Elasticsearch Kubernetes Service object. This Service is exposed so that our microservices, i.e. the search service in this case, can call it, and the request then goes to each pod of our Elasticsearch cluster. Node A and node B each host their own Elasticsearch Pod, and these Pods are managed by the StatefulSet, which in turn is managed by the Elastic operator and the CRDs (CustomResourceDefinitions). Each Pod also has its own PVC, i.e. PersistentVolumeClaim, which is automatically managed by the operator, and each PVC claims its storage from a PersistentVolume, also managed by the operator. So everything is managed by the Elastic operator, and we as users can treat it as a black box; we do not have to worry so much about all the configuration as long as we write our Elasticsearch manifest file properly. This is what happens when we automate rolling upgrades with ECK. Back when I was at 99.co, this log appeared whenever we typed something like kubectl apply -f elasticsearch.yaml.
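Although the operator manages these objects as a black box, you can still inspect each layer of the architecture described above with ordinary kubectl commands. A sketch, assuming a cluster named "search" (an illustrative name; ECK derives the resource names and labels from it):

```shell
# Health/phase summary from the Elasticsearch custom resource itself.
kubectl get elasticsearch search

# The StatefulSet and Pods the operator created for this cluster.
kubectl get statefulset,pods \
  -l elasticsearch.k8s.elastic.co/cluster-name=search

# The PVCs each pod claims its storage through.
kubectl get pvc -l elasticsearch.k8s.elastic.co/cluster-name=search

# The Service object the search service web/indexer call.
kubectl get service search-es-http
```

The `<name>-es-http` Service and the `elasticsearch.k8s.elastic.co/cluster-name` label follow ECK's naming conventions.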

In the first bold text, you can see "disabling shards allocation", which is exactly what I mentioned earlier for the manual rolling upgrade. Then it requests a synced flush (although that gets ignored, which is okay), and then it does "deleting pod for rolling upgrade". These logs repeat for all the pods that are currently running. You can see this kind of automation is very useful, especially if you are running many Elasticsearch nodes: doing a manual rolling upgrade across all of them would be quite tedious and dangerous (in my experience, at least). With this automated rolling upgrade, we can always be very safe about managing our Elasticsearch cluster, because automation, or robots, are better than us. I'd like to end my talk with my take on running Elasticsearch on ECK (Elastic Cloud on Kubernetes). Some of the pros, for me personally: it simplifies Elasticsearch deployment a lot. As you can see with the rolling upgrade example earlier, it automatically brings down one pod and spins it back up before spinning down another, which keeps the cluster highly available. Next, all the operational cluster administration tasks, like creating the ConfigMaps, Secrets, and whatever else is needed to run Elasticsearch, are automated.
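A sketch of triggering and observing such an automated rolling upgrade, assuming an ECK-managed cluster named "search" and a local manifest file (both names are illustrative):

```shell
# Edit the manifest (e.g. bump spec.version or resource requests),
# then re-apply it; the operator detects the spec change.
kubectl apply -f elasticsearch.yaml

# Watch the operator cycle pods one at a time while the rest serve traffic.
kubectl get pods -w -l elasticsearch.k8s.elastic.co/cluster-name=search

# The operator's own logs show the "disabling shards allocation" /
# "deleting pod for rolling upgrade" messages discussed above.
kubectl -n elastic-system logs -f statefulset/elastic-operator
```
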

Third, it is natively supported by Elastic, so I can trust them: no team can write this operator better than the company that built Elasticsearch itself. You should always trust reliable and credible people who write an operator; on the other hand, if an operator is written by someone you have never heard of or who lacks credibility, you should take more precautions. The next pro is that you can easily manage multiple clusters. To spin up another Elasticsearch cluster in your Kubernetes cluster, you just write another Elasticsearch.yaml file and apply it, and the operator will spin up the cluster for you. The operator can manage multiple Elasticsearch clusters at once, and thanks to the Elasticsearch CRD (CustomResourceDefinition) you can also see and monitor them easily with kubectl commands.

The best part is that there are no more human-related errors of the kind I experienced before. I no longer need to perform updates to the Elasticsearch cluster at 3 a.m.; I can do it at any time of day, because I know the rolling upgrade will happen safely. So that was all about the pros.

Talking about the cons, there are two. First, because we are running stateful workloads like Elasticsearch, or any kind of database, which was new to me on Kubernetes, there was a slightly steep learning curve in learning about the operator pattern and CRDs (CustomResourceDefinitions), and how the two manage stateful workloads, which carry a lot of logic, to run the system properly. But once I understood how the operator works and what it can do, it became very easy to appreciate, and it also became easier to understand what other operators do. The other con is the major initial workload: while migrating, you have to do all the extra work that comes with any technology migration.

Overall, we can see that the pros outweigh the cons. Even with the major initial workload, it is better in the long run, for the team as well. It was also a very fun experience to explore this side of Kubernetes and to run stateful applications on it.

Thanks, everyone for tuning in and listening to my talk.