In this talk, we shared our journey in developing Kubernetes Operators to run MongoDB in critical production Kubernetes environments. This presentation offered a deep dive into the lessons learned from building a database operator and from our customers' experience running a database on Kubernetes. We explored the nuanced challenges and innovative solutions encountered while orchestrating MongoDB within Kubernetes' dynamic environment, covering architectural decisions, performance optimization strategies, and the critical role of stateful workload management in Kubernetes.
Speakers:
- Rajdeep Das – Senior Software Engineer, MongoDB
- Sebastian Łaskawiec – Staff Software Engineer, MongoDB
Watch the Replay
Read the Transcript
Speaker 1: 00:01 All right. Good afternoon everyone. Thank you for coming to our talk today. Sebastian and I will be talking about a few of the lessons we have learned along the way in running a database, in our case MongoDB, on Kubernetes. We hope you can take some of the insights we had to learn the hard way into the solutions you are building. A quick introduction for ourselves: I'm Rajdeep. I was the engineering lead of the team responsible for running and managing MongoDB on Kubernetes, and I'm here with my colleague Sebastian. I'll let him introduce himself.
Speaker 2: 00:33 Hi, my name is Sebastian Łaskawiec. I'm based in Poland. I'm a staff engineer working for MongoDB, focusing on the hosted and Atlas operators as well as on-prem deployments.
Speaker 1: 00:48 Alright, for folks who are not aware of what MongoDB is: MongoDB is basically a document database that offers you the flexibility and scalability of storing and querying data however you need. The same database supports different use cases like streams, vector search, and so on. Now, you can take the server binary and run it anywhere you want, from your local machine to a bunch of machines hosted somewhere. We internally broadly classify the deployment models into the following categories to meet the needs of different personas among our customers. The first one is fully managed, where MongoDB offers you a managed database-as-a-service experience in which both the control plane and the data plane are managed by us, running on public cloud infrastructure. Atlas is MongoDB's completely managed offering. The second model is something we call semi-managed, where the data plane is in our customers' data centers, but the control plane, which we call Cloud Manager, is managed and run by MongoDB. The third one, self-hosted, is what we will deep dive into today, where the customers run both the control plane and the data plane in their own data centers or private cloud. We basically provide the container images for our self-hosted control plane, which they can run in their data centers and use to manage the MongoDB deployments.
02:16 To zoom in further into self-hosted, Kubernetes serves as the backbone for both the control and the data plane. To give some historical context: prior to our Kubernetes operators, our users would manually install the control plane first, wire it up, and then configure and deploy the data plane. As you can imagine, this was a series of manual, potentially error-prone steps to manage a database at scale. This was an ideal use case for automating the deployment and management of the database using Kubernetes operators. For those who are not familiar with what operators are, an operator is basically an active reconciliation loop running in your cluster that takes some desired state of the world and converges the current state of the world towards that desired state. We divide the deployment into two parts. The first one is the control plane, which comprises the MongoDB Kubernetes operator and a software component called Ops Manager.
03:11 This is only needed for the enterprise deployments. At a high level, the operator handles the Kubernetes side of things, like scheduling the database pods, doing the upgrades, and binding to volumes, and Ops Manager handles the MongoDB side of things, like performing monitoring, backups, and restores. The data plane, no surprise here, has the MongoDB server and a component we call the agent. The agent is basically a sidecar proxy which runs alongside the MongoDB binary, gets its instructions from the control plane, and ensures that the MongoDB server is running with the same instructions that were configured in the control plane. I will now hand it over to Sebastian to zoom in further into the design.
Speaker 2: 03:57 Going forward with the MongoDB Enterprise Operator architecture, imagine a running Kubernetes cluster with the MongoDB operator deployed in one of the namespaces. Let's take the initial step by deploying an Ops Manager custom resource. Once the Ops Manager CR is discovered by the operator, it creates an AppDB, the backing database for the Ops Manager application. Going into the AppDB pod, it is composed of two containers: a database and an automation agent. What makes this setup interesting is the role of the agent. It is responsible for rendering the entire configuration, which is static and versioned. Each time we would like to change anything in the database, for example create a new role or a new user, the operator creates a new version of the automation config, which is then served to the agents. The agents deploy all the changes into the database. Think of the agent as the maestro of the MongoDB orchestra, with full control over the database, from managing its lifecycle to orchestrating backup and restore processes. As we move forward, the operator deploys Ops Manager. It is a Java-based application focused on the operational aspects of the MongoDB system. It is useful to think about Ops Manager as an administrative panel for MongoDB databases that has additional responsibilities for backups, including all the housekeeping work, monitoring, alerting, and supervising the running MongoDB clusters. Typically, a single instance of Ops Manager manages multiple MongoDB clusters.
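In practice, that initial step is a small custom resource. As a rough sketch (the apiVersion, kind, field names, and versions here are illustrative of the enterprise operator's schema, not quoted from the talk):

```yaml
apiVersion: mongodb.com/v1
kind: MongoDBOpsManager
metadata:
  name: ops-manager
  namespace: mongodb
spec:
  replicas: 1
  version: "6.0.0"                            # Ops Manager application version
  adminCredentials: ops-manager-admin-secret  # Secret holding the first admin user
  applicationDatabase:                        # the AppDB backing Ops Manager
    members: 3
    version: "6.0.5-ent"
```

Once applied, the operator reconciles this CR into the AppDB StatefulSet and the Ops Manager deployment described above.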
05:58 Once Ops Manager is deployed, users typically start deploying MongoDB workloads with MongoDB custom resources. The agent processes download all necessary data from Ops Manager and kick mongod off. Once everything is running, we can expose the cluster for outside consumption. In this example, we are using NodePort services, but using load balancers is yet another popular option. Raj previously talked quite a bit about the control and data plane separation. In our example, we can draw a horizontal line: the yellow part of the slide represents the control plane, whereas the gray one the data plane. So when is thinking about the control and data plane separation useful? Well, it turns out we have very creative customers who started using the operator in a way we hadn't thought about. Some customers built their own internal data platforms. They often expose a way for their internal development teams to create MongoDB clusters on demand.
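A MongoDB workload custom resource can be sketched like this (a minimal, hypothetical example; names and the exact schema vary by operator version):

```yaml
apiVersion: mongodb.com/v1
kind: MongoDB
metadata:
  name: my-replica-set
  namespace: mongodb
spec:
  type: ReplicaSet
  members: 3
  version: "6.0.5-ent"
  opsManager:
    configMapRef:
      name: my-project        # ConfigMap pointing at the Ops Manager project
  credentials: my-credentials # Secret with Ops Manager API credentials
```

The agents in the resulting pods then download their configuration from Ops Manager and start mongod, as described above.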
07:12 This enables them to take the best of two worlds: a centralized control plane with centralized reporting, and creating clusters on demand. This example made us realize a few things. Ops Manager and MongoDB CRs cannot directly reference each other; the reference between them needs to be supplied in a standard form, such as a ConfigMap or a Secret. The MongoDB CRs quite often need to be pre-populated to point to the centralized control plane. This is where opinionated templates come into play, and Helm is just one tool to enable that. The MongoDB clusters may reside in different physical locations; as long as there's connectivity between the data plane and the control plane, everything should work out of the box. Finally, the control plane doesn't need to be deployed on top of Kubernetes; some customers choose VM deployments for it. Now let's have a look at some of the pitfalls we fell into while designing the operator. Over to you, Raj.
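The "reference supplied in a standard form" can be sketched as a ConfigMap like the following (key names and values are illustrative placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-project
  namespace: mongodb
data:
  baseUrl: https://ops-manager.central.example.com:8443  # centralized control plane
  projectName: payments-team-dev                         # project in Ops Manager
  orgId: <organization-id>
```

A MongoDB CR in any connected data-plane cluster can then point at this ConfigMap, which is what makes the centralized control plane pattern work.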
Speaker 1: 08:26 All right. Running a stateful application like a database on Kubernetes, two of the areas where we had to be more thoughtful while evaluating the trade-offs and the design decisions were storage and networking. On the storage side, we have tried to be as agnostic as possible of the underlying storage that our users provision for the database. We do provide some defaults, but we let users configure most of the parameters related to storage. For example, users can leverage CSI topology to co-locate the database pods alongside the storage, and they can override persistence settings for MongoDB journals and oplogs. All this is possible via the custom resource, where they can provide the override settings. Lastly, we would like to call out a particular upstream issue around supporting volume resize via the StatefulSet. Currently, you would need to patch the volumes of the StatefulSet pods and recreate the StatefulSet to resize the storage. We are in the process of automating it so that our users don't need to perform these manual steps, but having support for this upstream would be the ideal fix. On the networking side, basically we have two communications happening. One is between the database nodes to perform operations like quorum or replication, and the second is from the database clients, or the drivers, talking to the database.
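The manual resize workaround mentioned above can be sketched roughly like this (resource names are placeholders, and this assumes the StorageClass allows volume expansion):

```shell
# 1. Grow each PersistentVolumeClaim that backs the StatefulSet pods.
kubectl patch pvc data-my-replica-set-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# 2. Delete the StatefulSet object without deleting its pods, so the
#    database keeps running while the object is recreated.
kubectl delete statefulset my-replica-set --cascade=orphan

# 3. Recreate the StatefulSet with larger volumeClaimTemplates; it
#    re-adopts the still-running pods via its label selector.
kubectl apply -f my-replica-set-statefulset.yaml
```

This is exactly the kind of multi-step, error-prone procedure the operator aims to automate.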
09:50 If both of these communications are happening within the cluster, things are pretty straightforward, and we leverage the headless service that we create for the MongoDB StatefulSet to enable them. Things get slightly interesting when we start looking into the DB drivers living outside the Kubernetes cluster.
10:07 We leverage something called the split horizon feature in the server. I won't deep dive too much into it, but at a high level, think of it like this: the server has a notion of multiple address spaces, which means that within the cluster it can talk over the service FQDN, and for outside the cluster it can leverage the load balancer hostname for the DB clients. It does this by maintaining a lookup table where the key is basically a horizon, or the view, and the values are the server addresses corresponding to that view. The advantage of this approach is that the internode communication still happens without leaving the cluster, so there's low network latency, while the drivers are still talking over a load balancer. The second way is to override the addresses of each server with a load balancer hostname that we provision for each of the MongoDB nodes. The reason someone would do this is that their cluster certificate authority might not be able to provision certificates for the local FQDN, which is *.svc.cluster.local, so they need one central hostname to serve traffic. As you would have guessed already, the downside of this approach is that the internal DB communication is also happening over a load balancer, which might entail some latency. Lastly, we are looking into leveraging service mesh options, which I will talk about a bit more in the next slide.
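In the operator's custom resource, configuring split horizon boils down to mapping each replica set member to its external address. A hypothetical fragment (the field path and hostnames are illustrative, not quoted from the talk):

```yaml
spec:
  connectivity:
    replicaSetHorizons:    # one entry per replica set member, in pod order
      - external: mdb-0.lb.example.com:27017
      - external: mdb-1.lb.example.com:27017
      - external: mdb-2.lb.example.com:27017
```

Drivers that connect via the external hostnames are handed back addresses from the "external" horizon, while in-cluster members keep using the headless-service FQDNs for replication.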
11:31 Alright, multi-cluster is one of the topics we are excited to talk about, where we let users run MongoDB deployments across multiple isolated Kubernetes clusters, potentially spread across different regions. Currently, we only span the data plane across multiple Kubernetes clusters. We have this notion of central and member clusters. The central cluster is basically where the control plane runs, and the member clusters are for the data plane. Having said that, if users want to optimize for resources, they can leverage the same cluster for both the control plane and the data plane.
12:10 As I briefly touched upon in the previous slide, we heavily recommend our users leverage a service mesh solution to solve the problem of MongoDB server discoverability across clusters. However, users can use their own networking solution if need be; we have tried to make the design fairly agnostic of the networking solution they choose. One of the capabilities I would like to briefly touch on is handling DR scenarios. Users can leverage the operator to handle DR scenarios automatically, where the operator health-checks each of the member clusters and verifies whether it is down. If a cluster is failing health checks, the operator tries to shuffle nodes around to healthy clusters based on an algorithm that distributes the workloads efficiently and evenly. For customers who don't want to use the automatic DR feature, we also provide a CLI experience, which users can leverage to perform the recovery manually by interacting with the custom resources. I'll now hand over to Sebastian to talk about a few of the future challenges.
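A multi-cluster deployment can be sketched with a custom resource along these lines (the kind and field names are illustrative of the pattern, not an exact schema, and cluster names are placeholders):

```yaml
apiVersion: mongodb.com/v1
kind: MongoDBMultiCluster
metadata:
  name: multi-replica-set
  namespace: mongodb        # lives in the central cluster
spec:
  type: ReplicaSet
  version: "6.0.5-ent"
  clusterSpecList:          # member clusters hosting the data plane
    - clusterName: member-cluster-eu-1
      members: 2
    - clusterName: member-cluster-us-1
      members: 1
```

During a DR event, the operator (or the CLI) effectively rebalances the `members` counts in this list towards the healthy clusters.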
Speaker 2: 13:20 At the moment, we are working on a few very exciting projects that I wanted to share with you. As Raj mentioned, we use the hub cluster pattern for multi-cluster deployments. Now we'd like to take this one step further and think about how to make the control plane more highly available. We are exploring different possibilities, including operator failover, or designing it in such a way that it can be moved from cluster to cluster very quickly. Another very interesting topic that recently popped up is allocating different resources to specific pods in a StatefulSet. In order to unlock the full potential of MongoDB, the primary needs to have faster disks, because it performs the writes, and more memory, because of query aggregation. The existing StatefulSet implementation doesn't provide such options, but we are investigating the LeaderWorkerSet, which could help us here.
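To illustrate, a LeaderWorkerSet could in principle give the primary a different pod template than the secondaries. This is a speculative sketch (the apiVersion and structure follow the upstream LeaderWorkerSet project, but the mapping to MongoDB is an assumption, and image names and sizes are placeholders):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: mdb-replica-set
spec:
  replicas: 1                # one group, i.e. one replica set
  leaderWorkerTemplate:
    size: 3                  # 1 leader (primary) + 2 workers (secondaries)
    leaderTemplate:          # primary: more CPU/memory, faster disks
      spec:
        containers:
          - name: mongod
            image: example/mongodb-server:6.0
            resources:
              requests: { cpu: "4", memory: 16Gi }
    workerTemplate:          # secondaries: smaller footprint
      spec:
        containers:
          - name: mongod
            image: example/mongodb-server:6.0
            resources:
              requests: { cpu: "2", memory: 8Gi }
```

One open question with such a mapping is that MongoDB elects its own primary, which can move between pods while the "leader" pod stays fixed.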
14:25 The next one is external connectivity. MongoDB uses an intelligent driver that picks specific nodes in a replica set. For that reason, we require deploying a load balancer in front of every replica set member. We are experimenting with sacrificing a bit of performance and making this scenario a little more user-friendly. There are a number of options on the table, including our own mongos processes, which aggregate data from multiple nodes and can act as routers, but we are also considering using Envoy proxy and teaching it certain aspects of the MongoDB wire protocol so as to route the traffic effectively. Finally, the Gateway API will help us provide a streamlined experience. Last but not least is telemetry. Anyone who has tried to collect telemetry data knows that customers quite often disable it in production, and gathering data from dev and staging clusters doesn't really make sense. So we are experimenting with different approaches, but as always, we'd like to act as good citizens: collect only the data we are allowed to, anonymize sensitive information, and ensure we have enough insights for our customers so that we can drive our business effectively. Having said that, thank you all for listening, and we are ready to take questions.
Speaker 3: 16:20 Thank you. Thank you for the presentation. You mentioned multi-cluster, and I saw in the diagram that you were showing that you need a dedicated centralized cluster, right? Can you explain exactly why?
Speaker 1: 16:40 Yeah, thank you for the question. So the reason we have a separation of the central and the member clusters is primarily separation of concerns. We want to dedicate a particular cluster to only the control plane and a particular set of clusters to the data plane. This is pretty similar to how Kubernetes works in general: you have the master nodes for the control plane and the regular nodes for your actual workloads. Having said that, as I mentioned, if you want to optimize for resources, you can choose the same cluster for both the control plane and the data plane. We don't make it a hard requirement.
Speaker 2: 17:19 I think we even have some customers that run two clusters only, and they squash everything into those two clusters. So there are many different options for how you can deploy things. We don't set any hard requirements, so it's really up to the customer how to deploy it.
Speaker 4: 17:39 Hello. I wanted to ask you about StatefulSets, because you briefly touched on several topics that probably stretch StatefulSets a bit by now. I just wanted to ask your thoughts on migrating to something else, maybe managing the pods on your own, or the other solutions you mentioned. We do have a great example with CloudNativePG, which switched: the other Postgres operators typically use StatefulSets to some extent, but CloudNativePG does it differently. So I was just wondering what your options are here, and whether you see any benefits in switching away from StatefulSets?
Speaker 2: 18:21 The honest answer is we don't have any straightforward answer. We started with StatefulSets, and later, as we were developing the operator, especially when we were designing the multi-cluster capabilities, we experimented with writing custom controllers to do it. We even experimented with some controllers that were available back then, but that was a few years back, so the ecosystem wasn't as rich as it is today. For now, we've noticed a very interesting proposal from Google, the LeaderWorkerSet. It's designed for machine learning; however, it enables us to set different amounts of resources and different expectations for very specific nodes, like the primary and the secondaries. So that's one of the things we hope our development could explore later on. But we are open to any suggestions, so if you have any good things we could explore, we would be more than happy to take that feedback.
Speaker 4: 19:31 Okay, let’s discuss it. Thank you.
Speaker 6: 19:45 Hey, thanks for the presentation. Regarding the challenges that you showed in the last slide... oh, sorry. So for the external connectivity, basically, I guess Mongo is using TCP. What do you guys think about using L4 LBs instead of going through an ingress or adding one more hop between clients and the database?
Speaker 2: 20:15 Sorry, could you please repeat the question?
Speaker 6: 20:17 So instead of using an ingress, or one more hop between clients and the database, what are your thoughts about using L4 LBs in between?
Speaker 2: 20:30 L4? You mean when we use a load balancer? Maybe let's explore the scenario here: the client is outside of the cluster, we have the MongoDB cluster inside Kubernetes, and we are trying to connect to it.
Speaker 6: 20:46 Yes. So basically, we are talking about a database here, so it's a huge amount of data. When we add more hops in between, it might add some latency. So to avoid that, do you have any ideas to reduce the number of hops in between? For example, Cilium provides load balancing at the L4 level.
Speaker 2: 21:09 Okay, so I think this problem could be split into two pieces. One of them is the scenario where the client application is outside of the Kubernetes cluster, and we need L4 load balancers to connect to the MongoDB cluster, because the MongoDB wire protocol is built on top of TCP; it's not an HTTP protocol. That's why we need an L4 load balancer for it. We also explored, for simplification, the scenario of having replica set members connect through load balancers internally, so the node replication, primary to secondary. We came up with an optimization: we patched CoreDNS to avoid hairpinning. That means we tricked the DNS into returning the proper addresses inside the Kubernetes cluster. It is definitely not a solution that is production ready; it probably satisfies only a few customers because of the complexity of the whole setup. But again, it depends whether we optimize for simplicity, and then it's probably fine, or for performance, and then we definitely need to do something about this.
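The DNS trick described here could look roughly like a CoreDNS `rewrite` rule that maps the external load balancer names back to in-cluster service names, so replication traffic never leaves the cluster. A hypothetical Corefile fragment (hostnames and service names are placeholders, not the actual patch used):

```
.:53 {
    # Resolve the external LB hostnames to the in-cluster services, so
    # replica set members avoid hairpinning through the load balancer.
    rewrite name regex (mdb-[0-9]+)\.lb\.example\.com {1}.my-replica-set-svc.mongodb.svc.cluster.local answer auto
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . /etc/resolv.conf
    cache 30
}
```

The `answer auto` option rewrites the response back to the queried name, so clients still believe they resolved the external hostname.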
Speaker 1: 22:35 Just something to add to that: since we mentioned in the slide that we try to keep our deployment as agnostic as possible of the underlying infrastructure, and since you mentioned Cilium, we do have a few users who have gotten both single-cluster and multi-cluster running using Cilium as the CNI and also as the service mesh. So there are a lot of solutions we might not be aware of, but our customers already have them running in production.
Speaker 7: 23:05 Okay, thank you.
Speaker 2: 23:18 Okay, thank you again. One more?
Speaker 7: 23:23 I'll approach them.
Speaker 2: 23:23 There's time for one more? Yes, I'm sorry.
Speaker 7: 23:27 Hello. Thank you. When spreading the data plane across multiple clusters in a cross-region replication model, are there certain latency thresholds after which things fall apart, or is that not known definitively?
Speaker 1: 23:45 So if I get the question correctly, you're asking whether, with cross-region replication for multi-cluster, we have hit some bottlenecks around latency. Yes, that's a great question. We have recently GA'd this feature, and we have some network latency requirements that our customers basically have to adhere to in order to get multi-cluster running. But we are very much aware that at some point the network latency would probably impose a limitation on the way the replication works. There could be some data drift, but we don't have a good answer for that yet, and we are still investigating how to either lower that latency or basically hack on the MongoDB wire protocol. Okay, thank you again. Thank you.
Data on Kubernetes Community resources
- Check out our Meetup page to catch an upcoming event
- Engage with us on Slack
- Find DoK resources
- Read DoK reports
- Become a community sponsor