As part of DoK Day at KubeCon NA 2021, we brought together Western Digital Senior Director, Strategic Initiatives Rick Vasquez, Oracle Product Manager Mickey Boxell and PlanetScale Software Engineer and Vitess maintainer Deepthi Sigireddi to discuss what the future may hold for database providers, moderated by Alex Williams, founder of The New Stack. You can access the DoK Day at KubeCon NA 2021 talks here.

Putting stateful workloads into Kubernetes used to be a challenging task that StatefulSets have made much easier. Kelsey Hightower shared that the Google team had, against all odds, created a Kubernetes operator for Oracle database earlier this year. Panelist Rick Vasquez posits that mainstream database providers should not only create Kubernetes operators, but also develop the database itself with Kubernetes awareness and deep integration. These are some of the reasons we decided to organize this panel. Watch it below!

Alex Williams: Hey everyone, it’s Alex Williams. I am so excited to be here for this discussion for Data on Kubernetes, where we’re really looking at databases on Kubernetes. It reminds me a lot of the discussions I often have with people about how you think about Kubernetes overall, and it really gets to the fundamental issues that I think a lot of people are facing. Kubernetes was built as a stateless environment, but almost everyone has stateful applications. So how do you think about your storage? How do you think about it on-premises versus off-premises? How do you think about the databases themselves if you’re thinking about microservices, which is really what Kubernetes is supposed to be there for? I am really excited for this discussion with my guests. I’d like to go around and have them introduce themselves. Let’s start with Rick Vasquez.

Rick Vasquez: Hi, my name is Rick Vasquez. I’m a technologist. I’ve been in the data game for quite some time, whether that’s at software companies managing data infrastructure, whether that’s at database companies, creating new versions of databases, or whether that’s at hardware companies focusing on how we store data. So really, I’ve been pinned into this data world, and I love it. And Kubernetes is something that’s near and dear to my heart, I was, you know, I’ve come full circle. So I used to be very anti-Kubernetes, for anything except for statelessness. And now I’ve come full circle. And now I think I’m one of the biggest proponents of putting all of your data in Kubernetes.

Alex Williams: Interesting! Next we have Mickey Boxell.

Mickey Boxell: Hey, my name is Mickey. I’m a product manager with the Cloud Native Services Group at Oracle. Oracle, of course, is a company that’s known for its history with databases, and I suppose that’s why I’m here. I’ve been working with Kubernetes for a few years now. And we’re starting to do more work on operators and other things that simplify the process for storing databases and other sorts of stateful information within Kubernetes clusters. And I’m excited to talk about some of that today.

Alex Williams: Awesome, and next we have Deepthi.

Deepthi Sigireddi: Hi, I’m Deepthi Sigireddi. I work at PlanetScale as a software engineer, where I work on Vitess, which is the CNCF graduated open source distributed database that works on top of MySQL. I am a maintainer and tech lead for Vitess, both in the open source community and at PlanetScale, where we are building a Database as a Service, or what we like to call a database for developers, which runs on top of Kubernetes and is built using an operator. So this is very much the space that we are in, and I have opinions on why we do that.

Alex Williams: Well, where do I start? Because I hear Rick talking about coming full circle, I hear Mickey talking about his database background with Oracle, and Deepthi, I’m curious about your opinions. Maybe we’ll start with you, Deepthi. Tell us about Vitess. What is it? It’s an open source project, and that leads to questions about open source and Kubernetes overall. How do you balance that in terms of the overall equation for customers who are often building on proprietary architectures themselves? How do they marry these two issues?

Deepthi Sigireddi: Vitess itself has been around for over 10 years now, and it was open source from the beginning. It came out of YouTube’s problem of how to scale MySQL, because they were running on MySQL, they were seeing explosive growth, and their data architecture at that time simply could not keep up. So I think that, having been open source, Vitess has benefited from multiple companies and multiple production installs running into different sorts of requirements and contributing those pieces back into open source, which is something that absolutely would not have happened if it had stayed proprietary. The other thing that happened with Vitess very early on is that the YouTube team had to move Vitess from YouTube’s infrastructure into Google’s infrastructure, which meant that it had to run within Borg, which was a precursor to Kubernetes. So Vitess was actually able to run on Kubernetes very early, in 2015, even before Kubernetes 1.0, and this was without persistent volumes and without many of the features that have been added to Kubernetes since then. So it has been possible to run a database on Kubernetes for a while, but it required quite a lot of work to make that actually happen.

Alex Williams: I would love to bring it back to you, Rick, because this probably brings us to your journey around this circle. I expect that in those 1.0 days (referring to the version of Kubernetes), you were not as keen about Kubernetes as you are now. Can you take us back to those days when you were first immersed in Kubernetes? What were you skeptical about? And then what has won you over?

Rick Vasquez: So I remember vividly going to a happy hour that we host here in Austin, with a bunch of technologists in town, and getting into an argument about putting databases in Kubernetes. I was very adamant that you should never put anything stateful in Kubernetes, ever. Because why would you do that, right? It’s an immutable container. Never change it. If you’re going in and even changing configs outside of the API, you’re doing it wrong. That is the stance that I used to have on Kubernetes. And you know, we’ve come a long way.

Alex Williams: How’d you come to that stance?

Rick Vasquez: How did I come to that stance? Because if your container goes away, or you have to manage each one of those containers, like a VM (Virtual Machine), you’re not gaining anything from Kubernetes at all, right? So now, you have this container orchestration tool, that’s not orchestrating anything but a bunch of containers that are all different from each other. You’re not going to run Puppet and Chef and everything in a bunch of containers, that’s like defeating the purpose of having orchestration.

So everything should just be configured the same, every container should be the same. And that’s, I think, where the power of Kubernetes is. It doesn’t matter where I deploy it, it deploys the same way, whether that’s on my laptop, whether that’s in a dev or QA (Quality Assurance) environment, or whether that’s in production infrastructure. The main difference is how much resource allocation am I going to give that container. So that was a really big advantage that I saw early on with Kubernetes. 

But what I didn’t see is, well, databases change literally all the time, right? That’s the entire point of a database. Maybe the runtime environment is living in an immutable state, but everything that it’s doing is changing data, and the entire point of a database is to manage data changes. So the two concepts to me were just like oil and water.

Until persistent volumes came around, and until there was this big push in Kubernetes to make persistence a first-class citizen, I thought this shouldn’t even be considered. And now we’ve come so far that persistence is a first-class citizen.

And you should be thinking about how to get your data into Kubernetes, to leverage all of the consistency that you can gain from a deployment methodology and pattern. One of the greatest things about Kubernetes is that it allows you to get away from your traditional SAN architectures or your network-attached storage, where you have centralized appliances that hold all of your disk and data and you have to worry about tiering and everything else.

Now, you can have a disaggregated set of hardware resources that are either directly attached or connected over Fibre Channel, and have that orchestrated with some type of container system that is just for storage.

An exciting thing is that Kubernetes has almost an entire layer that you can develop that is storage only and has nothing to do with the databases, just to make sure that your data is safe. It’s in multiple places, and there are replicas available, so if you need to reschedule a pod, that data is always going to be available somewhere. And that’s something that’s happened in the last two years within Kubernetes. I know we’ve been working towards it for a long time, but now it’s here and it’s here to stay. OpenEBS is a leader in the space, you’ve got StorageOS, and you’ve got Portworx, which got bought, all cash, for a ridiculous amount of money during a pandemic. That’s because people are pushing to get away from ‘I have to buy a storage box, then put a LUN out there and attach the VM that I’m running my database on to it.’ Now it’s ‘I don’t care where my database resides as long as my data is consistently performing,’ and that’s a big unlocker.

Alex Williams: Network-attached storage and appliances: that’s Oracle’s talk, if you ask me. Mickey, it sounds like they were singing your tune: ‘Wait a second, what are you doing here in this group with horizontal scaling? Vertical is the way to go as far as Oracle is concerned.’ Am I wrong?

Mickey Boxell: Yes, that is definitely our wheelhouse. For what it’s worth, there are still plenty of folks running centralized, vertically scaled architectures. Working in our Cloud-Native services group, I sometimes forget that myself.

Now with that being said, I do want to talk about a few of the ways that we’ve actually embraced distributed computing and moved towards this Cloud-Native model. I’ve been at Oracle for over seven years, and even before I got there, we had tools like GoldenGate, which is a real-time data mesh platform, and also Real Application Clusters, which allow you to run a single database across multiple servers that access shared storage. Those have been around since before companies were jockeying over whether to use Kubernetes or Mesos as their container orchestrator.

There are also some other tools like Multitenant, which is a container database (not the Kubernetes kind of container) with pluggable databases that make it easy to move between on-premises environments, cloud databases, and even Kubernetes environments.

More recently, there are newer innovations, like sharded databases, where you distribute segments of data across different systems that can be on-premises or in the cloud. And we actually now offer sharded databases via Helm charts. So this is all to say that, as a company that maybe traditionally wasn’t in this space, we are embracing containerization of applications and tools like Kubernetes. Moving forward, we’re going to start offering a greater amount of support using operators. I think operators are very cool.

Alongside things like PVCs and StatefulSets, I think operators allow customers to run stateful workloads more reasonably.

We can acknowledge that the Kubernetes API was not necessarily built with stateful workloads in mind originally. But if we can extend that API to control resources in a cluster that aren’t part of that API originally, or resources outside of a cluster, including other cloud services or even things like on-premises databases, I think that ends up being a really powerful tool to connect what people consider to be legacy systems to this new Cloud-Native stateless approach.

I think that it’s really important for us to embrace this change. We’re seeing things like El Carro from Google, which is a great first step and allows you to provision databases within the Kubernetes cluster. But we have an operator coming out that does a little bit more than just that, and I’m very excited about it. Not only can you provision a very generic Oracle Enterprise Edition database, you can also provision the sharded databases that I was talking about, where each shard is its own pod in a cluster with its own PVC attached. You can also use it to connect to our Autonomous Database service, which is a fully managed database in the cloud and an alternative approach outside of Kubernetes, or even to connect to a database deployed on-premises, in your data center, or on your local machine. You can use Oracle REST Data Services to speak to that using an operator deployed to your Kubernetes cluster. So I think there are some cool opportunities for innovation that come along with adopting this approach.

Alex Williams: Deepthi, I want to come back to you because I know you have some opinions. Is there anything that you find striking about what was just discussed? I’m also curious about some of the changes you’ve had to make with Vitess and PlanetScale over time for the Kubernetes environment. What is your approach to operators? Maybe you could help us shine some light on yours in the past year and how it reflects on your perspective on operators overall.

Deepthi Sigireddi: Vitess can be run in Kubernetes even without an operator, but an operator just makes the management part of it much easier. In order to run something like MySQL in Kubernetes, you need a few different elements. The first thing is that if the primary that is taking writes goes away, you need a way to quickly detect that and fail over to a replica. It goes without saying that you should be running in replicated mode; without that you simply cannot survive a pod restart. The second thing is that, even though you’re running in a replicated mode, asynchronous replication doesn’t cut it either. If you’re running with asynchronous replication, then you will almost certainly lose data when there is a pod restart. So you need to be running in a mode where any changes that the primary took, but did not persist to disk before going down, have actually been communicated to at least one replica, and for that, Vitess recommends semi-synchronous replication. The other thing that Vitess does is scheduled backups. So even though you may not need to use those backups, it is possible on a failure to restore from a backup and come back online. It should not be necessary if the failure detection and the failover work correctly, but it is there for you to rebuild the cluster if something catastrophic happens. So those are some of the elements that had to come together.

At PlanetScale, there was a very robust discussion around whether we should be using persistent volumes, or whether we should be using local PVs or just local disk. Ultimately, we came down on the side of using persistent volumes, just because the complexity of recovery with local disk is much higher, which means that there is more room for things to go wrong. The strategy with local disks is that you detect the failure, you fail over to a replica, and then you bring up another replica so that your replica count doesn’t go down. So you do have that intermediate state where you’re short by one replica, and depending on how many replicas you’re running, that may or may not be tolerable.

So those are some of the considerations that went into how we did the operator at PlanetScale, and we built the operator using the Operator SDK. There’s Operator SDK, there’s Kubebuilder, there are some tools and frameworks that make it easier to build operators so that you’re not starting from scratch. Though I must say that the reconcile loop model that the Kubernetes controller runtime gives you is a little bit non-intuitive. When things go wrong, it’s pretty hard to reason about which run of the reconcile loop things went wrong in, and why. So looking ahead to the future, it will be really nice if people are able to simplify how a stateful application like a database has to interact with the controller runtime in order to provide the availability, durability, and all those good things that databases give us.
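The failover steps described above (detect a failed primary, promote a replica that has received the latest changes, then bring the replica count back up) can be sketched as a single reconcile pass. This is a toy model under stated assumptions, not how Vitess or the PlanetScale operator is implemented: `Node`, `Cluster`, `applied`, and `reconcile` are hypothetical names, and the `applied` counter stands in for whatever replication position a real database exposes. The idea is that with semi-synchronous replication at least one replica holds every acknowledged write, so promoting the most advanced healthy replica loses nothing that was acknowledged.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    applied: int = 0  # hypothetical: position of the last transaction received


@dataclass
class Cluster:
    primary: str
    nodes: dict = field(default_factory=dict)  # name -> Node
    desired_replicas: int = 2
    next_id: int = 0  # counter used to name replacement replicas


def reconcile(cluster: Cluster) -> list:
    """One idempotent pass: move observed state toward desired state.

    Returns the list of actions taken; an empty list means nothing to do.
    """
    actions = []

    # Forget nodes that were observed as failed.
    for name in [n for n, node in cluster.nodes.items() if not node.healthy]:
        del cluster.nodes[name]

    # Failover: promote the surviving node that has received the most changes.
    if cluster.primary not in cluster.nodes:
        if not cluster.nodes:
            return ["restore from backup"]  # nothing left to promote
        best = max(cluster.nodes.values(), key=lambda n: (n.applied, n.name))
        cluster.primary = best.name
        actions.append(f"promote {best.name}")

    # Bring the replica count back up to the desired number.
    replica_count = len(cluster.nodes) - 1
    for _ in range(cluster.desired_replicas - replica_count):
        cluster.next_id += 1
        name = f"replica-{cluster.next_id}"
        caught_up_to = cluster.nodes[cluster.primary].applied
        cluster.nodes[name] = Node(name, applied=caught_up_to)
        actions.append(f"add {name}")

    return actions


cluster = Cluster(
    primary="db-0",
    nodes={
        "db-0": Node("db-0", healthy=False, applied=42),  # primary pod is gone
        "db-1": Node("db-1", applied=42),  # received the last write
        "db-2": Node("db-2", applied=40),  # lagging behind
    },
)
print(reconcile(cluster))  # ['promote db-1', 'add replica-1']
print(reconcile(cluster))  # [] because the state has already converged
```

Running the same function twice and getting no actions the second time is the property a reconcile loop relies on: each run looks only at current state, which is also part of why it can be hard to tell afterwards which run made a given change.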

Alex Williams: There’s a lot to think about there. One of the things to think about is that the operator keeps coming up, and operators are still pretty complicated for most people to build out and to actually manage and run. One of the things Rick talked about was that, early on, container orchestrators just seemed like they didn’t add anything to the game for databases. With data, do you think there is a need to think through the orchestration of the data itself? There are storage orchestrators that we’re starting to see emerge. Does data fit into that as an orchestrator? I’m trying to balance this concept of orchestrators versus operators.

Rick Vasquez: So I think there are two ways to think about the world. The first is the orchestration of where I deploy my assets or my applications, whether that’s a database, the storage, or all the various components that are available to your systems. We’re now disaggregating resources and we’re saying, “Hey, what’s the best way to play Tetris with my resources so that I can have databases that are far enough away from each other that if one fails, the other doesn’t fall over as well?” That’s, I think, what Kubernetes is masterful at.

Then you have another layer above that, and this is where operators come in. Kubernetes can’t be aware of the way that MySQL works, the way that MongoDB works, the way that Postgres works, the way that Oracle works. You know, I can keep going for a long time. And that’s what operators really allow for within Kubernetes. In that landscape, you can have a controller that says, “Hey, I know a lot about MySQL, right? I know what happens with MySQL. And I also happen to know a lot about Kubernetes. So I can take those two concepts and join them together to make something that is a consistent experience.” I think that’s the real value in operators.

Now, I’ll go one step further with a hot take and say that running a database in Kubernetes without an operator is an unsafe mode of operation, unless you have a DBA staffed 24/7, 365 days a year, who is always logged in, just watching the Matrix, the error logs. I mean, that’s what an operator really does. You have situations that come up, whether that means replication lag, whether that means one of your PVCs has blown up, or whether that means a pod got rescheduled. How do you stitch that replication topology back together? How do you stitch it back in the database flavor that you have chosen? And then how do you stitch it back into all the services that rely on it? Because not all the services are relying on just the primary, right? Some of them are really heavily reliant on the read secondaries, the followers that are available. So I think the operator is really important to stitch together what Kubernetes brings to the table from a disaggregated resource allocation perspective and the complexities of having a really distributed system, so that you don’t need somebody doing all of the nitty-gritty configuration every time something happens. Right? It’s very upsetting.
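The "stitching back into the services" step mentioned above can be illustrated with a small sketch of the database-specific knowledge an operator would carry. It assumes a hypothetical data model in which each node reports a `healthy` flag and a replication `lag` in seconds; `routing_table` and `max_lag_seconds` are illustrative names, not part of any real operator or Kubernetes API.

```python
def routing_table(nodes: dict, primary: str, max_lag_seconds: float = 5.0) -> dict:
    """Decide which nodes should receive writes and which should receive reads.

    nodes maps a node name to {"healthy": bool, "lag": float}. Both fields are
    hypothetical stand-ins for what a real database would report.
    """
    # Writes go only to a healthy primary.
    writers = [primary] if nodes[primary]["healthy"] else []

    # Reads go to healthy replicas that are not too far behind the primary.
    readers = sorted(
        name
        for name, node in nodes.items()
        if name != primary and node["healthy"] and node["lag"] <= max_lag_seconds
    )

    # If no replica qualifies, fall back to serving reads from the primary.
    if not readers:
        readers = list(writers)

    return {"write": writers, "read": readers}


nodes = {
    "db-0": {"healthy": True, "lag": 0.0},   # primary
    "db-1": {"healthy": True, "lag": 1.2},   # replica keeping up
    "db-2": {"healthy": True, "lag": 30.0},  # replica lagging badly
    "db-3": {"healthy": False, "lag": 0.0},  # pod was rescheduled
}
print(routing_table(nodes, primary="db-0"))
# {'write': ['db-0'], 'read': ['db-1']}
```

Kubernetes on its own can tell that a pod is running; deciding that a running replica is too stale to serve reads is the kind of judgment that has to come from something that understands the database.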

Alex Williams: I want to go to Mickey, and then finish up with Deepthi because we only have a few minutes. So Mickey, what’s that new stitching? What’s that new stitching that you’re saying needs an operator?

Mickey Boxell: I would say that is a good question. We just talked about how there are so many different “keeping the lights on” activities for database administration, and there is seriously no need to do that all on your own. You’re more than welcome to try it: you can run RMAN backups to make sure that you have a recent copy of your database, you can do your own restores, you can do all these different activities that are operationally taxing without adding any value to your business. On the other hand, you can simply use an operator that is built by somebody who knows what they are doing. So I think it is absolutely important to embrace the operator framework and rely on having subject matter experts develop operators for you in a standardized way. One thing I also want to talk about is that we have this strange vilification of people who use stateful applications in Kubernetes. A part of that is what Rick talked about earlier, relating to the fact that Kubernetes initially was not the best place for stateful applications or any sort of data persistence. But now that we’ve solved a lot of those problems from a technology perspective, I think it’s really important to shift away from vilifying those types of workloads and instead embrace them. Obviously, I’m coming from a company where we have a lot of customers that are running what people refer to as legacy applications. Sure, stateless might be the right approach for a lot of container-native greenfield projects, and some brownfield ones as well, but we have to acknowledge that enterprises still run a lot of stateful applications and connect to services that are not running on Kubernetes. Not every development team has the bandwidth or the expertise to decompose and containerize their apps.

Alex Williams: Since you talked about vilification and brought up narratives that revolve around villains and heroes, I’ve got to ask you this question: every story has a villain and a hero, so who is the villain here, and who is the hero?

Mickey Boxell: I would say the hero is just following the CNCF policies of openness. It is about making sure we acknowledge that every single type of workload has a seat at the table in the Kubernetes space. The villain (not to name names) is everybody who is contrary to that opinion. I don’t think that we should just exclude these stateful workloads because they are an artifact of the past. There are just so many people with legacy applications who want to use this container-native approach. Oracle has all these massive applications like E-Business Suite and PeopleSoft. I have had people come to me and say, “Mickey, how do I containerize this and use a CI/CD pipeline to deploy changes to my PeopleSoft?” and I am like, “Well, that is a question for a completely different panel.”

At the end of the day, I just want to say that we owe it to the community to both recognize that stateful applications are here to stay and also to make the onboarding process as simple as possible.

Alex Williams: And they are worth billions and billions of dollars so they’re not going anywhere.

Mickey Boxell: Exactly. 

Alex Williams: Deepthi, any last thoughts before we finish it up? Where do you see databases and data orchestrators going? What do you think the future of operators is going to be like?

Deepthi Sigireddi: If we look back at the past couple of years, almost every popular database now has a Kubernetes operator. Either the people who produce that database or some third party has figured out how to run the majority of popular databases in Kubernetes. Every one of them has done it. We have an operator. It’s still possible to do it without an operator, but I have to agree with Rick and Mickey that operators are the way to go. Maybe there will be some better way of doing it in the future, but I do want to say that the people who are building the databases can probably do better in terms of providing utilities and hooks that help people who are writing operators manage failure modes.

Alex Williams: So that they can do better in terms of managing failure modes, is that what you are saying?

Deepthi Sigireddi: Yes, failure detection and recovery.

Alex Williams: What are the kinds of failures we’re seeing? What can they do better?

Deepthi Sigireddi: Well, in a Kubernetes environment, when a pod restarts it looks just like the process is crashing.

Rick Vasquez: I have a really good example of this. While I was working on a cluster database solution at one company, we had an operator for it. Whenever the entire cluster went down, if you didn’t start the pods in the exact order, everything would crash again because it couldn’t get back to its correct state. All it needed was just a little bit of change to the database like, “Hey, I’m the first node up, make everybody elect off of me”. But neither the operator nor the database is smart enough to do that. Kubernetes has no awareness other than, “I am supposed to start these things”.

Deepthi Sigireddi: That’s really interesting, because one of the things we did with ours is that components can come up in any order. They will discover each other and eventually settle into a stable state. It usually takes maybe a minute, but not hours.
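The contrast between a cluster that must be started in an exact order and one whose components "can come up in any order" and still settle can be shown with a toy simulation. The rule used here (once a majority of nodes is up, every node follows the lowest-named running node) is an assumption made purely for illustration; it is not how Vitess or any particular database elects a leader, and `converge` is a hypothetical name.

```python
from itertools import permutations


def converge(start_order, cluster_size):
    """Bring nodes up one at a time and return each node's final view of the leader.

    Toy rule: a node follows the lowest-named running node once it can see a
    majority of the cluster; before that it follows nobody (None).
    """
    up = []
    views = {}
    for node in start_order:
        up.append(node)
        # Every running node re-evaluates using the peers it can discover now.
        for member in up:
            if len(up) > cluster_size // 2:
                views[member] = min(up)
            else:
                views[member] = None
    return views


# Whatever order the pods start in, the cluster ends in the same agreed state.
for order in permutations(["a", "b", "c"]):
    final = converge(order, cluster_size=3)
    assert set(final.values()) == {"a"}, order
print("all start orders converge")
```

Because each node decides from what it can currently discover rather than from its position in a startup sequence, an orchestrator that only knows "I am supposed to start these things" is enough to bring the cluster back.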

Alex Williams: There’s so much here to talk about. I want to thank everyone for participating. I’ve enjoyed the conversation with Rick, Mickey, and Deepthi. Thank you all for your time. We’ve covered topics from orchestrators and operators, to mythic villains and heroes, to the next steps and iterations of databases and data on Kubernetes.

Thank you all very much!