Intro to Persistence in Kubernetes

Sep 09, 2021 by Diogenese

So you want to run stateful containers? Check out a brief intro to persistence in Kubernetes. Kubernetes and persistent storage go together like oil and water. Kubernetes is inherently an ephemeral system and persistent storage by definition must survive. After this talk, you should have a clear understanding of how to get started on the path to successfully managing a persistent data storage solution on your Kubernetes cluster.

Watch the presentation below given by Eric Zietlow, Developer Advocate, and Kunal Kushwaha, CNCF Ambassador, Developer Advocate at Civo, and Student PM at DoK.

Bart Farrell:

What’s up everybody, we are here for a very extra special Data on Kubernetes session that’s gonna be an intro to persistence, talking about storage. We want to give you all some groundwork to get started with this, because we understand that if working on Kubernetes is a challenge, working with data on Kubernetes brings an extra special set of challenges. But that’s why we have two amazing guests here. I’m going to dedicate a rap to them before we get started, so we can get the right amount of energy. Remember, you can get your questions and comments going in the YouTube chat. We’ll be happy to address them there. I was just talking to the speakers as well: if for whatever reason we don’t get to address some of those questions here in the session today, we can continue the conversation on Slack and get those answered later this week. So let’s start with a little a cappella rap.

Bart Farrell (Rap):

You can call us Robin Hood because we’re taking down the share about to go into storage with Kunal Eric clerical errors aside, we bring in persistence to the game and take the speakers up to 11 loud and clear with volume claims. So keep the questions flowing like now tigers in Ganges MC awards go down. We’re clearing up the Oscars and Grammys. Gotta start out the right way. As I said, we have two amazing speakers with us today. We have Eric Zietlow and Kunal Kushwaha, both very experienced practitioners, both very experienced content creators as well. You can check out Kunal’s YouTube channel. Eric’s got plenty of stuff, too, on YouTube talking specifically about this topic. But just to introduce both of you really quickly, Eric, who are you? And what are we going to be talking about today?

Eric Zietlow:

I generally call myself a tech enthusiast, someone who’s been in the industry for a while. I’ve been in Kubernetes and distributed systems for a number of years. Before Kubernetes, I was actually working on distributed databases, so data persistence was kind of my life. Apache Cassandra, for anyone who’s heard of it: I was working with that for many years. I do a little bit of a lot of things as well. Being a software engineer, being a network engineer, doing support, you name it, I’ve probably done it.

Bart Farrell:

And Kunal, just for folks who may not know who you are, although I feel like it’s increasingly unlikely, just give us a refresher.

Kunal Kushwaha:

It’s very, very likely. Hey, everyone, I am Kunal Kushwaha. I am a CNCF Ambassador like Bart mentioned, I do content creation.

I’m currently doing developer advocacy at Civo Cloud. We have the CNCF Student community, which this session is in collaboration with. So we have some nice folks in the community, and we do cloud native sessions and webinars just like this one. I do have a YouTube channel. It’s called Kunal Kushwaha. There I post tutorials and workshops. And yeah, I love communities and open source. And that’s pretty much about me.

Bart Farrell:

Very, very good. Eric Zietlow, just to get a little bit of context, what makes this topic of data on Kubernetes so tricky for folks as an entry point, what are some resources you might recommend? What’s the best way to get started with this?

Eric Zietlow:

So the first thing from a resources perspective is the Kubernetes docs. I know it’s kind of cliché, but they’re actually really good, well filled-out docs that talk about pretty much everything we’re going to talk about today. We’re going to build a chain of infrastructure components, if you will, and all of those components can be found in the docs. I can provide links as well when we’re talking about them. Why is it important? This is the direction the world is moving. There’s definitely been a trend towards containerization, and now there’s a trend towards stateful containerization. I don’t want to spoil anything for later, but this is really the way things are going.

Bart Farrell:

And Kunal, in your case, how did you get started with Kubernetes? And then after that, one of the things that frequently comes up, and I’m sure it’ll come up in Eric’s presentation too, is the difference between stateful and stateless workloads. Could you maybe touch on that? First of all, how did you get into Kubernetes? And then, like I said, as an entry point into Data on Kubernetes, what do these two terms mean?

Kunal Kushwaha:

I am a student and I’ll be graduating next year.

I got started with Kubernetes when I was looking for some Java projects to contribute to. I found this Kubernetes Java client, which Red Hat middleware contributed to, and it was pretty interesting. That’s how I learned about Kubernetes.

I agree with the point Eric made. The documentation of Kubernetes is one of the best, it offers in depth explanation, and I think it’s definitely beginner friendly as well because I started with that only as a beginner. Other resources I would recommend are, first of all, attend meetups like this. If you want to learn about data on Kubernetes, definitely stay in touch with the Data on Kubernetes community. Join the Slack channel and stuff. Speaking just about specific Kubernetes stuff, you will also find plenty of amazing YouTube courses, just google Kubernetes resources and Kubernetes tutorials, you will find plenty. Just the important thing is to get started. There is no best resource or anything. People say, hey can you suggest some best resource? I don’t think there is like any, some people like learning from documentation, some people like learning from videos and tutorials. So it’s totally up to what you like to do. Speaking of stateful, stateless applications, we can go on and on with that. But for those of you who are like, since it’s a student’s session, just starting out, I think you’re familiar with Kubernetes. So if you are familiar you might know, there are pods and containers. So containers’ basically, your application is running in an isolated environment, you can imagine it’s like a box in which your application is running on the host operating system, continuously running on virtual machines, you can scale containers, downgrade containers. One thing you may know is that by default, if there is any data in your container, and your container gets corrupted, or gets deleted, that data will be lost. Speaking more about it, you know, Eric will show you a nice demo, he has a nice white board and everything. The basic thing that I’m saying is, let’s say your application is running in a container or box and that container dies or gets deleted. 
Let’s say you had a to-do list application, and you had a few tasks listed, those will be lost, the container will not store the state.

So, a stateless application basically means an application or a process in which the state does not get saved. You have sessions of your application, and many people might be making requests, but a request does not depend on previous sessions, and the application does not save reference information about past operations. A stateful application, on the other hand, uses a database, for example, to store some data, or in simple terms, stores some state from the client’s request and uses it when handling further requests. So, in brief, when the session information is stored, on the server for example, that is a stateful application. In stateless applications, no such information is stored for the operations you’re making. And with containers, if containers die, you can definitely start new containers without any trouble. That’s one of the advantages. So if you’re comparing stateless vs. stateful applications, like Bart was mentioning: stateless does not require your server to save information about sessions or state, while stateful does require saving some information related to the session. Another comparison we can make is the design: since a stateless application is not storing any information, its architecture is simpler than that of a stateful application.

Stateless applications since they’re not dependent on databases, then the crashes when they happen, containers go down. You can just spin up new containers, new instances. So, crash handling is much easier as compared to stateful applications. In this case, something called high availability, the servers here are regarded as, they’re long living. We can also talk about architecture, so like scaling architecture, for example. So if you want to scale stateless applications, then it’s much easier as compared to your stateful applications. There are few challenges when we’re talking about stateful applications. So, for example resources, so many containers that you may have, they may have some resource allocation. You might say this container should only run let’s say this much amount of CPU or storage or memory. So, if you have such containers, a stateless application, which will be much suited, but when it comes to some stateful applications, then in that case, this can not be a very good thing, because there is now a risk of losing data of your customers and having unreliable performance for example. So, another point can be related to storage, so, every stateful application we mentioned previously will need some sort of a storage associated with it. So, like a file system or a cloud storage or whatever, that can be difficult to determine, what kind of backup do you need for this? And if you want to migrate from one cluster to another cluster, how do you migrate the database? And one more thing, such challenges exist because initially the applications were not designed like the stateful like came afterwards. Stateless applications are very popular because they’re very fast and portable and everything. So it removes the overhead of creating and using sessions, and it scales horizontally very well. You can create new instances and just consistency across applications. So that’s like, it makes it much more comfortable for maintenance and to work with it. 
To answer your question about stateful v/s stateless, which one would you use? It’s not a question. I think it totally depends on the use case. Know what your application does, if you’re maintaining the state or not. Eric can touch a lot more on that as well.
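The stateless/stateful distinction and the to-do list example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and class names are hypothetical, and a "container restart" is simulated by creating a fresh object.

```python
# Illustrative sketch only; all names here are hypothetical.

def stateless_greeting(name):
    """Depends only on its input, so any replica can answer any request."""
    return f"Hello, {name}!"


class InMemoryTodoApp:
    """Keeps state, but only inside the running process (the 'container')."""

    def __init__(self):
        self.tasks = []

    def add_task(self, task):
        self.tasks.append(task)
        return len(self.tasks)


app = InMemoryTodoApp()
app.add_task("write slides")
app.add_task("record demo")
tasks_before_restart = list(app.tasks)

# Simulate the container dying and a new one starting from the same image.
app = InMemoryTodoApp()
tasks_after_restart = list(app.tasks)

print(tasks_before_restart)
print(tasks_after_restart)
```

The stateless function behaves the same on any fresh instance, while the to-do tasks disappear with the process unless they are written somewhere that outlives it.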

Eric Zietlow:

We’re gonna touch on a lot of that throughout the session here today. But yeah, really good overview. I can jump right in.

Bart Farrell:

Thank you very much Kunal, for your great outline.

Eric Zietlow:

So, real quick in the chat, just so I can get some idea of the level of the audience here, how many people have actually interacted with Kubernetes in production before? Can I just get a quick show of hands in the YouTube chat?

Let’s just start yelling out different components that make up Kubernetes. So you all start with one, I’m going to say, the master is obviously an important component for Kubernetes.

A master is basically a computer. It’s a computer that is going to control all the other computers in Kubernetes. So we have a lot of intro level stuff here, it sounds good, we can work with that.

So in Kubernetes, a cluster is just a grouping of computers that are all there to accomplish a job. You can think about it as a cluster with nodes in it, and then a manager managing all of those nodes. When we deploy a job, it basically sends that job to some compute resources. The jobs in Kubernetes are handled inside of pods and containers; these are terms we’ll define a little bit better. But essentially, those jobs get sent out to a computer where they’re run. And then if that pod doesn’t need to live anymore, we can kill the pod, or if that pod is maybe stateful, like Kunal was alluding to, then we have to do some stuff to make sure data sticks around. So first, we have a master. Next, we’re going to need some workers. For example, I’m just going to throw up two workers.

So we have our master, we have our workers. And generally what happens is the master, the workers all have sub components, things that help them to run. So the first thing on your master that you’re going to have is the API server. Okay, so API server, that’s how we, as users interact with Kubernetes. When you go into kubectl, that’s actually talking with the Kubernetes API server. A lot of the things you use that expand Kubernetes functionality will actually extend that API server and add additional commands you can run. There are also some HTTP clients to run against the API server. There’s a lot of really interesting ways to interact. The most common ones you’re going to see are through kubectl, which will talk to the API server for you. So you can actually run kubectl apply and then apply a YAML and do everything that way. The next piece is going to be your controller manager.
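Since kubectl and other HTTP clients all talk to the API server over REST, it can help to see how resource paths are laid out. The sketch below only builds path strings and never contacts a cluster; the helper name `resource_path` is hypothetical, while the path layout (`/api/v1/...` for core resources, `/apis/<group>/<version>/...` for grouped ones) follows the Kubernetes API conventions.

```python
# Sketch: build Kubernetes REST paths. resource_path is a hypothetical helper.

def resource_path(resource, namespace=None, group=None, version="v1"):
    """Return the API server path for a resource collection."""
    # Core resources live under /api/<version>; grouped ones under /apis/<group>/<version>.
    prefix = f"/apis/{group}/{version}" if group else f"/api/{version}"
    if namespace:
        return f"{prefix}/namespaces/{namespace}/{resource}"
    return f"{prefix}/{resource}"


pods_path = resource_path("pods", namespace="default")
claims_path = resource_path("persistentvolumeclaims", namespace="default")
volumes_path = resource_path("persistentvolumes")  # cluster-scoped
classes_path = resource_path("storageclasses", group="storage.k8s.io")

for path in (pods_path, claims_path, volumes_path, classes_path):
    print(path)
```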

It is basically the thing that’s going to work on different jobs alongside the scheduler.

And together, those will take the incoming jobs, the incoming requests for work, and they will send them out to the correct workers. They basically work together to make sure everything ends up in the right location. Next, we have this thing called Etcd. Etcd is kind of a secret sauce component here.

Etcd is just a key-value store. It keeps track of all the different components inside Kubernetes: how they’re interacting, how they’re tagged, what they’re named, whatever those things are, so that you can reference them inside Kubernetes and other Kubernetes components can reference them. That’s actually really important, because what we’re going to do here today is basically make a chain of components that each reference the last component. Storage in Kubernetes, or persistence, the way we provide persistence, isn’t as simple as plugging in a USB stick somewhere. There’s a path we have to follow to get where we need to go. So Etcd stores all these relationships, tags, names, all the things we’re going to use to reference these different components. Next, we have kind of the star of the show in most cases in Kubernetes: pods. There can be many pods per worker, but there don’t have to be; it could just be one or two. Part of the advantage of Kubernetes is that we can divide things up via pods. Now, pods are the smallest unit of distribution. What that means is you’re never going to have a pod that spans multiple workers. A pod will live on a single worker and will function on that worker in physical space. So if we think about this from a storage perspective, let’s say we have a hard drive and this hard drive is directly connected to this worker. The worker is just a computer, so this is the hard drive that’s in the computer. If we want this pod to consume that hard drive, it can, while this pod over here would have trouble, because it would have no connection to that drive.
So this is important to keep in mind. We’re gonna look at a local PV example of how to do this, which is a local persistent volume, meaning we’re dealing with direct attached drives. We can get more complex and start adding in network drives, SANs, all sorts of things; we can take it very far. But for now, just to illustrate the simple concept, understand that we’re going to have hard drives that are attached to the workers. For the next component, we have the kubelet. The kubelet is essentially the Kubernetes process that runs on every node. When you look for a Kubernetes process on your node, you’d look for the kubelet.
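A local persistent volume for a drive attached to one worker might look roughly like the manifest below, built here as a Python dictionary and printed as JSON. The field names follow the Kubernetes `PersistentVolume` API; the volume name, node name, disk path, size, and storage class name are all assumptions made for illustration.

```python
import json

# Hypothetical values for illustration; a real cluster will differ.
NODE_NAME = "worker-1"
DISK_PATH = "/mnt/disks/ssd1"

local_pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": "local-pv-worker-1"},
    "spec": {
        "capacity": {"storage": "100Gi"},
        "accessModes": ["ReadWriteOnce"],
        "persistentVolumeReclaimPolicy": "Retain",
        "storageClassName": "local-storage",
        # The drive is directly attached, so the volume is a path on one node.
        "local": {"path": DISK_PATH},
        # Node affinity pins the volume to the worker that owns the drive.
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": [NODE_NAME],
                            }
                        ]
                    }
                ]
            }
        },
    },
}

manifest_json = json.dumps(local_pv, indent=2)
print(manifest_json)
```

The node affinity section is what encodes the point made in the talk: only pods running on that worker can reach the drive.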

Bart Farrell:

Eric, a quick question. One worker node can have multiple pods, right?

Eric Zietlow:

Absolutely, yes. And actually, depending on what your setup looks like, sometimes you’ll have many, many pods in a single node, it just depends. So when I’m just using the same example that Kunal used earlier, when you have a to-do app, say a really large scale to-do app, and you had a ton of people hitting it. And it’s a web based, you know, a web app, you’re gonna have a web UI, and you’re gonna need some sort of scaling to handle all the loads. So what you can do is say I have a small Kubernetes cluster, say three workers, one master. If I have three pods and each worker has one pod, but I need to scale this, what I do is I just deploy more pods, and then I could end up with, maybe nine pods on my three workers and it can scale out like that. So yeah, hopefully that answers the question. Pods are the smallest possible unit of deployment in one sense. We always hear about containers, containerization, Docker containers, Kubernetes containers and all the similar things right? Containers actually live inside of pods. So one pod can contain multiple containers. A pod that can actually contain one or many containers. Again, if you think about this, if you have containers in a pod, and you deploy them, they’re not going to end up on different workers, they’re going to end up on a single worker, because the pod is the smallest unit of compute distribution, smallest logical unit that you can deploy. Next, we’re going to take a look at the cAdvisor. So cAdvisor is basically local resource management for each of the workers. So when you have a worker, obviously, we need to understand how much compute is available, how much RAM and how much CPU, how many CPU cores are available on that worker. Now, when the job comes into the API server, and the Master says, “I need to put this job somewhere”, it doesn’t necessarily know all the information about what’s currently going on. 
But the cAdvisor is local to every single one of the workers and can basically report back how many resources are available, allowing the scheduler and the controller manager to make a smart decision and deploy that job where the load is lowest and the resources are available. So it is really important. And then the last component here is going to be the other kind of magic piece. I say this as someone who’s worked as a network engineer: in the past, when you wanted to connect two computers, or set up a kind of multi-microservice architecture, you’d have to do a ton of networking to get there. You’d have to physically plug in a bunch of switches, or if you had, say, an AWS cluster, you’d have to sit there in the console wiring the whole thing up, setting up your groups, and putting all the resources in the right places. Well, Kubernetes does something really cool. You remember how I said Etcd keeps track of everything? In Kubernetes, every single thing you interact with, create, or that’s just natively part of Kubernetes is named, and it’s referenceable via that name. Kube-proxy takes care of the networking component and abstracts away all of the actual IP addresses and all the actual routing, using Etcd. So now I can reference a service by name. For example, if I have a load balancer and I need to do something with it, I can use kubectl, reference that load balancer by name, and tell it to forward something somewhere. Or I can take a service and expose it to the outside world, and kube-proxy does all the heavy lifting for me. Really, really cool.
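The placement decision described above, where each worker reports its free resources and the job goes where capacity is available, can be sketched as a toy function. This is not the real Kubernetes scheduler, which filters and scores nodes in a more elaborate way; the `Worker` class, the `pick_worker` function, and the numbers are hypothetical.

```python
from dataclasses import dataclass

# Toy model only; not the actual Kubernetes scheduling algorithm.

@dataclass
class Worker:
    name: str
    free_cpu_cores: float
    free_memory_gib: float


def pick_worker(workers, cpu_request, memory_request):
    """Return the fitting worker with the most free CPU, or None if none fits."""
    candidates = [
        w for w in workers
        if w.free_cpu_cores >= cpu_request and w.free_memory_gib >= memory_request
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda w: (w.free_cpu_cores, w.free_memory_gib))


workers = [
    Worker("worker-1", free_cpu_cores=0.5, free_memory_gib=2.0),
    Worker("worker-2", free_cpu_cores=3.0, free_memory_gib=8.0),
    Worker("worker-3", free_cpu_cores=1.5, free_memory_gib=16.0),
]

chosen = pick_worker(workers, cpu_request=1.0, memory_request=4.0)
too_big = pick_worker(workers, cpu_request=8.0, memory_request=4.0)
print(chosen.name if chosen else "unschedulable")
print(too_big.name if too_big else "unschedulable")
```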

So next up, we are going to take a look at containers. We talked a little bit about stateless vs stateful. What’s the problem with this setup if I were running a stateful workload? Can anyone tell me? For example, let’s say I’m running a database, and I have this pod right here that is running a container, and that container has a database. What would be the problem with that situation?

So just like Kunal said, where if you had your to-do app, and you store a bunch of stuff in the to-do app, and suddenly you take that pod down or that container goes away for any reason whatsoever. Now you’ve lost your data, right? You don’t have the ability to reference anything that pod had because that pod doesn’t exist anymore. It’s ephemeral. As soon as it goes away, it’s gone. You could reapply the YAML used to create that pod, but that’s not gonna bring up the same pod. It’s gonna bring up a new instance of it, a brand new pod, basically from scratch, zero hour kind of just blank slate if you will. So what we need to do is to create persistence, for statefulness. So the first thing we need to do is what’s called a stateful set. Now, stateful sets are the key component of defining something as stateful. Basically, every time it is deployed with a stateful set, it gives it a unique identifier, and that pod doesn’t go away, just because you took it down, that unique identifier is actually stored along with a bunch of state information. And then when you bring that pod back up, it actually brings up that same exact pod again. So instead of just bringing up a new image off the same template, we actually bring up the same pod. But this still has a problem, because we can’t just store something in a stateful set and hope it magically persists, right, we actually have to have a place to store that. Now what I was talking about earlier, we have hard drives that are attached to workers, right, you need to run your OS from somewhere, you need to actually have direct attached storage somewhere in this stack, unless you’re in the cloud. But even with services like AWS, GCP, Azure, you can get direct attached drives on your instances, and actually, in many cases it is a very good idea to do so. So we have a stateful set that’s taking care of pod persistence, but it doesn’t take care of data persistence, data persistence needs to be handled a little differently. 
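The stable identity a stateful set provides is visible in its naming: pods are named `<set-name>-<ordinal>`, and claims created from a volume claim template are named `<template-name>-<pod-name>`. The sketch below just computes those names; the helper `stateful_identities` and the set name `postgres` are hypothetical.

```python
# Sketch of StatefulSet naming; stateful_identities is a hypothetical helper.

def stateful_identities(set_name, replicas, claim_template="data"):
    """Return (pod name, claim name) pairs for each ordinal of a stateful set."""
    return [
        (f"{set_name}-{ordinal}", f"{claim_template}-{set_name}-{ordinal}")
        for ordinal in range(replicas)
    ]


first_rollout = stateful_identities("postgres", 3)
# After the pods are deleted and recreated, the same names are produced,
# which is how a replacement pod finds its old claim again.
after_restart = stateful_identities("postgres", 3)

for pod_name, claim_name in first_rollout:
    print(pod_name, "->", claim_name)
```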
So the first thing we’re going to have to do to access the physical storage on any of our machines is create what’s called a storage class. A storage class is a tool that Kubernetes defines that essentially allows you to say, “Here’s where my storage is.” Storage classes are a Kubernetes object, so when you run kubectl get sc, that gives you a list of your storage classes, all the ones you’ve defined. Now, one of the complicated things, and this is why OpenEBS, Rancher Longhorn, and some other solutions exist, is that defining your storage in a non-rigid way is very difficult. Either I have to know exactly what drive is connected and wire it up directly to the pod that I want it attached to, next to its container, or I need to have a mechanism for doing that. OpenEBS and Longhorn are both tools for doing that automatically. But we’re going to simplify it here. I can talk about OpenEBS a little bit in depth at the end of this if people are curious. It’s an open source project that I worked on for a while at MayaData, which is the main contributor to it, so I’m pretty familiar with it. But there are a number of nuances here. If you don’t have a tool like Longhorn or OpenEBS, you’re going to have to define everything manually. And as you can imagine, that’s a pretty brittle way of doing things. But it does get the job done in a pinch. So we create a storage class, and in that storage class we reference our actual resources themselves. Next, we’re going to need a persistent volume claim. The storage class basically just creates a resource where these persistent volumes can live. What happens when we have a pod is that the pod attaches to something called a persistent volume claim, or a PVC.
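For the manual, direct-attached case described here, a storage class can be quite small. The sketch below builds one as a Python dictionary and prints it as JSON. The field names follow the `storage.k8s.io/v1` `StorageClass` API, and `kubernetes.io/no-provisioner` is the provisioner value used for statically created local volumes; the class name `local-storage` is an assumption.

```python
import json

# "local-storage" is a hypothetical name chosen for illustration.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "local-storage"},
    # No dynamic provisioning: volumes are defined by hand.
    "provisioner": "kubernetes.io/no-provisioner",
    # Delay binding until a pod is scheduled, so the claim binds to a
    # volume on the node where the pod will actually run.
    "volumeBindingMode": "WaitForFirstConsumer",
}

storage_class_json = json.dumps(storage_class, indent=2)
print(storage_class_json)
```

Saved to a file, a manifest like this could be applied with kubectl apply and would then appear in the output of kubectl get sc.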

Okay, a persistent volume claim is like a lease, and the lease itself is actually a Kubernetes object. So the pod is going to say, “I am connected to a persistent volume claim”; generally, this is defined in the YAML. And the persistent volume claim is gonna say, “I am a claim on a storage resource, a storage resource that’s defined by my storage class.” And that storage class is going to basically hand off a persistent volume.
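The two statements above, the pod pointing at a claim and the claim pointing at a storage class, map onto two manifests. The sketch below builds them as Python dictionaries and prints them as JSON. Field names follow the Kubernetes `PersistentVolumeClaim` and `Pod` APIs; the object names, the `postgres:13` image, the requested size, and the mount path are assumptions for illustration.

```python
import json

CLAIM_NAME = "todo-db-data"  # hypothetical claim name

# The claim: "I am a claim on storage defined by my storage class."
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": CLAIM_NAME},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "local-storage",
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

# The pod: "I am connected to a persistent volume claim."
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "todo-db"},
    "spec": {
        "containers": [
            {
                "name": "db",
                "image": "postgres:13",
                "volumeMounts": [
                    {"name": "data", "mountPath": "/var/lib/postgresql/data"}
                ],
            }
        ],
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": CLAIM_NAME}}
        ],
    },
}

for manifest in (pvc, pod):
    print(json.dumps(manifest, indent=2))
```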

So, that persistent volume is the Kubernetes object that’s basically wrapping your physical device. Let’s say the DB right there is going to connect to a persistent volume claim, which is going to connect to a persistent volume. And that persistent volume is going to be this hard drive right here from this storage class. Now, there are ways to create pools of devices and then consume things from those pools. That’s one of the things that OpenEBS does: it will actually create a pool of devices, and then it has this whole leasing system where you lease these devices, return them to the pools, and do a lot of other things. But again, that’s maybe a little further than we’re going to get to here today. Actually, we might have time for some OpenEBS magic. But what we’ve done here is we’ve now provided a stateful container. Remember, we have this stateful set up here that uniquely identifies it, and we have the persistent volume and volume claim that are associated with that. Those components are stored within the stateful set as information: the path to the physical device that’s going to store our info is now recorded in the stateful set, so we can get back to it. So now I can take that pod, I could kill that pod, and as long as I can bring that pod up in the same place, or a place that has access to that persistent volume, I can get my data back. That’s really huge. So say I’m running a database, say Postgres, and I’ve got a single node of Postgres and my app relies on it, and something happens and that pod goes down. I can bring that pod back up, it reconnects that entire chain, the persistent volume claim and the persistent volume, and I am now back online. Now, you might notice I said a key thing there: the pod needs to have access. In just standard Kubernetes, there’s very little to prevent my pod here from maybe spinning up over here.
And if it spun up over there, and this drive is directly connected to this worker over here, that worker might not have access to it, right? There’d be no actual path to get there. So that’s where things like operators come in, which are basically a custom control loop sort of situation, and there are also a lot of people who use Helm charts, to make sure these things come up in the places where they used to be. That’s getting much further down the road than we’re probably going to get today. But for now, understand: if you bring up a pod and you set up your persistent volume claim attached to a persistent volume that’s attached to a storage class that has a physical device behind it, then you can actually achieve persistence in a Kubernetes cluster. Alright, Bart, did you have any questions here?
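The whole chain, pod to claim to volume to physical device, and the access problem that comes with direct-attached drives, can be modeled with plain dictionaries. This is a conceptual sketch rather than anything Kubernetes runs; every name and path in it is hypothetical.

```python
# Conceptual model of the reference chain; all names are hypothetical.

pods = {"db-0": {"claim": "data-db-0"}}
claims = {"data-db-0": {"volume": "local-pv-worker-1", "storage_class": "local-storage"}}
volumes = {"local-pv-worker-1": {"node": "worker-1", "path": "/mnt/disks/ssd1"}}


def trace_chain(pod_name):
    """Follow pod -> claim -> volume -> storage class -> physical device."""
    claim_name = pods[pod_name]["claim"]
    claim = claims[claim_name]
    volume = volumes[claim["volume"]]
    device = f"{volume['node']}:{volume['path']}"
    return [pod_name, claim_name, claim["volume"], claim["storage_class"], device]


def can_reach_data(pod_name, scheduled_node):
    """A direct-attached drive is reachable only from the node it is plugged into."""
    volume_name = claims[pods[pod_name]["claim"]]["volume"]
    return volumes[volume_name]["node"] == scheduled_node


chain = trace_chain("db-0")
print(" -> ".join(chain))
print(can_reach_data("db-0", "worker-1"))
print(can_reach_data("db-0", "worker-2"))
```

The last check is the case the talk warns about: the same pod brought up on a different worker has no path to the drive.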

Bart Farrell:

No. So far, so good. And also folks are helping each other on the chat, which is also really nice. So keep up the great work.

Eric Zietlow:

Well, if you want, I can show you what OpenEBS would do in this sort of architecture, and use it to illustrate.

Bart Farrell:

We do have one question: If we delete a pod, doesn’t it keep spinning up again and again until we delete its deployment? So does the data keep getting refreshed again and again?

Eric Zietlow:

That’s a really good catch. So in the situation where you’re running a stateful set, generally, you’re running with an operator. Now, an operator is basically a custom control loop. What that means is you’re actually custom-defining the behavior of those pods. So you’re entirely right: if we didn’t do anything, and we were just using the Kubernetes control loop and letting it go on, then yeah, the container would go down and we wouldn’t have state. Let’s just be clear on that: we wouldn’t have any state stored. That pod would just basically bounce, right? And that’s not what we want. The fault tolerance of Kubernetes here is actually our enemy. What we want to do is define a behavior for when the pod goes down. Maybe we have it spin back up automatically and make sure it’s the same container via a stateful set. Or maybe we don’t want it to just come back up; maybe we want to do some of this manually. It depends, but that’s where custom control loops come in, and that’s actually a key component to this. But yes, good question.
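At its core, a control loop compares the desired state with what is actually running and works out what to change. The sketch below shows one such comparison step. It is a simplified illustration, not an operator implementation; the `reconcile` function and the pod names are hypothetical.

```python
# One reconcile step of a toy control loop; names are hypothetical.

def reconcile(desired, running):
    """Compare desired and running pod names and return the actions to take."""
    return {
        "start": sorted(desired - running),
        "stop": sorted(running - desired),
    }


desired_pods = {"db-0", "db-1", "db-2"}
running_pods = {"db-0", "db-2"}  # db-1 has gone down

actions = reconcile(desired_pods, running_pods)
print(actions)

# A custom control loop could decide here whether to restart db-1
# automatically under the same identity or to wait for manual action.
running_pods |= set(actions["start"])
settled = reconcile(desired_pods, running_pods)
print(settled)
```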

So snapshots, backups, all of those things are really good to do. And I will say this no matter what: even if you’re running a distributed database with redundancy, do backups. The reason is quite simple. The drive here is still a physical piece of hardware, and it can still fail. That drive could literally just spin off the platters one day. It happens, and it actually happens a lot. So even though I have this entire chain, all this containerization, all this cool persistence that Kubernetes provides, or could provide depending on how we’ve configured it, I will still lose my data if that physical drive goes away. So do backups, create snapshots, do whatever you have to do to make sure that your data is safe. Backups are always a good habit. Absolutely. I can’t even think of an instance where I would tell you not to do backups. When I was working with distributed databases, that was a discussion that kept going around: “Well, you have all these copies of the data. Why would you need to do backups when you have these active systems that never go down? You can just bootstrap in a new node and never have downtime.” Well, what happens when someone goes in and manually deletes something, like a truncate table or a drop table? You’ll always need backups. There’s either human error or there are physical hardware issues that will arise at some point during the lifecycle of an application. Always do backups.

Let me go ahead here and I will show you guys what OpenEBS does. Now, OpenEBS is a tool to create pools of storage volumes. Here is another question: What about consistency in the database? Will that somehow get impacted? We can talk about Cassandra at some point. Cassandra is kind of like the king when it comes to distributed databases.

Consistency is actually a kind of a talk all of its own. It’s actually really cool. The short answer is no, there are actually anti-entropy tools that will keep things consistent. It’s something called strong consistency as opposed to absolute consistency, but it’s more than adequate to handle any concerns. Next question: Why is there a PVC and a PV? How does using a PVC help? So the PVC is the claim on the volume. The PV is the Kubernetes object that is the volume. Think about it this way: you walk into a shop and say to the shopkeeper, “I want one of those candy bars over there.” The persistent volume claim is the transaction that happens: you giving them money in exchange for that object, and then carrying that object out in your grocery bag, is the claim on the object. The persistent volume is the candy bar. So basically, the pod has to lay a claim that is referenceable and checkable. You can actually do a describe on a PVC and follow the path of where that claim runs. Think about it like a contract between the pod and the actual volume itself.

So, in OpenEBS, we’re actually extending the master. “Is that how great companies have their servers running all the time, so that as one crashes another one automatically deploys?” So if you ever are in a situation where you need 100% uptime, there are some methodologies and some thinking that you can employ to achieve that. One of the ways is the way Cassandra did it; that was actually one of the problems it was trying to solve. And one of the things you can do in Kubernetes with applications is create active-active applications. So if you think about it in the context of Kubernetes, we have many workers, and these workers could be incredibly distributed. We could even run them across multiple providers if we really wanted to get crazy here. So say the worst happened and maybe AWS or Azure or someone just crashed, a catastrophic failure, kind of an apocalyptic event. If one of our other providers was online, we would still have reachable instances.

Say our web front end, if what we’re running is some sort of a web app, can have instances running in the other provider, and a load balancer that would know that the instances over here are down and those instances are up, and it would route you there. So it’s active-active: you’re actually able to interact with any and all instances at the same time, meaning that there’s no downtime. Now, a key thing to point out: you have to set it up that way. In Kubernetes, you have to actually use the load balancer and have multiple instances of your app. If you just have a single instance of whatever it is you’re running, say a web UI with a single pod running it, and that went down, there’d be downtime, because it takes time for that to redeploy. Even if a Kubernetes control loop goes and automatically redeploys it because it’s unhealthy, where the probe basically says, “That’s unhealthy, redeploy,” it takes time for that whole process to happen. So if you really want persistence, you’re going to want to deploy multiple copies of whatever this stateless application is. Now, when we’re working with stateful workloads, it’s a little different, because we actually have data, so we can’t just nuke something and spin it back up. This is where Cassandra comes in. There are some other databases that do similar-ish things, but I’m going to talk about Cassandra because it’s what I know best. Basically, you have active nodes that handle consistency via quorum protocols. They get a 51% consensus agreement across a cluster before they actually consider an operation successful. So we’re using that to maintain correctness of your data. But you also get the advantage of being able to lose entire portions of your database, or access to entire portions of your database, and still be online. And you can actually replace individual nodes without ever going down.
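
The majority rule behind that "51%" figure can be sketched in a few lines. This is an illustrative model only, not Cassandra's implementation, and the function names are hypothetical. It assumes the usual definition of a quorum as a strict majority of replicas, that is, the replication factor divided by two, rounded down, plus one.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor: int) -> int:
    """How many replicas can be unreachable while a quorum is still possible."""
    return replication_factor - quorum(replication_factor)


def operation_succeeds(acks: int, replication_factor: int) -> bool:
    """An operation counts as successful once a quorum has acknowledged it."""
    return acks >= quorum(replication_factor)


for rf in (1, 3, 5):
    print(f"replicas={rf} quorum={quorum(rf)} can_lose={tolerable_failures(rf)}")
```

With three replicas, two acknowledgements are enough, so one replica can be down while reads and writes continue. That is the property that allows a node to be replaced without taking the database offline.
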
So it’s a similar concept, but the stateful versus stateless problem presents a different set of challenges to each. Let’s see. “Can you please explain persistent volume claims again, in brief? Having a bit of confusion.” Yeah, no worries. So a persistent volume claim, and I’ll use this as an example: if you buy a house, you end up having a purchase agreement and a title to that house. The purchase agreement and the title aren’t the physical house, they’re a record of ownership of that house. So you, in this scenario, would be the pod. You’re the thing that needs the house to live in. So the pod, in this case the person, goes out and purchases the physical object. The physical object, the house, is the persistent volume, and the deed, the title to that house, is the claim that says they own the house. So if someone wants to check who owns this particular piece of storage, we can check the claims and see which one I own. Or if you want to see what house I own, just go check my PVC, my persistent volume claim, and you can see which particular object, which particular volume, I have ownership of. Does that make sense? A particular point is write access to a physical hard drive. We’ve got a couple more minutes.

Bart Farrell:

I think keep going. If there are questions that we don’t get to, we can continue the conversation on Slack. So folks, keep asking your questions and we’ll hang on to them. And like I said, if we need to continue the conversation there, we can.

Eric Zietlow:

So again, looking at OpenEBS. This is the OpenEBS control plane, and it has a couple of different components. It has an operator, which is basically a custom control loop. It specifies a behavior different from the Kubernetes standard behavior of an object. In this particular case, it’s providing persistence for things we don’t want to go away, things where, if we were to lose them, we wouldn’t be able to bring up that same object again. We have the Maya API server. The Maya API server extends the Kubernetes API server, allowing us to run kubectl commands that are specific to OpenEBS. So you can actually create OpenEBS resources using kubectl. That’s really important from an ease of use standpoint. Next we have the LocalPV Provisioner, and the LocalPV Provisioner is what actually goes out and provisions these persistent volumes. It takes that physical device and it creates this pool of Kubernetes resources. Because if you think about it, Kubernetes doesn’t really care if it’s a hard drive, if it’s an SSD, if it’s a SAN. It doesn’t know, it doesn’t care; it looks at this pool of resources. So the LocalPV Provisioner is what actually does that work for you. On the node itself, we have this other key component, and we’re going to attach that to the kubelet. That’s called NDM, or Node Device Manager. Just like cAdvisor goes in and pools compute resources on that particular worker, NDM maintains the local pool of storage resources on that particular cluster, and works with the local provisioner to create and maintain the pool of devices to be leased. So that covers all of this stuff we had to do down here, instead of us referencing something directly in our storage class.
So instead of the hard drive being directly referenced, we just define our storage class as OpenEBS, and OpenEBS does all of that work for us and maintains that pool. When we have a persistent volume claim, we just reference that storage class, which points it to OpenEBS, which points it to that pool of drives, and all of the leasing and releasing of drives happens from there automatically. “Can you replace etcd with Redis?” I don’t think you can. I’m not 100% on that. I don’t know why you would. First off, etcd is really lightweight. It’s really good at what it does. It’s a very simple, durable key-value store. I understand where you’re going with this: Redis is a key-value store, so can you plug and play some of these components? I don’t think there would be any good reason ever, at least nothing I can think of. That doesn’t mean I’ve thought of every eventuality, but I probably wouldn’t try it unless I had a really good reason. And I’ve never heard of anyone doing it before. So yeah. Okay, does that make sense, though, on OpenEBS, where it basically handles all of that drive discovery, all that pool management, that resource management for you, so you don’t have to direct wire these things? It makes it a little more durable. Because, again, we would have to directly reference the physical hardware from our storage classes if we didn’t do it with something like OpenEBS. Okay, hopefully that was helpful. All right. Well, with that, I think that’s about all I’ve got for you guys today. So Bart, I’m gonna hand it back to you, and we can go from there.
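
To make the storage class indirection from Eric’s walkthrough concrete, here is a minimal sketch, again as Python dictionaries printed as JSON. It is not the lab from the talk. The provisioner name `openebs.io/local` is an assumption based on the OpenEBS LocalPV provisioner and should be checked against the installed OpenEBS version; the names `demo-openebs-local` and `demo-data` and the 5Gi size are hypothetical.

```python
import json

# Assumption: "openebs.io/local" is the provisioner name registered by the
# OpenEBS LocalPV provisioner. Verify it against your installed version.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "demo-openebs-local"},  # hypothetical name
    "provisioner": "openebs.io/local",
    # Delay binding until a pod using the claim is scheduled onto a node.
    "volumeBindingMode": "WaitForFirstConsumer",
    "reclaimPolicy": "Delete",
}


def claim_for(storage_class_name: str, name: str, size: str) -> dict:
    """Build a claim that names only a storage class, never a device."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class_name,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical claim name and size.
claim = claim_for(storage_class["metadata"]["name"], "demo-data", "5Gi")

print(json.dumps(storage_class, indent=2))
print(json.dumps(claim, indent=2))
```

The point of the design is that neither object mentions a disk or a device path. The claim names the storage class, the storage class names the provisioner, and discovering and leasing the underlying drives is left to the provisioner.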

Bart Farrell:

Before we finish up, I think this is awesome, because a lot of folks in our community are, you know, struggling with this notion of: okay, I’ve maybe touched a little bit of Kubernetes, I’ve done some stuff with microservices, I understand the differences when we’re talking about different parts of architecture, maybe an intro to containers, some knowledge about Docker. But then we’re taking this jump to say, now we’re going to talk about data on Kubernetes, stateful workloads, things like that. That’s where things get a little bit blurry. So this is really helpful. Are there other things that you recommend? We can leave resources aside for now. But in terms of the different things we’ve been talking about (clusters, pods, and nodes; we’ve just mentioned etcd; we’ve also obviously been talking about PV and PVC), are there other things that folks should keep in mind as they’re beginning on this journey, things that were helpful for you, or questions that you find yourself being asked frequently on your side, Eric?

Eric Zietlow:

First thing I would throw out, and this is actually a resource thing: go to YouTube often. There’s a number of really good YouTube channels out there on pretty much everything I’ve talked about here. I believe there’s one called something along the lines of TechWorld with Nana. There’s NetworkChuck. There are a lot of these people who are good at taking a beginner-level approach to the content and explaining it on a really informative, helpful, practical level. You know, it’s really easy to talk about all this stuff in theoretical terms. But when you can get someone who actually sits down and takes a practical approach to how you would actually go about doing this, for me at least, that’s been one of the best resources I’ve found. The other thing, like I said, is the Kubernetes docs. And you know what, I’m actually gonna link a couple of docs and throw them in the chat here: one on StatefulSets, one on storage classes, and then persistent volumes, which also has the documentation for PVCs. So go ahead and check those out.

If you need the docs, those are the Kubernetes docs on each of these different components we talked about. Highly recommend you read Kubernetes documentation. I know documentation is sometimes very hard to consume. Kubernetes does a very good job at trying to keep it simple, keep it understandable and not use a ton of buzzwords.

Bart Farrell:

I would also mention, because you had mentioned Longhorn, that we had a session with Saiyam Pathak about that. So if you’re interested in knowing more about that specifically, you’ve got that ( link to the session ). The thing is, if you just type Eric’s name in on YouTube, you’re gonna find plenty of other talks that build on the concepts that were mentioned here today, with a bigger degree of complexity. I think he also makes a really good point, and we mention this a lot: reading the documentation will give you the vocabulary and will make you go research other things, so you get comfortable. And then, as he was also saying, I think the best way to be working on this is to try to put it into practice, which is why, as a community, we’re starting to build out projects. We’re inviting folks to get in on this so that we can directly apply the concepts that we’re talking about. We have the videos, we’ve got the tutorials, we have the people with expertise, so let’s take those concepts and actually put them into action. So anyway, you’ll see more things about that coming out on Slack. Any other questions from the audience before we finish today? Anything else you’d like to know, career-wise as well? You know, Eric, you’ve been in DevRel a lot. Is there anything else that you would give as general advice, as someone who’s worked a lot in developer relations?

Eric Zietlow:

Yeah, I’d say find something you enjoy doing, because then it’s not a job. That’s fairly generic advice, but I think it really holds true. I got into Developer Relations because I had a passion for a specific group of things, and that passion came out to the point where I was actually approached with this job opportunity. You know, a lot of people in DevRel are here because we’re passionate, we love what we do. We love technology, we think it’s cool, we actually get enthusiastic and fired up about these things. If you want to go the DevRel route, I would say find something that you can really, truly be enthusiastic about. And even if you want to just be an engineer, or something a little more traditional, the same holds true. You can do a much better job in life if you’re excited to wake up on Monday morning and get to go play with cool stuff. I will throw this out just in case anyone wants it. It’s just a resource I threw together; I used it in a previous talk. We had lab instances for that, but I just threw the link in chat. It covers setting up OpenEBS, kind of the A to Z: bringing up a three-node cluster, setting up an actual persistent volume, storing something on the persistent volume. It’s a pretty simple lab overall, but check it out. Feel free to use it or not, you know, it’s all up to you. And that’s sponsored by MayaData. That’s stuff that they’ve sponsored content-wise. So yeah, feel free to check it out.

Bart Farrell:

Very, very good stuff. We’ve got another question as well. People just asked: what exactly is DevRel? And this is a good one, too. Is it marketing? Is it not marketing? Is it human resources? Is it just technical advice? How do you define it?

Eric Zietlow:

Not marketing, no marketing. What I am is an engineer who gets to talk to other engineers and try to make them successful. I always make jokes about it, but it’s really true: I try not to even know how much the products I work with cost, because I just want to find the best solution regardless of what those prices are. I love open source. DevRel is all about taking the passion and enthusiasm for technology out to other people and helping them to be successful with it. It’s kind of a newer role. It’s only been around for probably about five years in its current form, maybe about eight in really any form whatsoever. It’s really just about helping people, building communities, interacting on technology, and teaching. A lot, a lot of teaching in there. Looks like we’ve got another question here. “Hi, Eric. You mentioned Cassandra is a database that’s really good for consistency requirements, and it’s NoSQL. What about SQL databases?” Ah, well, there are a lot of good, consistent SQL databases. But the key difference here is distributed. Distributed SQL databases, not so much. If we’re talking about actual active distribution in a database, you’re going to need to go with something NoSQL. And I would say you’re probably going to want to look at Cassandra as an option there. Yeah. Okay. “What’s the difference between DevRel, developer evangelist, and developer advocate?” Well, developer evangelist is marketing’s feeble attempt to become a developer advocate. Developer advocacy is DevRel. I consider myself a Developer Advocate under a DevRel umbrella. Yeah, okay, cool.

Bart Farrell:

Thanks, everyone, for attending. This is amazing. We’ll have the artwork coming out a little bit later, because our artist has another commitment today. But Eric is readily available in our Slack if you want to reach out, ask questions, and get some advice. I’m pretty sure we’re going to be asking you to come back, Eric, to give some kind of a part two, because this was really good. Oh, we’ve got one more follow-up question before we finish: what about MongoDB?

Eric Zietlow:

MongoDB is not necessarily distributed in the same way. It’s kind of a document store with sharding. That’s not the best way to describe it, but think about it more in those terms than as, say, a true distributed active-active database.

Bart Farrell:

Once again, thanks everyone for attending. This was very good, very interactive. Eric, you did an amazing job, as always. So that being said, we’ll finish up for this one. This is our second meetup of the week. We’ll have plenty of artwork coming in. You’ll see all that on our website. If you haven’t subscribed yet on YouTube, please do so, and check us out on Twitter. Lots of news is coming out soon about things related to KubeCon; we’ll be doing a co-located event there. Remember, we have all the instructions if you want to write a summary about the session we did today. Publish it on LinkedIn. Give Eric a shout-out, give DoK a shout-out. We’ll be choosing the best summary and getting that published on our blog. So if you want to get published on the DoK blog, feel free to check that out. It’s in our Slack. You can just DM me, and I’ll be happy to help you out with that. Eric, thanks a lot, man.

Absolutely. Take care & bye everyone!!
