
Building CI/CD Pipelines Using Kubernetes

According to Bob Ballantyne, traditional CI/CD pipelines are not well suited to native Kubernetes. Kubernetes brings many features the enterprise wants, but migrating from static virtual machines introduces a host of new challenges.

At Dell, within the Dell-ISG business unit, which supports over 11,000 engineers daily across various product types (SaaS, on-prem software, and embedded software), Bob was able to leverage Kubernetes to help scale their CI/CD pipelines. Kubernetes provided them with reliability and scaling, but the enterprise is bigger than just Kubernetes: their pipelines also include applications such as JIRA, GitHub, Confluence, Slack, and testing tools.

Their pipelines require persistence, security, traceability, and robustness, which go beyond RBAC and horizontal scaling. In this talk, Bob explores the approach Dell has taken to create a scalable, flexible, maintainable architecture that supports the enormous size of the Dell-ISG enterprise, and how they used Kubernetes to do it.

This talk was given by Bob Ballantyne, Distinguished Member of Technical Staff and DevOps Architect at Dell, as part of DoK Day at KubeCon NA 2021; watch it below. You can access the other talks here.

My name is Bob Ballantyne. I’m a Distinguished Member of Technical Staff at Dell Technologies. I work in the ISG division; I’ve been there 15 years or so, in DevOps for 10-plus years, and working with Kubernetes for maybe six years. So, a fair bit. The talk today is about some of the work we’re currently doing to take the ISG enterprise to the next level. It’s all about serving the developers. But the interesting thing from the perspective of these conferences is that we’re really using Kubernetes as the basis for all of this, and doing that has certain challenges, as we’ll get into. We chose it because of the reliability, the scalability, and those sorts of things. Jumping in, these are the sorts of items we’re going to cover today. It’s going to be a pipelines talk, not too technical, because there’s just too much material for that. But let’s get going.

So, the first thing is a slightly controversial statement to begin with, just to get you fired up: traditional CI/CD pipelines are not suitable for native Kubernetes. […] Take Jenkins: its controller is trouble as far as Kubernetes is concerned, because it’s sitting there all the time using resources, and most of the time it’s not doing much. So you’ve got to get it ready for Kubernetes, and with a controller like that, how do you control your pipelines? Other things, like GitHub Actions, don’t work quite as well either. You’ve got runners, which do their job well, but the combination isn’t plug-and-play. Travis CI, CircleCI: you do it yourself. They’re good for developers, but they don’t really let you provide enterprise-level control. What we do have on Kubernetes, though, is a great thing: namespaces. Within the enterprise, they make a huge difference, because we can have namespaces for our dev environments, for our staging environment, for our production environments. This is what we want; we want to keep all these environments separate, and as you’ll see, we’ve got other uses for them. Then within those, we’ve got the different orchestration engines that we can run. And that’s a problem, because within the enterprise we’re coming from an environment that was predominantly open-source Jenkins, with hundreds and hundreds of Jenkins pipelines defined. If we want to use Kubernetes efficiently, and we don’t want to use open-source Jenkins, what do we do? Well, we had to come up with the notion of a pipeline abstraction, and I’ll get into that in just a second. Basically, the pipeline abstraction allows developers to describe what they want to do using our DSL; it’s just a YAML file. Then we take that and translate it into whichever orchestration engine we want to use. That way we get out of the business of being tied to one tool, and we get out of the problems that come up when we want to try a different one. Enough about that. As I said, this talk is more about pipelines and how we deal with Kubernetes. The big question is: what defines a pipeline? To us, a pipeline is everything. If it’s automation, it’s a pipeline. Pipelines connect one thing to another, whether it’s a developer to the main branch, main to production, a developer to production, to customers, integration with third parties, DevOps to Kubernetes, DevOps to orchestration. Whenever anything has got to be deployed, it doesn’t matter where, it’s going to be a pipeline. And as I say: if it’s not a pipeline, and it’s manual, that means it’s broken.
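The talk doesn’t show what the DSL actually looks like. Purely as an illustration, a pipeline description in an abstraction layer like this might resemble the following sketch; every field name here (`pipeline`, `stages`, `pushArtifact`, and so on) is invented for the example and is not Dell’s actual schema.

```yaml
# Hypothetical pipeline-abstraction DSL (illustrative only; field names
# are invented, not Dell's actual schema). Developers describe intent;
# a translator renders it for Jenkins, Argo Workflows, GitHub Actions, etc.
pipeline:
  name: payments-service
  trigger:
    pullRequest:
      targetBranch: main          # run on PRs targeting main
  stages:
    - name: build
      image: golang:1.21          # container the stage runs in
      steps:
        - run: make build
    - name: unit-test
      image: golang:1.21
      steps:
        - run: make test
    - name: publish
      steps:
        - pushArtifact:           # e.g. to Artifactory
            repo: payments-docker-local
```

The value of a layer like this is that the same file can be rendered into a Jenkinsfile today and an Argo Workflow or GitHub Actions workflow tomorrow, without developers changing what they wrote.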

We’ve got to be using pipelines everywhere. I used to work in a refinery, and that was pipelines everywhere too. I used to think, what is this? Now I understand, because pipelines are everywhere. And the way that we use them, as you see on the screen, is to scale, because everything we do gets strapped to a GitHub pull request. That means that the GitHub branches represent our state.

05:15

And between those branches, we have pull requests. That effectively gives us the cookie-cutter mechanism you see over there; it forms a ladder. The only difference between these cookie-cutter representations is the types of tests we’re running or the types of functionality that’s there, so we can build up a framework to fit within that pull request. The pull request is a nice, easy place for developers to go. Not even just developers: anyone that’s got to authorize something, or review something, and make sure that everything’s okay for the next step in the pipeline. We also do the same with Artifactory, tying the two together, because these days Artifactory from JFrog provides a lot of this as well. So that’s another pipeline; we just tie it in the same way, and the functionality works together. So much for the generality of pipelines. Within our pipeline, as I said, everything is in the pipeline. Absolutely everything, because we want all the automation. It’s the only way an enterprise such as ours can scale to the size of 11,000 developers and 60 product units. So everything has to be connected, and it needs to be automated; if it’s not automated, it’s broken, or, the same thing, it’s manual, which means it’s broken. To do that, we’ve got a couple of important pieces besides the Kubernetes and the Kafka that you see here. We have the firehose pattern; that’s just a fancy name for how we use webhooks to get the data into Kafka. We’ve got the pipeline abstraction that we talked about before. And we use standard GitOps through Argo CD to take the results of GitHub pull requests and synchronize them with Kubernetes. These are the general pieces we’re using to create this cookie-cutter type of environment, and these are the tools we’re doing it with. We standardized on these tools; we started with 10 times as many and cut it down so that we could understand the problems and start making the connections between the tools. Once we get the standard connections, or standard pipelines, established, then we might think about introducing other tools. But for now the focus is all on getting the pipelines between all of these tools set up and running so that nobody has to do anything odd; no one has to think about it. I should say there’s still work to be done: pipelines to create pull requests, et cetera. As I said, it’s a massive project we’ve been on for a year-plus. And the priority that was handed down to us was to do no harm. That was of the utmost importance. Because think about it: Dell is a big company, and we have a lot of products. Fifty or sixty product units generate a lot of money; I have no idea how much, I just know it’s a lot. If our project, which affects all the developers, were to crash and burn, that would be terrible for Dell, and we cannot let that happen. It is just out of the question. We have got to keep the boat going forward; we’ve got to keep the car running as we’re making the changes. So no harm was number one. Number two was security; everyone’s aware of it. We can’t have ransomware attacks coming in and interfering with what we’re working on. That just goes without saying. Next, we come to the four DevOps-type principles that we were going after when we were designing the system.
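For concreteness, this is a minimal sketch of the kind of Argo CD `Application` that performs that Git-to-cluster synchronization. The repository URL, path, and namespaces are placeholders, not Dell’s actual configuration.

```yaml
# Minimal Argo CD Application (sketch; repo URL, path, and namespaces
# are placeholders). Argo CD keeps the cluster in sync with what has
# been merged to the deployment branch in Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-configs.git
    targetRevision: main          # the branch the pipeline promotes into
    path: payments-service/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-prod
  syncPolicy:
    automated:
      prune: true                 # remove resources deleted from Git
      selfHeal: true              # revert manual drift back to Git state
```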

 

09:37

And those were traceability, reliability, repeatability, and scalability. Some of those come with Kubernetes, and that’s why we chose Kubernetes. That was the big thing: the stability, the scalability, the ease with which you can stand up hot spares and all those sorts of things. We also went with Kafka. I mentioned tools like Active Directory, Artifactory, Jenkins, and so on; with the exception of GitHub, Kafka and all of these tools can be stood up on Kubernetes.

So now we’ve got an environment where, if one site fails, we can move everything over to another site. We don’t need downtime; we can keep the developers working just as they were before, and the data can migrate with the applications with minimal downtime. The data is stored with everything, so it all just keeps moving forward. The bulk of our storage is actually in our Kafka topics, because we use Kafka for an event-based system, and we’ll be leaning more on that in the future. The next goal was the primary directive from management: you’ve got to simplify everything for developers. This is the common mantra we hear about Kubernetes in general these days. It has to be simple for people to use; if it’s not, the developers will find another solution, and developers are among the best at doing that. So those were the main goals that were set out, and as a consequence of all of them, we found that we had to fight cultural change. Fortunately, we had top management with us, so we were able to carry these ideas forward; without that, it would have been an uphill battle. That cultural change is substantial in a company the size of Dell. We’re not talking about all of Dell Technologies, just ISG, but it is still substantial, because developers like to do things their own way. We managed to convince them by talking about traceability, using sources of truth, and using Kafka, because Kafka allows us to produce a reliable, scalable, event-based system where we can show them what’s going on. Having traceability back to the sources of truth was something the developers liked, because it lets them dig in and find the problems when they arise: not just in the DevOps pieces we’re responsible for, but also when they’re making code changes. For us, it allows us to debug everything in our pipelines, so this was a major aspect. As we move forward, we’re getting deeper into Kubernetes, which is what we’re hearing a lot about today, but we’re using it with Kafka, as I said. Kafka is fully Dockerized, so it’s much easier to run in Kubernetes, and we can use Kubernetes persistent storage for it. We get real-time updates, but the biggest thing here is that Kafka produces events, and Kubernetes is based on events as well.
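The talk doesn’t say how the Kafka clusters are stood up. One common way to run Kafka on Kubernetes, shown here only as an illustration, is an operator such as Strimzi; this minimal sketch assumes Strimzi’s `Kafka` custom resource, and the replica counts and storage sizes are placeholder values.

```yaml
# Illustrative only: one common way to run Kafka on Kubernetes is the
# Strimzi operator (not mentioned in the talk). Replica counts and
# storage sizes are placeholder values.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: pipeline-events
  namespace: kafka
spec:
  kafka:
    replicas: 3                   # brokers vote; 3 tolerates one failure
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim      # topics survive pod restarts
      size: 100Gi
      deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}             # manage topics declaratively
    userOperator: {}
```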

We have extended that beyond Kubernetes because of our use of Kafka. In the enterprise, we’ve got all this baggage that we’ve got to bring into Kubernetes. It’s not going to happen overnight; we’ve still got a lot of groups, more than half of them actually, that have yet to be brought in. They’re not going to come in easily; they’re going to take time, because a lot of them haven’t even Dockerized their builds or their tests. We’ve got to give them the tools to do that, and we’ve got to get them into a Kubernetes development environment. Some of the teams are actually using Kubernetes, but they’re running it from just about anywhere.

14:34

So all of this makes it difficult. Kubernetes and Kafka help solve it, because with Kafka we’ve got the events that mirror everything around us, and with Kubernetes we’ve got the object model that describes it. As I said, there we’ve got the namespaces. I mentioned the production, staging, and dev namespaces; we also have namespaces per PU (product unit), so we can limit the blast radius when something goes wrong. Supposing someone brought something in the wrong way, rather than through the company-mandated route, say on a laptop: we can contain that, because it’s going through Kubernetes, and we’ve limited the blast radius. We’ve also got all the checks on JFrog Artifactory on top of the limits, so the combination is just wonderful, and that’s why we separate things per product unit. We set up configuration in a similar fashion, and we’ve got umpteen layers of configuration: per PU, per environment, per application, you name it. Another of the important things we had to learn was how to deal with that configuration. As I mentioned earlier, at the tail end, when things land in the main branches, they get deployed on Kubernetes. We’re now using Argo CD, because we have a liking for GitOps. We use Argo CD just to synchronize Kubernetes with what we’ve got in Git. Again, back to the source of truth: we have everything in Git. So as we use our pull requests and our automation in the pipelines to move things through branches in GitHub, changes reach the final branch, where we deploy to production, and that’s what Argo CD is doing. The last item on the slide is the abstract pipelines, which I’ve mentioned, so I won’t go into that. Now, some of the challenges we’ve faced. As I said, onboarding 60-plus snowflakes has not been an easy task; we’re not even 50% of the way through it yet, and just that variation is a tremendous task. Getting visibility into the different processes of different groups has been eye-opening, both for them and for us; it’s something a lot of them didn’t have before. We’ve also been providing some of the common checks that they’ve been asking for but just hadn’t had the chance to add themselves, and we’ve been adding them through the pipelines running in Kubernetes. This is how we started; this is what we’re using to give all these different organizations a taste of it. As we’re doing this, we’re also architecting the whole system to be an event-based system, and we did that using a technique called Event Storming.
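To illustrate the blast-radius point, here is a minimal sketch of a per-product-unit namespace paired with a resource quota; the names and limits are placeholders, not Dell’s values.

```yaml
# Sketch of a per-product-unit namespace with a quota to bound blast
# radius (names and limits are placeholders, not Dell's actual values).
apiVersion: v1
kind: Namespace
metadata:
  name: pu-storage-dev            # one namespace per product unit per environment
  labels:
    product-unit: storage
    environment: dev
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pu-storage-dev-quota
  namespace: pu-storage-dev
spec:
  hard:
    requests.cpu: "50"            # a runaway pipeline can't starve other PUs
    requests.memory: 200Gi
    limits.cpu: "100"
    limits.memory: 400Gi
    pods: "500"
```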

17:58

For that I’ve got to thank one of my leaders, who introduced us to Event Storming. It’s been a godsend, because it allows us to draw pictures of how we want things to progress through Kubernetes, how we want our pipelines to be, where the functionality lives, and how we want to break things down into microservices. It’s the best thing since sliced bread.

So, the takeaways from this talk. As I said, we went with Kubernetes for scaling. We wanted Kafka because it’s an event-based system with storage and scalability, and it’s highly available because of its clustering and the way votes are taken between the different brokers, or ZooKeepers, with the events and messages handled the same way. It provides a communication mechanism, so instead of all these point-to-point communications that we hate because they’re forever breaking, Kafka increases the reliability of the entire system. The next thing we did was wrap everything in GitHub pull requests. That’s what gives us our cookie-cutter approach, the quality ladders, and the mechanism where people can review changes and let them through slowly until they build up confidence. There are also pockets of manual steps in the pipelines where we do certain jobs by hand, but at some point we get rid of those manual parts. It’s just the normal flow for the organizations: they don’t trust the automation at first, so they want to put in the manual blocks, and we let them, until they get the confidence. Then we can take out the manual steps so the pipeline just flows, like water through a normal pipe. This gives us the stability, and the traceability, in GitHub. The pipeline abstraction, as I mentioned before, enables us to transition from this massive number of Jenkins pipelines, scripted pipelines, freestyle jobs, and all of that. The abstraction lets developers onboard and describe what they want to do, and then the DevOps side can choose which parts of the pipeline are implemented in Argo Workflows, GitHub Actions, Jenkins, et cetera. Initially we’ll stick with one tool, but it gives us the freedom. And lastly, I think I’ve said it over and over: pipelines are everywhere. They really are, and that’s the biggest takeaway from this discussion.

And with that, I’d like to thank you all very much for tuning in and listening to my talk. Feel free to contact me on LinkedIn or on the Slack channel for the KubeCon event.