Scheduled Scaling with Dask and Argo workflows

Feb 15, 2022 by melissa

Complex computational workloads in Python are a common sight these days, especially in the context of processing large and complex datasets. Battle-hardened modules such as Numpy, Pandas, and Scikit-Learn can perform low-level tasks, while tools like Dask make it easy to parallelize these workloads across distributed computational environments.

In this talk, Severin Ryberg who is currently responsible for architecting and maintaining the Cloud and data strategy at ACCURE Battery Intelligence shares how Argo Workflows offers a Kubernetes-native solution to provisioning cloud resources in Kubernetes. Being Kubernetes-native, Argo Workflows also meshes nicely with other Kubernetes tools. This talk discusses the combination of these two worlds by showcasing a set-up for Argo-managed workflows which schedule and automatically scale-out Dask-powered data pipelines in Python.

Bart Farrell 00:00

I want to turn it over to our very accomplished guest, Severin Ryberg, who is no stranger to data pipelines. And today, we’re going to be hearing about Dask as well as Argo. We’ve heard a little bit about Argo before and its wonders. But today, we’re gonna be hearing a little bit more but before we get started, tell us about yourself?

Severin Ryberg 02:15

I was on an academic track, originally in the direction of physics and electrical engineering. Both of these degrees were at Texas Tech University, in Texas, US. And since then, I got into the direction of renewable energy analysis and first set up the National Renewable Energy Lab sometime in 2015. At the same time, I was also an adjunct professor at the Colorado School of Mines, in both of these places, basically doing C++ development, a developer/intern, and a professor. So it was quite a weird time in my life to be both of those things at once. Nevertheless, fun. After that, I wasn’t so done portraying myself academically wise. So I thought, why not go to Germany and do a PhD research in renewable energy integrations across Europe, also energy system analysis. And I was maintaining the infrastructure stack for the research group, roughly a group of about 70 or so people. That transitioned into a postdoc position, of course, where I wasn’t just maintaining the infrastructure, but architecting it because the research group was growing. But then I’m happy to say that since mid-2020, I have been the infrastructure architect at Accure battery intelligence, which brings us to today so that’s my roadmap.

Severin Ryberg 05:04
Accure is a battery intelligence company founded in mid-2020. So we’re still quite young. We’re closing in on about 50 employees, and we’re hiring in case you’re interested in, rather interested in batteries and data and all these different types of things. And basically, what we want to do is monitor batteries, compute their health, make sure things are safe, optimize, these types of topics, very heuristic algorithms, we have to write a little bit of ML, that sort of direction. Our USP is that, basically, not only coming from the background in this topic, we also are 100% software-driven. We want to be able to solve everything with data, without having to put any, any hardware in the battery, so our ultimate goal is to monitor every battery on the planet. And, our USP is we’re born in the cloud, so everything we do is in AWS or some other Kubernetes cluster, you will not find a workstation sitting in one of our closets. We have laptops, and that’s it. Some of us even work completely remotely. And the primary tools that we use are Python, AWS, and of course, Kubernetes. That’s why we’re here today. My particular role at Accure is I was the sole infrastructure developer for quite a long time, the infrastructure team has grown now to a size of about 10. So I’m doing a little bit more managerial work now. But I’ve touched everything under the sun in terms of infrastructure as a career in the last two years. So, that includes data engineering, traditional development, DevOps, we use GitLab, CI, and cloud engineering, as I mentioned in AWS, as well as Kubernetes, making sure our computation pipelines are schedulable and scalable. And that is it for my introduction.

The talk that we’re going to go into today is scheduled scaling with Dask and Argo workflows. And the goals for this presentation are to understand why Argo plus Dask was a good choice, for us specifically. And I then want to provide a rough overview of the infrastructure setup that we’ve used. And, to kind of give you all some perspective there. But then I want to deliver a little bit more into depth with a very basic example with Argo workflows and Dask, going over those things that I’ve set up for you today.

What problems were needing to be solved, what led us down this path? And even before Accure was founded, we knew several operational requirements that we knew that we would need to develop our solution around. So we knew on the one hand that we wanted to stay in Python, mostly because this aligned with our developers’ skill sets. And not only that, but some of the people on our team had worked with other tools in the past, notably Spark, and they weren’t so happy with it, let’s put it that way. And, that does not just spark itself, but also PySpark, specifically, we were not so interested in because of how it flips between Python and the JVM, related to PySpark, that we wanted to go beyond simple black-box types that are rarely needed. We noted that we needed to go beyond simple MapReduce type functions. And we needed to be able to scale any generic black-box type thing. In a lot of different environments, or other contexts. We also knew that we wanted to have a shared development experience across all of our development environments, everything from testing, to production to staging. We want these to be the code that a developer writes and tests in Sandbox, to be the same code that ends up traveling up to production. Once the test passes all the tests and these types of things. We wanted to avoid a situation where something has to be written specifically for the production, but then a developer can’t run that same thing locally, without some very complex setup. Something of a shared experience there. We also knew that we wanted to do all kinds of parallel processing, both simple batch processing, as well as shared parallel processing. And we also wanted to promote self-service for our data engineers. And by this, I mean, our data engineers, people who focus on battery algorithms, they know batteries, they’re pretty good at Python, with NumPy, with Pandas, but, they might not be the best with distributed computing, and they almost certainly aren’t going to know things like cloud computing, or Kubernetes. And instead of forcing everyone to become a generalist, we wanted to use a service or a tool that allows these data engineers to service themselves as much as possible, without having to get into all that depth.

Bart Farrell 11:00

Can I just stop you there really quickly as someone who has a background in education, and looking at things like, that as challenging as Kubernetes, when you’re approaching a group of engineers, and you need to get them onboarded, and you need to understand, what they know what they don’t know, what they’ve learned, what they need to unlearn? How do you sort of develop if we want to call it an internal curriculum to make that happen?

Severin Ryberg 11:22

That’s a hard one, honestly. Because everyone’s coming in with their own goals with what they’re wanting to learn. To be honest, I would say as much as possible. I’ve found a lot of success trying to make the tool match the developer. So that’s a topic I’m going to talk about in this presentation today. Where we took Dask, we took Argo, and we kind of added Accure specific layer on top of that, and then basically accomplished this task right here was this like shared development experience, and then basically, allowing the developer to work in the thing that they know, best, which is in this case, Python, Pandas, not having to think about all these other topics, and we just say, hey, make me a function, throw a decorator on that. You don’t have to know what that decorator is doing behind the hood. But as long as you have that decorator, you’re scalable. And then we found a lot of success with that, we have people who come in with very minimal experience. And, they can run very large pipelines fairly easily. So hopefully that answers your question.

Bart Farrell 12:35

No, that’s good. And the thing is, I don’t think there’s one, there’s a single answer to that question that is just curious what your approach is, in particular having been in academia as well.

Severin Ryberg 12:44

Sure, there are infrastructural and security requirements, and I think in the interest of time, I may try to go through a little bit faster. But basically, we knew that we wanted to have ultra-low latency parallelisation, we’re gonna have hundreds of 1000s of batteries to monitor and each battery have dozens of individual cells, maybe even hundreds of cells, which are each producing their time-series data, and they all need to be analyzed. If we have any sort of latency, in our parallelisation infrastructure at all, those seconds that it takes to spin up a Kubernetes pod, are going to be killer. So we knew that we needed to avoid that. We also wanted to have a multi-tenant environment, because we’re going to have multiple customers. And when I say customers here, I mean, the big OEM battery providers, not that Mom and Pop might have a battery in their garage, but rather, the company that Mom and Pop bought their battery from. We’re gonna have a few very big profile customers, we want to make sure that they don’t have any chance of conflicting with each other, rather, affecting each other. So, they should be as much as possible completely isolated, hence, multi-tenancy. And we want cost-efficient computations, automation in our deployments version, controlling high data throughput, especially in this case, because we have a lot of time-series data that needs to be reanalyzed every day when new data comes in, and then the other typical things that you would expect, secure access, dependable scheduling, logging and archiving. So you might ask yourself, why not just use this service for our pipeline scaling actually, this was the first thing that we thought to do ourselves. So we looked at things like Apache Airflow and services that also provide Apache Airflow and we weren’t super happy with the learning curve there note that this was also before Airflow 2.0 which means it also didn’t have the best Kubernetes support and, if we want to have a patchy airflow, we still need to maintain our Kubernetes cluster. So it’s not like we even avoid that complexity, we would still need to do that. There’s also a nice service called Prefect, for quite a while we were really happy with. It was the first one we were working with rather than the first one that we had signed on with. But at the beginning, they were kind of also an early-stage startup that was going through their growing pains. It ended up affecting us and note that this was in early January so that probably has changed at this time. Nevertheless, we had already decided to move on. And more importantly, they changed their cost model, which makes a lot of sense, probably for a lot of customers. But for us specifically, it drastically changed our price point. So it no longer was an economic choice for us. And then we also tried some AWS specific tools like Batch and Glue. But we found these to be not so flexible. And if we could, we prefer to stay cloud-agnostic. So in the end, there’s only one option, and that’s to fight that tape. Do it yourself. Alright, so let’s get into Kubernetes, Argo, and Dask and maybe even hopefully have a little bit of fun. So these are the requirements that I just listed, that we knew that we needed, and of course, we know that we want to go to Kubernetes with that, even though we directly get multi-tenancy and everything else that comes along with Kubernetes. So we start with that, we’re using AWS EKS here, and presumably, you’d be able to use any Kubernetes cluster to do the same thing. We then slap a K8s autoscaler in there. And that’s what gives us our cost-effective scaling. So whenever we add some pods, the scalar scales the cluster up and down. And that works beautifully. So not much to think about there. We then used Argo workflows, So Argo workflows, for those of you who aren’t aware, Bernie mentioned that actually, it’s been talked about several times already, and in the DoK Community, but just a quick overview. Here’s an excerpt directly from the Argo homepage. And in short, Argo Workflows is a Kubernetes, native workflow engine, where you define each of your nodes. And here’s an example of a DAG over here, you define each of your nodes, these little checkboxes as containers, and then you string those containers in a directed acyclic graph, which you can see here. And I have shamelessly stolen this image from ArgoCon because I liked it quite a bit. Here’s a workflow that teaches you how to make a cake. And you can imagine that each of these is, maybe this one’s written in Python, this one’s written in Go, this one’s written in something else. You can do a lot of things with Argo workflows, it’s great. Plus, it’s Kubernetes native. So everything nice about Kubernetes is also nice about Argo workflows. And, of course, you can pass information in between tasks. As you might expect from a workflow engine. So back to the stack. So once we’ve added our Argo, we get some black-box end-users generic, like I just mentioned, you can put any node, rather any image inside of any one of these nodes. We also get dependable scheduling, which is super nice. But Argo workflows by themselves do not solve our requirements. Mostly because, spinning up the pods always takes some seconds, which may be for you is going to be fine. But for us, spinning up 100,000 pods for one pipeline is going to take hours. And that’s just not something that we can do. So, that one is kind of broken if we only rely on Argo workflows, also, self-service and the shared development experience are both broken. Because a normal developer, someone who doesn’t know Kubernetes, and doesn’t know Cloud would have to learn kubectl and these types of things to interact with Argo. So we wanted to avoid that. I’ll come back to those broken points in a bit. But we also use Prometheus and ELK for our logging, our ELK sits outside of our Kubernetes cluster, because we found it to be a bit more stable. Here we just use the open stack service from AWS as opposed to, managing it ourselves, so it’s also easier and we have PostgreSQL which then will archive all of our Argo workflows, so we know what ran when how long did it take, we can go through we can crawl history’s it’s a nice connection there. And of course, Postgres can also be our data, or rather, can provide some of our data, but not at the high throughput that we’re thinking about here. Because again, we can potentially have 1000s of workers all working at the same time, and you can’t effectively have 1000s of workers all talking to the same PostgreSQL, pushing and pulling data all at once. So, we add S3 in there with a data lake type approach. And that gives us our data throughputs. And to make this all version-controlled, of course, infrastructure as code is your friend, we chose to use Pulumi for that because it can manage all of these things all at once. And we develop everything in GitLab. Right, I think I mentioned that one previously. And then finally sitting on top of all of these individual tools, is Dask. So here’s another side about Dask. So many of you are probably new to Dask, So what is Dask? It’s used for high throughput data pipelines in Python. It’s a parallelization engine in Python, which has been around for quite a while. And it’s quite nice that it splits the concept of pipelining, into data collections, a task graph, and schedulers. And basically, you can mix and match these things together. And in fact, you only have to think about the collections, and maybe the scheduler as well. But Dask always manages the task graph on your behalf, what’s nice about that is, you can use Dask collections, like Dask arrays and Dask data frames, which are super similar to NumPy, and Pandas respectively. Or you can use the interface of the future for kind of normal batch processing type workloads. And you can build up a task graph, basically, in the same way, that you would program any other Python script, so it almost looks indistinguishable. And ask, we’ll put that into a task graph for you. And then it’ll run that task graph on whatever scheduler is available, be it a single machine that runs, single-threaded, or multi-process machine, or even a distributed setting, like a Kubernetes cluster, where you have multiple pods, which are all individual workers. So tasks can span all of these different domains. And what’s nice about that is the developers on our team only have to think about the collections layer. So it’s as close to pure data analytics Python as you can get. And, we only have to think about the scheduler side of things more and less on the infrastructure side, Dask expands that gap for us. And Dask is nice, because it’s useful also for any kind of parallelized workload, so embarrassingly parallel, totally fine. MapReduce type workloads similar to Hadoop and Spark are also possible. And, really any other full task scheduling thing you can imagine, be it messy, or otherwise, you can also do quite easily with Dask. So Dask comes out of the box with all these nice features, easy to use, especially for people who are coming from the data side of Python, I think they would be very familiar with the desk interface. And if you do have trouble with Dask it has been around long enough that it has a lot of services available, you can get Dask as a service from people like Coiled for one is one deck here and there’s also Saturn Cloud, and maybe even also a few others. So if you need help with Dask, you can also find it. As mentioned, Accure did have to implement some of our things on top of the Dask to fit it to our particular use case. So things like work avoidance, and task artifacting. And logging we had to do. Those things are kind of out of the scope of the talk today. So I’ll skip over them for now, back to the stack. So once we’ve added Dask, we then have our batch and shared parallel processing is checked. Our black-box generic function is double-checked here because we both have it checked, in my opinion from the Argo workflows perspective, where we can, we can string together any, any number of like Kubernetes pods or containers that we might want to, but even within even across many pods, we can scale generic Python functions, as well. So in my opinion it’s very strong, generic and also very strong scalability. And Dask also gives us our ultra-low latency, which occurred knowing that we’re going to need quite a bit. And we are still missing our self-service, which we get from these tasks and service companies. As I mentioned, we use Coiled. And to finally make this shared development experience, so that the developers can just think about one collections layer, we have our extra utilities. And with that, all of our requirements are satisfied. And we can start building out our data pipelines. So that is the stack that we use. So someone might ask, How can we combine Dask and Argo workflows? So we’ll walk through this briefly at a high level. So Argo defines Custom Resource Definitions and it defines, among other things, cluster workflow templates, and workflow templates, where workflow templates will define a DAG something like, it’s too far away. But remember that DAG that I showed previously, this one can also be considered a very simple DAG workflow template will define a DAG, and a cluster workflow template does it at the cluster level, whereas a workflow template does it within a namespace. So the idea here is to make a standard Dask scaling workflow template and to put that at the cluster level, and make that as abstract as possible, something that applies to any pipeline that one could imagine writing in Dask. So the idea here is, we have a cluster workflow template. And as long as we specify an Image Repository that we want to use, where there’s an image in there that has a Dask pipeline, and we specify things like pod resources and pipeline settings, then this becomes something super general. And it’s also fairly simple as well, I’ll go and go through it in more detail later on. But basically, the only thing that’s happening here is, the workflow template would first initialize the workflow, that’s going to, of course, get computed, it will then build out a Dask scheduler, and where the Dask scheduler is doing the networking, and it’s sending tasks around between the workers, it’s talking to the client over here to understand like what tasks need to be done. It also will do things like serving a dashboard that you can go to, to watch your pipelines and a few other functionalities as well. So you build out your Dask scheduler, you then have to pass the desk scheduler’s IP address to the primary pipeline, this is just a container that runs the python script, which knows to look out for this Dask scheduler and to start submitting jobs to it. And it also has to pass the IP address to Dask workers and the Dask workers would come in as a deployment. And it would spin up however many records that you want. And those records will be alive as long as the pipeline is getting computed. They’ll just be performing tasks that the pipeline needs to be performed. And then at the end of the pipeline, or rather, at the end of the workflow, once the pipeline has finished, there’s a teardown process, which makes sure to clean up whatever workers may still be online, other things that were created along the way. That is our standard scaler. And we can, we can tailor that to a specific pipeline, so for example, we want to compute, or rather we want to do the ELT pipeline for customer ABC, for instance. This would be a specific pipeline configuration around the general one. And here we just need to specify maybe which DAG we’re interested in. So we know exactly which image is being scaled out here. And we want to say how many workers we should use and whatever special pipeline arguments we might want to use. And as long as we specify those that can be passed into here and then. Everything is working well. And as mentioned, each of these nodes in the Argo workflows is a container, at least these orange ones are and in this case, these are all just a container which is built from these are all the same container which is built from the same image. This is just a simple Python 3.8.8 cluster image with Dask and distributed and whatever other Python packages you need, cooked into the image, and then doing things like activating environments and doing whatever else you might imagine needing to do in your Docker image for your particular pipeline. Nevertheless, it’s just an image that has the Dask scheduler, desk worker and the specific pipeline you want to run available on the path so that they can be accessed from Argo workflows. So then I want to talk a bit about today’s demo. Today’s example is going to be very simple, it’s just going to scale out a basic batch processing workload. And in fact, kind of one that doesn’t make sense to do because of how simple it is. Nevertheless, it shows off some of the core features that I’m trying to showcase here. And given my personal history in renewable energy simulation, specifically in Europe, I thought it would be fun to look at time-series weather data in Spain. And specifically, what we’re going to do is look at five different cities in Spain, Madrid, Valencia, Barcelona, Seville and Bilbao.

Bart Farrell 31:31

Don’t know why most people don’t, but that’s good. It’s nice to be included.

Severin Ryberg 31:41

I want to ask the age-old question, which is the windiest city? We all know Spain is a windy place, but which is the windiest? So what we will do today is I have an image and in that image is some simple weather data, time-series weather data. And I’ve written a script here, or rather a function in Python, which will look at that weather data, you just have to specify which city you’re interested in, it will look into that weather data, And at a specific timestamp, it will find which of those cities is the windiest. And then it’s going to do that one by one for each city for all timestamps. Each has independent tasks. So there you see, it’s me, of course, there are way more efficient ways to do this. But, each of these independent tasks will all happen on the Dask workers. And then that will get concatenated back into the pipeline worker, where we’ll receive a count of how often each city was the windiest, and then we’ll get a report for that at the very end. So that’s our pipeline. And you can find this code, if you’re interested in it you can find it here, and you’ll notice that my name is not here, that’s actually because this demo was created with the help of the pipe-kit team for another talk that we gave together. So shout out to the pipe-kit team. And that’s another startup that Akira is working with. They’ve helped us install Argo workflows and do some other troubleshooting. You can find them here. Nevertheless, you can see our code here. As I mentioned, this is going to contain our Dask pipeline. And, helping you get set up in Argo, and also installing everything and also showing the Argo workflow templates that we’ll use. So one more example of things that I’ve ordered. Alright, so the demo stack that we’ll use today, comparable to the full stack that we use on a cure, is of course going to be much simpler. So what we are going to do is just use Kubernetes Argo workflows, and then just, the Dask pipeline on top of that, and that checks most of the boxes for these requirements that I mentioned before, but of course, not all of them. And I think that’s okay, for a simple demo should be fine. Alright, I also want to talk a little bit about what will happen inside of the Kubernetes cluster. So we have a fairly simple setup, where we have the Argo server installed in the default namespace next to that is also the Argo Workflow Controller. And we have an Argo controller config map that’s used to control how Argo behaves. We then have this cluster workflow template standard one that I mentioned over here, this one in blue, the generic one. So that will happen. Also in the standard namespace or rather the default namespace, it’s actually at the cluster level. So it’s not in any namespace. And we have some reusable cluster roles that allow actions to happen inside of specific namespaces. And then within a customer-specific namespace, we have a workflow template, which will scale out the example pipeline that I just described, by using the standard scaler. And we have a cron workflow template, which will put that on a schedule so that it runs, for example, every Thursday at 2 pm. And, other customers specific configurations, such as IAM credentials, and API secrets, these types of things would also be in here as well. We will use a cluster-wide for Argo workflows installation, there are several different ways to install Argo, we will use cluster-wide because that makes things easier. And like I mentioned, we will have customer-specific namespaces. We will have a very interestingly named customer today. Very creative. And with that, let’s go live. Here we have my code. And you see me doing some testing, making sure things are working.

Severin Ryberg 36:53

This is the code that you will find at the Git repository that I mentioned previously. You see in here, the Argo workflows installation. And this is a very basic installation. It’s only lightly adapted from the Quickstart installation from Argo workflows themselves. And I think I’ve added in some specific things here for creating a namespace section, you see that the top yet, so creating a namespace, and a few other things that are specific to this demo. But for the most part, it’s just a basic Argo workflows installation. So I won’t go through that in detail. But if you were doing this for yourself, you should just be able to kubectl apply this. And actually, I’m cheating because it’s something you’ll notice already. But you will notice that there’s not so much going on to get pods. So it’s an Argo server and an Argo workflow controller. And normally these will come online in like, a minute or two. So they’re quite fast. So let’s talk about this standard Dask scaler template. I won’t go through this in explicit detail. This is an Argo workflow, templates a YAML file. And you’ll notice here that it’s just the YAML. It’s quite similar to other Kubernetes resources, you have metadata and spec and these types of things. And it does come in as a custom resource definition. So that’s where we get kind of close to the workflow template. And the template itself is defining some number of templates, individual actions that need to happen within this workflow, and then you give it an entry point. And the templates themselves can of course accept some parameters, some of those parameters having defaults and other ones being required. And then a step can itself have several sub-steps.

So here we are generating a Dask scheduler, like I mentioned, passing into input from above. And after that, we would scale out our workflow and here you see we pass on the IP address. So that’s a simple thing to do. Really nice with Argo workflows, in my opinion, and then inside the scaling, I just want to show off this real quick inside of the scaling templates, we use the template called Dask scalar. And this is another template, which has these parameters, which must come in, and we then create our primary pipeline, making sure to pass on the IP address, and also our desk deployment workers. And the primary pipeline is here. And this is a template, which is just a container. So this is where real computation is going to start happening. It’s going to be a container, it’s going to run this image that comes in as an argument. And it’s just running a Python script. And the Dask worker scalar, which this one here, is creating a resource. And you see here that it’s creating a deployment. And that is just a normal Kubernetes deployment definition with parameterization, super intuitive to use if you have a background with Kubernetes. And, you can get running with it right away. And then there are a few others cleaning this up, there are a few other templates in here to do some cleanup afterwards, as I mentioned, and that’s it for the standard scaler. Let’s talk a little bit about the specific workflow template for this pipeline that I want to scale out. So here is a workflow template, which only has one step in it. And that’s just a wrapping of the standard scaler, so it just basically references that scalar passes in some arguments here and that it doesn’t need to do anything else. Like, it defines some input arguments that we know, we always want to be true for this particular pipeline, it should always come from this image, for example, it should always find the Python scripts that we want to run inside of this folder within the image. We leave a few other arguments, like the number of workers and other arguments, we still allow those to come in, at runtime. That is fine. And there’s another nice thing, we get some semaphore synchronization. We can say this pipeline can only run once in parallel. And that can help us from, not overriding our database, if this was an ETL pipeline, and making sure that it doesn’t overwhelm like a customer’s API or something like that. And then if we want to schedule this, we just have to create these cron workflow templates, or rather cron workflow, where we just reference the Windy City template, this one up here. And we just say, please run for us. I think this is every Sunday at midnight, 13 minutes, if I haven’t forgotten my cron notation or every Monday, maybe I always have to Google it. So that is Argo. Let’s talk a little bit about the Dask pipeline itself. So here, now we’re in Python territory. And you see that we’re importing. We’re importing Pandas, we’re importing NumPy and distributed, which comes from Dask themselves, just to give ourselves to the client. And that is it. So there’s nothing else that we need to make this run on the Python side. And here is our function for getting the Windy City. As I said, it just takes a date. And it’s then going to loop through all of our cities, it’s going to read a parquet file, extract the date at that time step. And it’s just going to say, Hey, when the city is x city, at this time step, again, nothing super crazy happening there. And very inefficient. So don’t make your desk pipelines like this. There are much better ways to do this. If you’re trying to do exactly this task, the important thing to mention here is that this is completely like you could put anything in here. So this is just a Python script, there’s no limitation to what could happen inside of this file. So long as you don’t overwhelm the memory constraints of the worker that’s hosting it. And then our pipeline is going to I’m not going to go through it in detail. Again, happy to answer questions, later on, if someone does have specific questions, we just have to run our get windy city function, and submit it to our desktop client. And we do that one time for each timestamp. And then we just gather all those things. And Dask takes care of all the networking behind us or behind the scenes for that, this is something that I could run directly right now on my computer, without thinking about Kubernetes clusters or parallelization or anything like that, I could run it completely and scan it in, in serial, I could get into a normal debugger, completely in Python the whole way, even down into the tasks. So it’s really developer friendly, I would say. And then when I do want to scale it out, Dask handles that for me. And then here’s a quick entry point, which allows me to execute this from the command line. So I come back over here to my primary pipeline in the Argo workflow you see here, this is where it’s just running my Python script right there. And then that’s where we end-users get the connection into Python. And that is it. So I should be able to see kubectl get pods minus the end customer. So the customer is the namespace that I’m using, as I mentioned, and there should be nothing in there. And also Argo, so we have no workflows going. If I wanted to submit this pipeline, I’ve installed Argo workflows, but I haven’t added these templates. So let me do that real quick. So I will just apply them to real quick kubectl, task standard. So I’ll just apply the standard workflow templates, and also the customer-specific workflow templates. And whenever you want to do it, now we should be able to see the customer’s workflow. You can see that I now have my workflow template inside of Windy City. Or rather, I have the Windy City template inside of my customer namespace. And I can also see what is currently scheduled. And you can see that my Windy City template is scheduled to run on that cron schedule that I mentioned previously. And the next run will be in five days. So some nice access to the current state there. And if I want to run that, again, I would just need to say, Argo. Understanding customers’ and submitting. So the notation here is I want to submit inside of the customer namespace, this template. And I want to override these arguments. And as soon as I run that, you see that the Dask scheduler has already come online. And it is a daemon template. So because of that it stays running the whole time. Whereas the deployment workers just make the deployment so that’s why it has a checkmark there. And if I want to check out what’s going on inside of that, I can do kubectl. I need to get this pod. So I mentioned that the Dask has a dashboard, where you can check out customers and do ports forward and paste that pod. I want to forward Port 8787 because that’s the port that Dask uses for its dashboard. And hopefully, if I haven’t missed it. We are in business, you see that the pipeline is running. And it’s still adding these tasks. So I didn’t describe this notation that I was using over here. Which of course you can’t see anymore. But you’ll notice that I said timesteps equals 10,000. That means I’m saving submitting 10,000 individual jobs. So that’s why this number here is still climbing.

Severin Ryberg 51:23

So you see that it’s scaling out here. And we have, you’ll notice. I didn’t show this before. But when we started this, we should have seen that there should have been only one online node. And now we should have more than one, kubectl wants to look like this one is finished.

Severin Ryberg 53:10

What’s nice about this setup is when the pipeline started, there was only one node online. But the pipeline was running, you’ll notice here that we have more nodes coming online. And you’ll see here that three of those nodes are much younger than those two of those nodes, which were online before. And that’s quite nice, because, the pipeline, the whole cluster scales up as the pipeline needs. So that’s what I was trying to showcase there, you get a dashboard. If you get scaling. It’s all well and good. So that’s it for the demo, short and sweet. And I could go into more detail again if someone was interested. But I think that’s enough for the time being. So let’s just have some closing remarks. So the question is, when would you want to use Argo workflows, plus Dask? And when would you not want to? So I’ve found that this tends to work well when you’re working in a Python heavy team. And specifically when you’re wanting to deal with scheduled workflows. So when you have scheduled automated workflows, written in Python that needs to scale out, maybe they use Dask, or maybe not, maybe use an open MPI or something like that. Also, I find that I find that the Argo plus Dask combination is a really good one. And we’re quite happy with it. But for the periods where developers are in the loop like we have some dev who’s testing out a new feature or a new algorithm. These like Dask as a service tools such as I mentioned before, Coiled or Saturn cloud, tend to fit well because the developers can operate completely outside of the Kubernetes space. And they just say, Hey coiled, give me a cluster, I want 100 workers inside of it, that might be a little bit more expensive than it would be in the Kubernetes setup. But it also works fine. It’s also important to mention that Dask is not a silver bullet, we have experienced some instabilities, especially for long-running pipelines. So we found that pipelines that take hours are generally fine. Whereas pipelines that take days can sometimes have issues. You also can do batch processing, like I showed today with a Dask, but it’s not the main focus of Dask that’s focused on the shared payload processing type use cases, with tasks, data frames and Dask arrays. Also, it’s important to mention that this Dask is an active project. So it’s also evolving itself. And I also want to mention that there are several Dask alternatives that you could consider using if you’re interested in this, but if the Dask isn’t the right fit. So one example would be Couler, which kind of allows you to create, develop, or develop Argo workflows. But to do so directly in Python. This wasn’t the right choice for us because of this latency issue, but might be the right choice for someone else. And there’s also a package called Ray, which is quite similar to Dask. In many regards, it also allows for low-level parallelization in Python. I don’t have a specific or significant personal experience with it myself. But I’ve heard that it’s also more stable than Dask, but maybe less featured. So there is a bit of a trade-off there. And with that, we are done. Hopefully, there are enough questions in the air and time in the end for questions.

Bart Farrell 56:58

I’m sure we’re good. But I think also, you did address it because you were talking a lot about the positive sides and the benefits that Dask is providing in your particular case. But as you said in the very beginning, you try to match the tool to the developer, not the other way around. So be sensitive, that based on people’s level of experience, and what they do well, some things are going to work better than others. We’re talking about as you’ve seen, sort of this, this progress over time, we’re talking about infrastructure and looking at Kubernetes, one of the fundamental questions that come up in our community is how Kubernetes was not originally designed for data in mind, which is why our community exists, stateful workloads, etc. What do you think are the biggest challenges right now that are preventing organizations such as Accure where you’re working, end-users from adopting this strategy of running data on Kubernetes? Perhaps in the case of Argo workflows?

Severin Ryberg 57:51

It’s very new. And, I would be lying if I was saying that it was like all rainbows and unicorns, using this approach on our side, we have had our fair share of headaches with it. And it’s a very manual solution. So it requires a lot of Kubernetes knowledge, a lot of Dask knowledge, a lot of distributed computing knowledge and cloud knowledge, to pull it all together. And that might not be a setup that everyone has and not just that, but also, as I mentioned, Dask wasn’t originally intended for this purpose.

Severin Ryberg 58:36

And some of these other tools, like Ray, might even be a better fit there. And like I said, I haven’t done that myself yet. But because of that, I could imagine that being a bit of a liability, because it’s not a proven technology like Spark or Hadoop or something like that. But I do find it to be quite an effective one. So I would say that to answer your question, in my opinion, one of the main things that are holding people back in this regard is probably just breadth of experience because end-users end-users like all of these things are possible when you have a foot and all of these different buckets. But when you only know Kubernetes, or you only know Python, or you only know x tool, you might not necessarily think to put them together.

Bart Farrell 59:33

Got it, it’s a very good point, like the logic might not necessarily be there. And going back to a really good point that you mentioned that hasn’t necessarily been addressed in some of our live streams, although I’m sure a lot of people would agree that looking for proven technologies in an area that’s so innovative, there’s going to be scarcity, like some of this stuff just hasn’t been tested well enough to go to an innovation manager or a CTO and say, “Oh, don’t worry, this is a track record of five years or in the case of Spark and Hadoop”, we’re talking now beyond 10 are getting close to 10 years. So that level of being battle-tested I think is going to be providing some doubts there. But once again, that’s why we have the community to get practitioners together to be talking about this, in terms of the benefits that you’ve seen provided to your organization because choosing the strategy will be some of the key things that you would say have stood out for you that could perhaps serve as an inspiration for other organizations that are considering taking that route.

Severin Ryberg 1:00:28

I would say the two biggest strengths that I would like to highlight here are the genericness of this setup. I mentioned that before, Argo workflows are already super generic because you can just string these containers together. And for people who prefer to stay in the Python world, and you really can’t get more generic than what Dask allows you to add, like a super low level, it provides this high-level interface that allows for a lot of things, and that includes ML type applications, ELT applications, or other things that need parallel processing. And the other real strength of this, in my opinion, is, bringing the scalability to the developer who doesn’t need to understand anything about Kubernetes, and doesn’t need to understand so much about things like distributed computing, and these types of things. And it allows people to specialize, which is very important in a curious case.

Bart Farrell 1:01:48

That’s a good point. And I liked what you said and there’s now going to be a documentary coming out about Kubernetes. And there’s been plenty of conversations about this, about how maybe not everybody needs to know absolutely everything. So like you said, the importance of specialization, a lot of this stuff can be in the background, behind the seats under the hood, etc. For those who need to know, they will know. And for others, it’ll just be something they won’t necessarily be interacting with directly. So I think those are very good points. That being said, if people want to find out where you are, how can they find you on LinkedIn, Twitter? What’s the best way to do it?

Severin Ryberg 1:02:23

Good question. I meant to put that on my last slide. I can go ahead and write them down. And go back to my screen.

Severin Ryberg 1:02:56

My real first name is so my GitHub is going to be sevberg. And LinkedIn in case you’re interested, here’s the twist. David Severin Ryberg. So, my first name is David, but I go by my middle name separately.

Bart Farrell 1:03:42

There you got the email, GitHub, LinkedIn as well. Also the DoK community in Slack. So I’m sure for folks that are watching this if you want to take the conversation a bit further, feel free to jump in. As I’ve said, we’ve had other talks around this, we’ve never had to talk about this match up with Dask. And taking a look at the way that you did. Also very nice to see from an end-user perspective, that’s nice to see that too, also seeing your background in looking at renewable energies. That’s probably a conversation we could take further. But I appreciate having you here.

Severin Ryberg 1:04:23

Absolutely. Alright, perfect.

Bart Farrell 1:04:27

So, we have a tradition in our community that while our speaker is giving their talk, we have an amazing graphic recording artist who’s behind the scenes, creating a depiction in real-time of the things that are being mentioned. Unfortunately, we’re not able to get conclusive evidence about the Windy City in Spain, so we’ll leave.

Severin Ryberg 1:04:45

Oh, I forgot that part.

Bart Farrell 1:04:46

I would put substantial money into it. I like to go running and you can run in it with the wind blowing towards you on the way out and the wind blowing towards you on the way back here. We also have a time when planes land in the airport here. That’s always a bit of an interesting experience because the airport’s put in like a wind tunnel in the valley. So the wind comes in quite strong, but I’d be curious to know how the other Spanish cities rank. So that’s a conversation we can continue later on. But anyway, Severin, thank you very much for your time today. Extremely informative. We’ll want to have you back as soon as you can expect to hear from me and wish you nothing but the best.

Severin Ryberg 1:05:28

Alright, thank you so much.

Data on Kubernetes Day Europe 2024 talks are now available for streaming!