Metaflow is an open source machine learning platform, originally built and broadly battle-tested at Netflix to help with ML, analytics, and other data-intensive use cases.
It was initially built to rely heavily on AWS, but it has since been extended to run large-scale data-crunching workloads on Kubernetes. In this talk, Oleg shares the surprises and challenges the team ran into on their journey to a consistent, friendly Kubernetes experience for data scientists.
This talk was given by Outerbounds co-founder Oleg Avdeev as part of DoK Day at KubeCon NA 2021; watch it below. You can access the other talks here.
Bart Farrell: Oleg is going to be taking us in a slightly different direction telling us about Metaflow. We’re starting to move into bringing machine learning onto Kubernetes. So Oleg whenever you’re ready, we can bring you on. Hey, what’s up, man? How are you doing?
Oleg Avdeev: Hello. Hi, guys. How are you guys?
Bart Farrell: Good. We can see you perfectly, we can hear you perfectly. It’s all yours, man. Take it away.
Oleg Avdeev: Amazing. I’m glad to be here. I’ll talk a little bit about our project called Metaflow and how we brought it to Kubernetes. But first, let me spend a few minutes on what Metaflow is before I jump into all the juicy Kubernetes stuff. It is an open-source machine learning infrastructure framework. It was developed at Netflix a few years ago and became open-source in 2019, so it’s coming up on two years now. It was built to help data scientists, machine learning practitioners, and machine learning engineers at Netflix ship their models and whatever else they develop to production. The way we look at it, there is a continuum of things that people in those roles do; it’s all kind of mixed up now in this growing machine learning and AI area. People who work on machine learning projects care the most about statistics, building new and exciting deep learning networks, feature engineering, and deploying their models to production. They care a little bit less about other, equally important infrastructure that isn’t exactly in their job description: data warehouses, provisioning compute resources, scheduling those jobs, and orchestrating them. Even though they’re usually excited to learn more about all this, it starts to take 90% of their time, maybe because the company’s infrastructure is not ready for this next level of ML/AI evolution. Metaflow is essentially a Python library that was built to make life easier for data scientists: to make it easier for them to package their code and ship it to production. As you may know, Netflix is a pretty big company. I wasn’t at Netflix myself; I’m currently at a company called Outerbounds, which was started to bring Metaflow to the outside world and make it work for more people, not just Netflix. My co-founders and I have been working on Metaflow for years (they are the ones who worked at Netflix), and they built it to cater to the Netflix infrastructure. As you probably know, Netflix is pretty AWS (Amazon Web Services) heavy, and they have internal container orchestration platforms that are not Kubernetes. There’s a little bit more to that story.
A quick example of what Metaflow looks like in code. Metaflow itself doesn’t actually care about machine learning that much; it’s a fancy label these days, but it’s really a library for executing pieces of your Python code. You break your workflow down into steps, and those steps are automatically converted to containers that run somewhere. At this point the data scientists don’t care that much about infrastructure; they just care that their code is executed on time and that they get some compute resources (GPUs, CPUs, memory) to run it. We like to think that Metaflow is very easy to learn: people have to learn about containers, sure, but not necessarily about Kubernetes or, say, the 17 containerization solutions that AWS offers these days. It all happens automatically. They just create a class with some methods, those methods magically get converted into containers, all the dependencies get packaged, and all the artifacts that the steps produce get tracked. All of this gets stored in a data store that is abstracted away from data scientists, but not too much.
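To make that concrete, here is a minimal sketch of such a flow. The flow name, step bodies, and resource numbers are hypothetical stand-ins, but FlowSpec, @step, @resources, and self.next are standard Metaflow constructs.

```python
from metaflow import FlowSpec, step, resources


class TrainingFlow(FlowSpec):
    """A hypothetical flow: each @step becomes a separately scheduled task."""

    @step
    def start(self):
        # Plain Python; anything assigned to self.* is tracked as an artifact
        # and handed to the next step automatically.
        self.rows = list(range(1000))
        self.next(self.train)

    @resources(cpu=4, memory=16000)  # ask for compute without naming a backend
    @step
    def train(self):
        self.model = sum(self.rows)  # stand-in for real training code
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)


if __name__ == "__main__":
    TrainingFlow()
```

Running `python training_flow.py run` executes the steps locally; the same file can later be pushed to remote compute without rewriting it, which is the point of the decorator-based approach.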
Another thing that we like about Metaflow: as you probably know, there are a bunch of similar projects that also claim to make machine learning and data scientists’ lives easier, but what we care about with Metaflow is the user experience for those data scientists. It’s very Pythonic and very natural; they don’t have to learn a lot of new concepts, and it’s very easy for them to go from a prototype that runs on their laptop to running some parts of it in the cloud, maybe storing data not locally but in some object store. As they iterate, they can run more and more of their workflow remotely, on AWS or on a Kubernetes cluster, and then graduate the whole thing into a production workflow without anything running on their laptop, with everything built according to the DevOps team’s best practices. So we think of Metaflow as a Python productivity layer on top of whatever infrastructure you may have. Like I said, Netflix has this system called Titus, which I think is open-source, an orchestration solution similar to Kubernetes. In open source, for a long time, Metaflow ran pretty much only on AWS-native services, specifically AWS Batch and Step Functions; if you know a little bit about that world, they’re kind of the AWS equivalents of many things you use day to day in other clouds or on Kubernetes.
Now the exciting thing for us is that over the last three or four months, we have been hard at work bringing all of that to Kubernetes and making Metaflow Kubernetes-native, so you don’t have to be locked into AWS so much and data scientists can use Kubernetes efficiently without necessarily learning all of its concepts. Because as you probably know, there is still a good number of new concepts you have to learn to use Kubernetes effectively, and it’s not always the case that data scientists have the resources or bandwidth to do this. So we’re trying to smooth this out.
For Kubernetes, what we do is this: those steps that I showed you on the previous slide are converted to Kubernetes jobs that run on the cluster. We orchestrate everything with Argo if you want to ship your workflow to production. If you don’t, you can just run parts of your code on Kubernetes easily by using those nifty Python decorators and be done with it.
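As a rough sketch of what this looks like for the user, here is the earlier flow with a single step pushed to the cluster. The decorator and command names below come from Metaflow’s open-source Kubernetes support and are meant only as an illustration; exact names and options may differ slightly from what existed at the time of this talk.

```python
from metaflow import FlowSpec, step, kubernetes


class TrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Only the decorator changes: this step now runs as a Kubernetes job.
    @kubernetes(cpu=4, memory=16000)
    @step
    def train(self):
        self.model = "trained"  # stand-in for real training code
        self.next(self.end)

    @step
    def end(self):
        print(self.model)


if __name__ == "__main__":
    TrainingFlow()

# Alternatively, run every step of an unmodified flow on the cluster:
#   python training_flow.py run --with kubernetes
# ...or deploy the whole flow to Argo Workflows for production scheduling:
#   python training_flow.py argo-workflows create
```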
This, hopefully, will become our main recommended solution for people running those data-intensive workloads: just use Kubernetes and stop caring about anything else, which brings all the benefits you know about when it comes to configuration management, the multi-cloud story, and security. In this talk, I want to talk a little bit about our journey and how we built this, because while I have used Kubernetes for a few years now, I cannot call myself a frontiersman, so to speak; I still feel I’m new to the whole ecosystem. There were a few challenges and a few things that we learned about Kubernetes, and some of them were very positive surprises for us. Maybe I can offer some perspective on how Kubernetes can be even more friendly to the whole machine learning and data science industry. I think, especially when it comes to data and data-intensive workloads, it has something unique to offer compared to vendor-specific cloud-native offerings.
The first thing is a challenge that we encountered. I wouldn’t say it’s unique to Kubernetes, but as you well know, the biggest advantage of Kubernetes is its completely declarative approach to infrastructure. You just describe things in this YAML domain-specific language and things happen. Say I need a service, an Ingress, secrets, whatever, running in my cluster: I do kubectl apply and it just happens. My infrastructure smoothly moves into the state that I just described, and I don’t care how it got there. The challenge with machine learning, or with orchestrating data pipelines, is that the final state doesn’t matter that much; the final state is always “nothing is running”. If I schedule my machine learning training job, I know the end state is not that interesting. The most interesting part is how we get there, because my job needs to run, and it can fail for a thousand reasons as I iterate: it may start requiring more memory, it requires a GPU, it times out, something happens. All those intermediate states are the most interesting things from the data scientist’s perspective, or from the perspective of someone who has to debug and iterate on those workflows. This is the flip side of the benefits Kubernetes gets from the declarative approach: a lot of things happen through operators and controllers, and they all happen asynchronously. So a lot of time is spent in these intermediate states, and the child objects that a controller creates may not always match the state of the parent object or custom resource. That was something we had to spend a lot of time on, reconciling those states and making sure they’re exposed. And coming back to the topic of data on Kubernetes, this is also true for stateful workloads: most of the setup is nicely declarative, but when you talk about making a backup, that is the opposite of declarative; it’s an imperative action that you have to take and that you want Kubernetes to take. It leads to a little bit of friction in creating the best possible human interface to this. That was a little bit of a challenge.

On the positive side, when it comes to data-intensive things, everyone kept telling me that anything to do with data on Kubernetes is scary. It’s not as bad as I expected it to be. It has matured a lot since I first looked at it; there is a lot of tooling, like Ceph and MinIO, if you want to run object stores on Kubernetes, and it’s a relatively smooth and boring experience. Coming from the AWS vendor-specific land, where you pay a lot of money for those things and they still don’t always work, a lot of this is smoother on Kubernetes, sometimes even more so than the commercial offerings; I keep screenshots of AWS Step Functions and Batch just for fun, and they are super clunky. Data storage on Kubernetes is a lot easier to get started with these days, and that alone helps people debug the issues they run into when running those workloads.

When it comes to storage specifically, a lot of time has been invested in Metaflow in creating utility libraries. When data scientists write those workflows, they work with a lot of data, usually stored in some object store, and the Metaflow team at Netflix spent a lot of time making sure that performance is optimal.
It sounds easy to download files from the object store with “get object”, but there are a lot of things that can go wrong.
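As a heavily simplified illustration of why a naive loop of get-object calls isn’t enough, here is a sketch of downloading many objects in parallel with retries using plain boto3. The bucket name, key layout, and tuning numbers are hypothetical, and Metaflow’s actual S3 client does considerably more than this.

```python
import boto3
from botocore.config import Config
from concurrent.futures import ThreadPoolExecutor

# Retries, timeouts, and connection-pool sizing all matter once you try to
# saturate the network link; the defaults are tuned for occasional requests.
s3 = boto3.client("s3", config=Config(
    retries={"max_attempts": 10, "mode": "adaptive"},
    max_pool_connections=64,
))

def fetch(key, bucket="my-training-data"):  # hypothetical bucket
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return key, len(body)

# Hypothetical key layout: many shards pulled concurrently to use the bandwidth.
keys = [f"shard-{i:04d}.parquet" for i in range(256)]
with ThreadPoolExecutor(max_workers=32) as pool:
    for key, size in pool.map(fetch, keys):
        print(key, size, "bytes")
```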
It’s easy to download things; it’s hard to make the object store perform great. Even when the actual object store is operated by AWS, as with S3, writing a decently performant client that downloads files at full bandwidth, say 10 gigabits per second or more, is not trivial, so we spent a lot of time on this. As we started to migrate Metaflow to Kubernetes, we started looking into the current state of data storage solutions, and there have been good surprises. Since we support the S3 API, we get out-of-the-box compatibility with MinIO and Ceph, and that works, although there is still a lot of work ahead of us to make sure the performance is on par with vendor-specific solutions. I used MinIO years ago, and it wasn’t as easy as it is today; they’ve made huge progress in making this accessible. It’s still not 100% clear which way someone should go, and I’m hoping one takeaway from this event is a better idea of which open-source offering on Kubernetes is the most stable and something we can recommend to our customers.

One small takeaway from this talk is that there’s a unique opportunity here for this community and for everyone who cares about running data-intensive workloads on Kubernetes. Compared to vendor offerings, Kubernetes, as complex as it is in some regards, is an opportunity to simplify the authentication and access story a lot for those people. One example: about two weeks ago, we published a joint blog post with Seldon (they develop an open-source solution for model deployment on Kubernetes), where we tried to imagine what a full end-to-end workflow for data scientists would look like if they used Metaflow for orchestrating batch jobs, extracting the data, training the machine learning models, and then deploying those models to production as microservices using Seldon. 90% of this just works and is relatively intuitive. But the biggest caveat was always authorization, because we were doing this with the AWS-native version of Metaflow, not the new Kubernetes version: getting all those credentials through to Metaflow tasks, connecting to S3, Seldon microservices getting access to S3, Metaflow steps being able to deploy things as Kubernetes services. All this piping still leaves a lot to be desired, mostly because we were combining AWS-native storage like S3 with Seldon, which is fully Kubernetes-native. So imagine if there was a solution, like MinIO or Ceph, maybe relying on service accounts for auth, that streamlined this story a little bit. That would be ideal, because this is the biggest obstacle we see people run into when they try to deploy Metaflow or similar MLOps platforms: there are so many moving parts that have to talk to each other, and they all have to authenticate. It’s really hard to get this done on any public cloud, even though the clouds have put a lot of effort into making it smoother. Kubernetes, with its unified API for everything, operators, and infinite extensibility, may be the answer here; at least I’m very optimistic that it is.
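To make the S3-API compatibility point a bit more tangible, here is a small sketch of pointing a standard S3 client at a MinIO (or Ceph RGW) endpoint running inside the cluster instead of at AWS. The endpoint URL, bucket name, and environment variables are placeholders; in a more Kubernetes-native setup the goal would be to have credentials injected for you, ideally via service accounts, rather than wired up by hand like this.

```python
import os
import boto3

# Same S3 API, different backend: only the endpoint and credentials change.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get(
        "S3_ENDPOINT_URL",
        "http://minio.minio.svc.cluster.local:9000",  # hypothetical in-cluster MinIO service
    ),
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],      # placeholder credential wiring;
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],  # this is the part that should be automated
)

# List artifacts in a hypothetical bucket to confirm the client works end to end.
for obj in s3.list_objects_v2(Bucket="metaflow-artifacts").get("Contents", []):
    print(obj["Key"], obj["Size"])
```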
That’s pretty much it. I didn’t have a lot of time, but talk to us on Slack, or you can reach me by email or on Twitter. I’ll be on the DoK Slack for a while if you have any questions. But that’s it.