The CFP is open for DoK Day at KubeCon NA 2024 through July 14.

Submit Now!

How to Create Your Own Metadata-Driven ML Platform from Scratch

This talk demonstrated integrating a data lakehouse with a metadata-driven orchestration engine, utilizing open-source tools like Presto and Kubeflow. Other key learnings included: 

  1. Learn about Presto and how it works with the open lakehouse concept for efficient data analytics. Understand the basics of setting up Presto, connecting it to your data sources, and executing your first queries. 
  2. Explore Kubeflow, a Kubernetes-based platform for ML workflows. We walked through setting up Kubeflow, creating metadata-driven pipelines, and deploying notebook servers. Learn about distributed training operators, hyper-parameter tuning, and model serving within the Kubeflow ecosystem. 
  3. Automating your ML lifecycle, from data ingestion to model deployment. Understand how to implement traceability and manage your workflows efficiently using the tools discussed. 
  4. Learn how to integrate real-time data streams with your ML platform, ensuring your models stay relevant and effective.

Speakers:

Watch the Replay

Read the Transcript

Speaker 1: 00:00 Welcome to the talk and thanks for joining our session. Today we’re going to share the journey of how we use the various open source projects to set up a main data-driven ML platform. My name is Y Hong Wang and this is my coworker Ted Chen. We are working at the open source AI technology group in IBM research. Both of us have been working on various open source project, I think at least eight years. Most of the projects we work on recently are related to machine learning like cool flow pipeline, TensorFlow, Ray, re, and Flink training operator, just them a few. So we are committed contributors to maintenance as well as actually active cases. So based on our experience and explorations, we conduct a ML play phone. I’d like to share it with you. The play phone comprises many open source projects, so we’ll go through them one by one and provide the information you need to set up the platform. So hope this can help you to seamlessly integrate ML into your business solutions.

01:21 So here is the agenda. So we will actually cover two main topics and those are very important for the ML enablement. The first one is the open deck house. I think open deck house Lakehouse itself has become a very trendy topic these two years is the foundation actually of the ML platform that we construct. It combines the feature of the warehouse and data lake and is used by many MLL persona for example like data analytics and the future engineer and data scientist. So I will cover how to build a open lakehouse and IT key component later. And the second part is the meta data driven ML pipeline. So you are not able to actually adapt the ML successfully without an ML pipeline framework. So here we believe K flow is a perfect solution for lease and K flow is actually a very big umbrella for a group of ML projects with a billion in mayada driven pipeline called K flow pipelines.

02:33 So Ted, he will share more information about it along with other projects that you can easily integrate with the K flow, including a KT for hyper tuning, a training operator for model training, K server for model serving, et cetera. So without further ado get started. So I think first thing first they start with the ML lifecycle. The high issue right now everyone is tackling here is the bird’s eye view of the AL lifecycle. So it actually contains three pillars, data model and deployment. So yeah, data model and deployment. So the lifecycle always starts with fun data and then you use the data to build the models. They automate decisions. So in a data phase, you acquire data and you x-ray and analyze the useful features. In the model phase, you use the state of art technique to compose and train your model. Finally, in the deployment phase you deploy the model and integrate them into US solution or products.

03:41 Actually let’s dive into the detail of the ML lifecycle and have a clear picture of it. I believe if you try to search M ls lifecycle, you will see a lot of different flow and diagrams here is just one of ’em, but I believe all of them actually all carry a very same message. The end-to-end ML lifecycle is extremely complicated. So here you can still see the three phases I mentioned earlier, the database, the model phase and the deployment phase, and different you see each phase is actually institute of multiple tasks and operations as you can see here. So different tasks are conducted and performed actually by different teams with specific expertise and skills. For example, data engineers works on data injection data transformation and data scientists work on data analysis and feature illustration. ML practitioners work on model composing and model optimization. ML engineers work on model training and scale and model validation and DevOp and software engineers work on model deployment and application integration. So obviously having a platform to facilitate the entire ML pipeline and enable various teams to work on their task becomes a must have to success in the ML enablement.

05:27 So when we adopt the ml according to our experience, you will face the following challenges and we try to categorize it in two categories. The first is the data preparation of course because everything different the data and the data volume keeps growing and growing in the ML world. And then you need to be able to retrieve and process data fast enough to meet your deadline, achieve a faster time to market strategy. And the second is the data type. They are structured semi-structured or unstructured data types. And I believe for most of the ML solution you need to deal with semi-structured data and even unstructured data, AKA, we call it schema on read, but you may also need to handle structured data from time to time. So we need a solution to be able to handle this different type data type without complicated data moving and transformation processes.

06:33 And also the flexibility and elasticities data usually come from multiple data source and they are different type of workloads in data reprocessing for example, like interactive or ad hoc data queries or batch processing. And of course how much the platform would cost is one of the important factors. And if the platform is using open format or vendor locked format, it’s also important. And for the second part, the ML complexity is always one of the challenge. We know the ML lifecycle earlier is very complicated and require correlation from multiple teams and the more people work on the same system, the greater likelihood of errors and the harder to debug. So error pro is another challenge and the entire ML lifecycle is a lengthy process. You need a system that is flexible enough to actually adjust the resource based on the workflow and entire platform. So last but not least is actually the automation because it’s very lengthy. So definitely we want an automation mechanism to speed up the ML life cycle and the mechanisms need to be easily manageable and can apply to various kind of tasks because you can see each of the ML tasks or operation they require different methodology or techniques.

08:10 Okay, so here is our Did I sorry, I actually, okay. So apparently I think our solution for handling data reprocessing the load challenge I mentioned earlier, the answer is very obvious. It’s actually the open lakehouse. So open lakehouse stitch together the features of the warehouse and data lake alter data and bring the traditional data analytics and advanced functionality for MLS scenarios, a lakehouse is still used AB object storage to store the different type of data which keep the cost low. And one of the key component of lakehouse is actually the processing engine and usually it actually can provide the capability to connect various data source and run queries at scale. And based on the name and open Lakehouse definitely use the open standard and is solely formed by using open source projects. And another key component of the lakehouse is actually the table format. Table format forms an extraction layer above the storage which provide reliability and great performance while processing huge dataset including asset transaction caching, oral partition time travel, compaction, et cetera.

09:42  And I think I will cover it in just a bit. So here is actually a high level open lakehouse architectures and actually comprise multiple layers. So starting from the bottom layer you can have SQL database, cloud object storage, HDFS and more and above the data storage is actually the open file format including Parque Al. And as for the table format, there are several choices including Apache, iceberg downtown Lake, this one and as the Apache hoodie. So different table format actually provide different features and in our solution we use Apache Iceberg because we believe it provide more features that are suitable for ML usage. And I will cover it in just in the next slide. And above the table format is the key component for the lakehouse is the processing engine. And here we have the Trino, Presto and Spark, Trino and Presto. These two are the SQL engine and Spark is actually more for use for the batch processing. And then on top of the processing engine you can have a bunch of open source visualization tool or business intelligence tool. They can connect to your Lakehouse and do the query data analytics and also other dashboard stuff.

 11:22 So like I mentioned earlier, we choose in the table format, we choose the Apache iceberg because it is actually designed for handling meta by scale dataset and solving consistency and performance issues when storing data on cloud storage. And it’s actually initiated by the Netflix and they donate to the open source community and use a series of meta data to check the data change and meanwhile maintain the performance of data assets. So table change, create a new stand shop which are linked with multiple data file and use the atomic switch to support atomicity consistency, isolation and durability, a K the asset and which ensure the data reliability and integrity even with multiple assets at once. So it actually also provide other features like time travel, the ability to query the table based on your state at a certain point of history. And also partition evolution allow you to change the columns on which table is petitioned over time. And also the schema evolution, it allow you to add rename, reorder and delete columns.

12:45 And as for the processing engine, we use the Presto, Presto or some people call it Presto database, but it’s actually a SQL engine. It’s an open source distributed SQL engine. They can query large dataset from various data source ranging from regional database, no sql, real-time data source and even streaming data source as well. Presto connect to those data sources using a flexible plugin mechanism we call connector and query the data where it reside without moving the data around. So Presto is create to for the interactive and ad hoc data analytic at very beginning. So it leverage the distributed architecture in memory processing, query optimization, hierarchy caching, and a lot of features to perform the query in the lowest latency. Whether you are dealing with petabyte or data or real time streaming analytics, press or ensure that your query can run animal latency enable you to make the inform decision quickly. First all run SQL queries because it’s a SQL query so it comply with the nsy SQL standards. So I think in the data and world, most users already know how to write the seal query. So preso is easy to access so you don’t have to actually learn a new language. Last but not least, the Preso is an open source project with a big community and is used by many companies including the original company who invented is the meta and the ance Uber, IBM and more.

 14:32 And for the very top layer I mentioned earlier, it is the visualization and the BI tool unless they, because the Presto provide the JDBC and rest API. So you can easily set up those BI tool to connect to the Presto costs and to retrieve the data. So Zel community also provide Python and JavaScript XDK for integration. So you can easily integrate BI tool using those SDK. So for us, we try to set up the zip link is one of the data visualization tool. It used the CDV connector to connect to the preto and we also set up the Jupiter notebook, which is using the Python SDK to connect to the Preto. And on the right hand side here is actually another visual agent tool is AP superset. You can also integrate it with the Presto easily, but in our deployment we don’t use it, but I highly recommend that you can use it as well.

 15:41 So here is a complete topology that we set up for the data reprocessing. You can see we deploy everything from the Kubernetes of course you can actually switch to OpenShift for sure. And on top of it we have the, sorry, we the menial cloud storage and we also have the MongoDB, which is the semi semi-structured data. And MySQL is for the structured data. And we used, like I said earlier, we use the iceberg table format and this main is used by the iceberg to store the main data information. And on the pres layer we actually use the per helm cha to construct a three cluster. One is the cluster coordinator and the other two other workers. And on top of, like I mentioned, I use the Apache zip link to do the visualization and the GPN notebook for data scientists, they can operate, they can try out the personal CQL and do their data analytic stuff on the GN notebook.

16:50 I think I want to demo some of them, but we try to see if we can run their live demo. But we found the network is not that stable. So we try, we actually record the video earlier. And then let me show you. So here is the press cluster. It actually has a tunnel cluster like I mentioned earlier. And in the personal dashboard actually I just recently contribute the SQL client into the community. So you can actually run the SQL query directly on the ui. And in here you can see we actually have connect to different data source. We have the Iceberg MongoDB, MySQL and some benchmark database.

Speaker 2: 17:52 I want

Speaker 1:  17:52 To, yeah, and like I said, you can run Thes Query and here you can see actually I just tried to run a federated query. It actually query from the Myco DB and MongoDB and it joined those two table from two different data source and give you the result.

Speaker 2: 18:10 Yeah,

Speaker 1: 18:10  It’s a very similar query. And another thing I want to show you is the zip link visualization tool. So it’s very easy to compose a visualization tool, the data visualization using the zip link and you can see the seal query and you can run it and talk to the Presto cluster, which is just constructed. So those are the very fancy visualization tool that Apache Zipline provide. You can see the single query we have and then you can just run it and it will give you the result very quickly and plot the chart. So I think that’s my part. And I think I can also show you another part I mentioned earlier is the press SQL is the Ian notebook I mentioned earlier where you go,

Speaker 2:  19:01 Sorry,

Speaker 1:  19:02 Give me one moment.

Speaker 2:  19:05 So press s. Yeah, it’s here. Sorry.

Speaker 1:  19:15 Yeah, this is a notebook. So in a notebook you can see we actually use the Python SDK, the Presto, Python, SDK. So it can easily help you to connect to the Preto import library connect space by the server and you can run the query directly in the notebook. So yeah, you can see it just query the result back. And here is the same query I issue on the single client ey it just try to do the query query. And then here because want to date more how I can actually inject the data into the iceberg table. So I try to clean out the table and in here you actually do the February query and inject the data into the iceberg table. So the last query is I try to query those data directly from iceberg and iceberg underlying is using the pocket type file format so it can easily hand over to the data scientist.

20:15  Okay, so that’s back to the chat. So here are the information that help you to set up the environment that we set up. Starting from the Presto, here are the Presto repository and information and a lot of connectors that you can set up connect to and Presto helm chart that we use to set up the Presto on the Kubernetes cluster and also Apache iceberg. And we also have two actually workshop lab that we conduct to help a user to learn the Presto and also the Presto plus icebreaker. So these are two workshops that you can easily attend is free and publicly available. And he has the step-by-step in structure to help you to learn how to use the Presto, how to set up Presto and how to set up the icebreaker with the Presto. And I think here is I finished in my par, so I will hand over to Ted. He’s going to share the solution for ML Lifecycle

Speaker 3: 21:18  Everyone. Thanks VO for the data challenge. We just talked about how Presto can be used to solve some of the data challenges. The other challenge remaining is the automating the ML life cycles. Because after you turn data into features, typically the next step is model training. And then you may have to do hyper permit turning to optimize the model. After the model is trained, you will deploy the model, but model accuracy may drop, then you will need to start the whole process from the data again. So the heart of the Q flow is the pipeline or it’s called KFP. We can use the pipeline to break the MLI cycle into smaller tasks or boxes. The line and arrows determine the order. Each task is processed like flow chart diagram. Each of box may run containerized tasks like data reprocessing or training or deployment models. Some of the tasks can be handled by the prebuilt components. We can use Python DSL or Jupiter Notebook extension to compose pipelines. And once you have the pipeline, you can schedule to run it automatically.

22:50  Q4 pipeline is a metadata driven pipeline. Internally the pipeline backend storage runtime information of a pipeline run in a metadata store. Runtime information includes status of tasks, availability of artifacts, custom properties associated with each exclusion or artifact. Metadata store also enable pipeline step caching, which means if arguments are exactly the same, a task will be skipped and old output is reused. So let’s look into the Q flow core component and their usages. So first one is a Q flow notebook. It provides web-based development environment inside your cube net cluster. It provides an interactive environment for you to implement any task like data analysis, training, and deployment tasks. You may also import additional notebook image which provide other functionalities like local or no code to compose a pipeline using drag and drop approach. And for hyper permit tuning we use kib. Once you have written your ML training code, you can parameterize it and kib runs trails to figure out the best parameters to run the learning process. KIB also has Python St K to create Kat Experiment. Pipeline also has prebuilt components to run ment. So for model training we use Q Flow training operator. It is used for fine tuning and distributed training of ML models. IT both ML framework such as PyTorch, TensorFlow, tributes and others. It allows you to deploy ML training workloads using Python’s, DK and custom resources. And last one for deploying models, K server, standard model inference platform and Kubernetes. Built for highly scalable use cases. It provides standard inference protocol across ML frameworks. It offers production ML survey including prediction. It can also be used for monitoring drift bias and so on.

25:08  So let’s take a look, a quick demo, which I have recorded earlier because maybe the network may small, slow. Okay, so let’s see how we can create a simple pipeline using Q4 notebook. So I’m using a notebook with ELI R extension as mentioned before, it offers NOCO or low-code approach of creating pipelines. I have three notebook which I already created. One for data aggregation, one for load and training models, another one for serving. I drag and draw the notebook into the canvas and then connect them with lines so they can run in sequence. And then I click the play button to execute the pipeline using the Q flow pipeline. And then I can go to the dashboard, I can go to the Q flow dashboard to see the result. Oops, what is zoom thing?

Speaker 4:  26:46 Switch back to the video. The video.

Speaker 3:  26:49 The video. You lost

Speaker 4:  26:50  The video? The video? Yeah, the video.

Speaker 3:  26:58  Okay, here

Speaker 2:  26:59  You go.

Speaker 3:  27:01  Yeah, so once we execute the pipeline we should be able to go through the dashboard and see while I was running running we can see the output of it, like the logs, like the events and metadata and so on. And once the pipeline finished, we should be able to use a case serve center inference, API to get the result of the prediction.

27:36  Yeah, that’s it for the demo. And so we are using Q Flow has many distributions, right? We install the IBM cloud distribution on standard Kubernetes cluster with additional components like Kerv and EL IR extension. Moreover, each cube flow distribution may package different components. For example, the OpenShift AI is based on the OpenShift Red Hat Open Data Hub distribution. It offers additional open source component and SDK for running large language model workload. They also supports a third party data ops service for running large queries. The IB and Watson X AI is built on top of the OpenShift ai. So this is it for our talk. So to summarize, we talked about solution data challenge with Open Lakehouse using Presto and Iceberg. We also talked about solutions to ML lifecycle challenges with Q Flow. We demo you how to create a simple pipeline to automate these solutions. So thank you for attending and hope you enjoyed.

Speaker 1: 29:01 We just got information. We have five minutes for the questions, so if you have any questions just but it is so bright I actually cannot see.

29:16  Yeah. So yeah, just like we mentioned earlier, those are, we actually, because we went through a lot of different open source projects, so we think it’s actually really interesting to bring all of them together. So from the data all the way to the ML pipeline deployment. So it’s very important. Most of the time when we look at the M solution, you only, for example, for the data part, maybe you only contain the data, but for the ML part you only contains the model training or maybe the deployment only. But we definitely need a solution from end-to-end starting from the data and manipulate the data and then hand over to the data scientist to do the model training and position it to compose the model. Yeah, so we think it’s really good. Yeah, I saw it. Yeah.

Speaker 5: 30:19  Features like a feature store kind of a thing. Have you thought of something like a feature store where data is a special meaning and special use case because it is going to an ML model because it comes with its own special training and serving

Speaker 1: 30:34  Needs? Yes, I think it is a good question. And then there’s also the other reason we actually pick up the Apache iceberg as the table format because it actually support a lot of other feature stores. So you can easily connect to other feature stores that you choose that you can easily integrate into our data source. So for example, when you do the data reprocessing and you can using, in our scenario, we have the MySQL, we have the MongoDB and data may be also from some S3 storage. So you can use the Presto to have this kind of federated query and then you can dump into the iceberg table format and then you can have your feature store and talk to the iceberg table format and retrieve the data. Yeah,

Speaker 6: 31:37  I was wondering about your data gathering. Are they all custom bespoke solutions or do you have some sort of Flink ingestion engine or a Spark ingestion engines that you run to gather your data?

Speaker 1: 31:53  That’s a good question, but because as you know, pat and I, we are a developer and we contribute to those open source projects. So we actually try to find some good scenario and also dataset. But the question you have is that do we have any scenario that do the data injection? But we don’t have that because in our scenario we just try to compose a platform and provide other users to use it. So if you have any good data injection scenario, definitely yeah, you’re welcome to share with us and we can try that out on our platform. Yeah, so I think that’s it. So thanks for attending again, and yeah, those are the information we share and we’d like you to try it out and also maybe give us feedback. Thank you.

Data on Kubernetes Community resources