
Advanced CSI-FUSE Filesystem for AI/ML Data Management in Kubernetes

AI/ML workloads, known for their data intensity, often depend on cloud storage. However, Kubernetes faces significant challenges in accessing this cloud storage data efficiently, primarily due to the lack of a Kubernetes volume interface and a heavy reliance on object storage-specific APIs. In this session, Lu Qiu discussed how to use an advanced CSI-FUSE filesystem to address the above challenges. She also analyzed various industry-standard CSI-FUSE products to guide users in selecting the most suitable CSI-FUSE solution for their specific Kubernetes AI/ML workloads. Other key learnings:

  • Using a CSI driver combined with a FUSE driver facilitates access to cloud storage data, mimicking a local filesystem and thus eliminating the need for specialized APIs
  • Using FUSE driver caching capabilities to boost performance for I/O-bound AI/ML applications
  • Managing resources effectively by aligning the caching lifecycle with application requirements.

Speaker:

  • Lu Qiu – Active open-source contributor, Alluxio

Watch the Replay

Read the Transcript

Speaker 1: 00:00 Hi there. Today we will share how we manage data for AI, especially in Kubernetes. We'll introduce two solutions together. One is CSI FUSE, the one you already saw in the schedule, and the other one is a newer one based on etcd and fsspec. The name may sound a little confusing at the beginning, but I will go through them one by one. First, let me introduce myself. I'm Lu. I'm an open-source PMC maintainer of Alluxio, previously named Tachyon, and I'm also the AI platform tech lead at Alluxio.

00:40 We'll quickly go through the AI data management problems in Kubernetes, and then we'll introduce the two solutions: CSI FUSE and the etcd-backed fsspec approach. So, quickly through the problems: Kubernetes takes away a lot of the deployment headache, but data scientists still need to think about so many problems other than the AI logic, like where is my data, why is my data so slow, why does it cost so much money, et cetera. How can we take all of those storage overheads away from data scientists and let them focus on the AI logic more than before? We'll talk about the first solution, CSI FUSE. First, a little bit about what FUSE is. FUSE has a kind of magic: basically, it can turn your cloud storage into your local filesystem format. Imagine you are opening a folder on your local Mac. It may look like just a folder or a file in your local filesystem, but it can be backed by some cloud storage like S3 or GCS.
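
As a minimal illustration of that idea, the sketch below reads a FUSE-mounted bucket exactly like a local directory. The mountpoint path and file layout are hypothetical; the mount itself would be created beforehand by a FUSE driver such as s3fs-fuse, gcsfuse, or Alluxio FUSE.

    import os

    # Hypothetical FUSE mountpoint backed by s3://my-training-bucket
    # (mounted outside this script by s3fs-fuse, gcsfuse, or Alluxio FUSE).
    MOUNT = "/mnt/s3/my-training-bucket"

    # The bucket looks like an ordinary local folder to the application.
    for name in os.listdir(os.path.join(MOUNT, "train")):
        path = os.path.join(MOUNT, "train", name)
        with open(path, "rb") as f:   # plain POSIX open(), no S3 SDK needed
            data = f.read()
        print(name, len(data), "bytes")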

01:50 So it has a much bigger capacity than your local disk, but you can use it just like your local disk. All the data scientists really like this kind of logic; they think it's wonderful. But everything comes with a cost. One cost that comes with FUSE is that it needs more resources: CPU, memory, and disk, because it launches kernel threads to help deal with the operations, it uses some of the kernel cache, and it also uses some disk for the local cache to improve performance. All of those come with a resource cost. Then our data scientists come back and say: I only need the FUSE mount for the dataset I'm actually using in my AI training logic; if I don't need it, can you just not launch it?

02:43 CSI makes this really nice: it launches the FUSE pod only when the dataset is needed. The FUSE pod deals with the storage logic while the application pod deals with the AI logic, and CSI keeps the lifecycle of the two in sync, so the FUSE pod is only launched when your application needs the dataset. Everything sounds perfect, but data scientists have another question: why is my data access so slow? As we all know, accessing cloud storage data does not happen at local speed, whereas accessing local data is fast, close to local disk performance. So the question becomes how to deal with the slow performance. One solution is a local cache. Different FUSE solutions like s3fs-fuse have a built-in local cache; others like gcsfuse also have a built-in local cache to solve this problem. But sometimes an AI application actually needs data to be shuffled and shared between nodes, and that brings out the need for distributed caching.

04:00 On the left side is the training node; on the right side is the storage system, and we split the logic between the training node and the storage system into two parts, thanks to FUSE, which can turn cloud storage into a local folder for your training nodes. In the middle, we can add a lightweight distributed caching layer to provide high-performance data access. In this case you can access the cloud storage data like local data while also enjoying the performance benefit coming from the distributed caching system. We've done some benchmarks, like CV data loading, comparing Alluxio, s3fs-fuse, and the boto3 S3 API: Alluxio is five times faster than s3fs-fuse and more than ten times faster than using the boto3 S3 API directly. Sometimes our users show us a graph, like a TensorBoard chart, of how much time is actually spent in data loading versus GPU utilization. The more time spent in the data loader, the lower the GPU utilization rate. With Alluxio's FUSE solution we can reduce the time spent in data loading from 82% to 1%, which directly improves the GPU utilization rate from 17% to 93%.

05:28 This sounds amazing, but everything comes at a cost, as I mentioned. There are two problems with FUSE on Kubernetes. The first is that you actually need a separate container for the FUSE part, because it costs resources and the logic is fairly complicated. The other is that it needs system admin privilege. Many Kubernetes folks tell me they don't want to grant us system admin privilege because they don't know what we will do in their cluster; they don't want us to ruin the host machine or anything else. So these two are the overhead of maintaining FUSE on Kubernetes.

06:10 Then we jump to the next solution, the etcd-backed fsspec. First, what is fsspec? Have any of you heard about fsspec? Please raise your hand. Oh, there are some people here, I'm really surprised. Have any of you heard about Arrow? Please raise your hand. Oh, many more hands, I can see that. Thank you. So starting from Arrow: Arrow connects different data formats. You can read from one data format in one line in Arrow and then write it out to another data format in one line. On the other hand, Arrow connects different frameworks together. You can read data from one framework and write it to another framework with just a few lines of code, which makes everything connect. But when Arrow needs to deal with the storage system, it goes to fsspec.
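
For instance, a one-line read and a one-line write with PyArrow converts between formats; the file names below are hypothetical.

    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Read one data format in one line...
    table = pq.read_table("data.parquet")
    # ...and write it out as another format in one line.
    feather.write_feather(table, "data.arrow")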

07:12 It basically delegates all the cloud storage operations to fsspec. fsspec defines the filesystem interface for Python and kind of learns from two frameworks: one is Python io and the other is FUSE, which we just mentioned. But compared to Python io and FUSE, this interface is much simpler, and it is especially easy for a cloud storage vendor to implement. For example, if you want to implement a storage backend, you basically only need to implement three APIs: list status, get file status, and read range. That's it. So it's really simple to implement from an engineer's perspective. And there are also many fsspec implementations on the market, like s3fs, Azure's fs, gcsfs, HuggingFace's fs, nearly every storage vendor you can think of. Compared to FUSE, it's easy to implement, it's designed for cloud storage, it's easy to use, and it does not require an extra container or system admin permission. Talking about performance, it has a similar issue to FUSE: it needs some kind of local cache to provide faster data access, and it is also not designed for AI workloads with data shuffling and sharing logic.
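
To make that concrete, here is a toy fsspec backend that roughly mirrors the three operations mentioned above. It keeps its data in an in-memory dict and is purely illustrative, not the interface of any real vendor implementation.

    from fsspec import AbstractFileSystem

    class ToyFileSystem(AbstractFileSystem):
        """Toy in-memory fsspec backend: list status, get file status,
        and ranged read, as described in the talk."""
        protocol = "toy"

        _files = {"/data/a.txt": b"hello", "/data/b.txt": b"world!"}

        def ls(self, path, detail=True, **kwargs):
            path = "/" + path.strip("/")
            entries = [
                {"name": p, "size": len(v), "type": "file"}
                for p, v in self._files.items()
                if p.startswith(path.rstrip("/") + "/")
            ]
            return entries if detail else [e["name"] for e in entries]

        def info(self, path, **kwargs):
            path = "/" + path.strip("/")
            return {"name": path, "size": len(self._files[path]), "type": "file"}

        def cat_file(self, path, start=None, end=None, **kwargs):
            data = self._files["/" + path.strip("/")]
            return data[start:end]

    fs = ToyFileSystem()
    print(fs.ls("/data"))
    print(fs.cat_file("/data/a.txt", 0, 4))   # ranged read -> b"hell"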

08:39 So we bring out our cloud-native distributed caching system, our new generation of distributed cache built for Kubernetes. On the left side is the fsspec client; on the right side is the distributed cache with its servers. For a read, we first check whether the data is local. If the data is already cached locally, great, we can get it locally; if not, we go to the distributed caching service. Then we need to know: I have so many Alluxio workers that may have cached data for me; which worker is most likely to have this data cached, so that I can get it at a faster speed? So there is worker-selection logic. To select a worker we need information about those workers, and that worker membership information sits in etcd. etcd is responsible for periodically collecting the worker list: all the workers basically say hi to etcd, and etcd then provides that membership information to our clients periodically.
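
A minimal sketch of that membership flow is below. In a real deployment the list lives in etcd and is refreshed by worker heartbeats; the in-memory registry and TTL here are purely illustrative stand-ins.

    import time

    class WorkerRegistry:
        """Toy stand-in for the etcd-backed membership described above:
        workers periodically say hi, clients read the live worker list."""

        def __init__(self, ttl_seconds=30):
            self._ttl = ttl_seconds
            self._last_seen = {}   # worker address -> last heartbeat time

        def heartbeat(self, worker: str) -> None:
            # Called periodically by each cache worker ("saying hi").
            self._last_seen[worker] = time.time()

        def live_workers(self) -> list:
            # Called periodically by clients to refresh their view.
            now = time.time()
            return sorted(w for w, t in self._last_seen.items()
                          if now - t < self._ttl)

    registry = WorkerRegistry()
    registry.heartbeat("worker-0:29999")
    registry.heartbeat("worker-1:29999")
    print(registry.live_workers())   # ['worker-0:29999', 'worker-1:29999']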

09:49 Based on that worker information we create a consistent hash ring, so we can know that, say, worker zero is most likely to have the cached data, and we go to worker zero to serve the request for a certain file. And if anything happens, like the data is not cached, the ring goes wrong, the network has an issue, or you are actually not doing a read but a write, then all those operations go directly to the underlying fsspec implementation such as s3fs, gcsfs, or HuggingFace's fs. Basically, we add one layer of distributed caching logic on top of the existing under file system (UFS), like S3, GCS, et cetera.
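
A simplified sketch of the consistent-hash worker selection just described; the hashing scheme and virtual-node count are illustrative choices, not Alluxio's actual implementation.

    import bisect
    import hashlib

    def _hash(key: str) -> int:
        # Stable hash so every client builds the same ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        """Map each file path to a preferred cache worker so repeated
        reads of the same file land on the same node."""

        def __init__(self, workers, virtual_nodes=100):
            # In the real system the worker list comes from the etcd
            # membership; here it is just a static list of addresses.
            self._ring = sorted(
                (_hash(f"{w}#{i}"), w)
                for w in workers for i in range(virtual_nodes)
            )
            self._keys = [h for h, _ in self._ring]

        def worker_for(self, path: str) -> str:
            idx = bisect.bisect(self._keys, _hash(path)) % len(self._keys)
            return self._ring[idx][1]

    ring = HashRing(["worker-0", "worker-1", "worker-2"])
    print(ring.worker_for("s3://bucket/train/part-00001.parquet"))
    # Adding or removing one worker only remaps a small fraction of files.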

10:37 One question many people ask is why this distributed caching system is cloud native. Alluxio has a previous version, and when we actually ran it in Kubernetes clusters, many Kubernetes folks complained to us: we don't want something stateful. Why stateful? Can you make it stateless? Our whole Kubernetes cluster does not allow any stateful data system. That stuck with us, because we used to think of it as a data system where data needs to sit there persistently. But considering it is a caching system, the data is allowed to be lost, just as your kernel cache data can be lost and re-fetched. So we developed a new system built to be stateless. Consider the workers here: you have a common cluster, and many services share the same cluster.

11:34 Sometimes the training jobs get higher priority and we want to give them more caching resources, but sometimes other jobs get higher priority and we may want to kill the whole Alluxio cache, scale it down, or move it to other nodes because several nodes are under maintenance. In this architecture, if you do make changes to the Alluxio cache, it temporarily affects the cache hit rate: the newly launched worker doesn't have the previously cached data, and because the hash ring shifts a little, there are a few cache misses. So performance is affected a little in that case, but as time goes by and the data is cached again, the performance catches up. That's the trade-off we offer our users: they can decide whether to go stateful, so the cached data is more stable, or stateless, which gives more control to the Kubernetes logic, like the scheduler and the other resource management logic. On the other hand, this architecture is highly fault-tolerant and highly available: for anything that crashes, we fall back to the underlying filesystem. The underlying filesystem is always the source of truth. Even if the whole Alluxio cache is killed by the Kubernetes cluster, you are still able to serve requests; basically, all operations fall back to the underlying filesystem.
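
A sketch of that fallback behavior, assuming a hypothetical cache client object; the underlying fsspec filesystem (for example s3fs for s3:// URLs) stays the source of truth.

    import fsspec

    def read_with_fallback(path: str, cache_client=None) -> bytes:
        """Try the (hypothetical) distributed cache first; on a miss or
        any cache-side error, fall back to the underlying filesystem."""
        if cache_client is not None:
            try:
                data = cache_client.get(path)   # hypothetical cache API
                if data is not None:
                    return data                 # cache hit
            except Exception:
                pass                            # cache trouble -> fall through
        # Source of truth: the under storage, resolved by URL scheme
        # (e.g. s3fs for s3://, gcsfs for gs://).
        fs, stripped = fsspec.core.url_to_fs(path)
        return fs.cat_file(stripped)

    # Example (requires the matching fsspec backend, e.g. s3fs, installed):
    # read_with_fallback("s3://my-bucket/train/part-00001.parquet")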

13:10 Now we'll talk about how to use this architecture in the real world. One example is using Ray with Alluxio. Let me talk a little bit about Ray first. How many of you know Ray? Please raise your hand. I can see some familiar faces raising their hands again, and some other folks too. Ray is designed for distributed computing. Ray uses a distributed scheduler to dispatch training jobs to available workers, and it enables seamless horizontal scaling of training jobs across multiple nodes. It also has a really nice feature: it can parallelize your data loading and your training logic. For example, while you are training on partition two, you can preprocess partition one and load partition zero, something like that. It does those things in parallel to fully utilize your GPU and CPU resources.
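
A minimal sketch of that overlap with Ray Data, assuming a hypothetical S3 path: Ray reads blocks lazily and loads upcoming data while the training loop consumes the current batch.

    import ray

    ray.init()

    # Hypothetical dataset location; Ray Data reads it lazily in blocks.
    ds = ray.data.read_parquet("s3://my-bucket/train/")

    # Iterating in batches lets Ray fetch upcoming blocks in the
    # background while the current batch is being used for training.
    for batch in ds.iter_batches(batch_size=1024):
        pass  # placeholder for the actual training step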

14:12 But there are also some performance and cost implications with Ray. For example, when you use Ray with PyTorch for AI training, in each epoch you need to load the whole dataset. It reuses the dataset at the cluster level, but for each epoch you basically have to reload the whole dataset again, because Ray doesn't really cache the data; it only fetches the data needed by the immediate task. You also cannot optimally cache the hottest data across multiple training jobs, so you cannot have data-sharing logic and you might suffer a cold start every time. We went into the Ray Slack channel and found user headaches about this data reuse problem, especially when they are working with large amounts of data.

15:12 In the Ray ecosystem, Ray is the unified compute layer for machine learning pipeline orchestration, and it connects different frameworks together, like data processing frameworks and training and inference frameworks such as PyTorch and TensorFlow. Originally it loads data directly from remote storage, and Alluxio acts as a high-performance data access layer between the compute layer and the storage system to provide better performance for the compute job. The usage is also pretty easy: you create the Alluxio filesystem, then you read data using the original S3 URL or the original UFS URL, and you pass in your filesystem. So the usage is pretty simple.
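
A sketch of that usage pattern. In the talk the filesystem would be Alluxio's fsspec implementation; to keep this example self-contained it uses plain s3fs, wrapped so Ray and PyArrow can consume it, while the data is still addressed by its original S3 (UFS) URL. The bucket path is hypothetical.

    import fsspec
    import pyarrow.fs as pafs
    import ray

    # In the talk this would be the Alluxio fsspec filesystem; plain s3fs
    # is used here only to keep the sketch self-contained.
    fsspec_fs = fsspec.filesystem("s3")

    # Wrap the fsspec filesystem so Ray / PyArrow can use it, and keep
    # reading with the original S3 URL.
    ray_fs = pafs.PyFileSystem(pafs.FSSpecHandler(fsspec_fs))
    ds = ray.data.read_parquet("s3://my-bucket/train/", filesystem=ray_fs)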

16:05 By using this framework, especially with large Parquet files, we are able to achieve two to five times better performance compared to same-region S3. And it not only improves performance but also reduces data transfer costs, because the data is cached. Originally, large, highly concurrent AI workloads hit your storage system directly, and the storage system may say: look at my data access rate limits, you cannot do that much reading in that short a period of time. It may just error you out, or it may only give you a certain throughput and latency. That's one thing. The other thing is the data transfer cost, both the storage side of it, like the vendor's transfer charges, and also the people managing the data transfer.

17:06 On the other hand, we found there are many redundant API calls, like the S3 operations for list status and get file status. For the storage system, we usually found that even if you just read a really small file, probably very quickly, to be able to read that small file you need at least three metadata calls. So consider reading something like ImageNet with 1.3 million files: maybe the read calls are around a million, and then the metadata calls can go up to four to five million, and that is a huge request rate for the storage system. It may be under heavy load or give you bad latency. And that's it for our talk. I think we still have a few minutes to answer questions, and feel free to scan the QR code if you have any follow-up questions. We have all the engineers on our Slack channel, and I'll also share other learning materials online. Thanks so much. Any questions?
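
A back-of-the-envelope version of that amplification; the counts mirror the rough numbers in the talk rather than any measurement.

    # Rough metadata amplification from the talk's example figures.
    num_files = 1_300_000            # e.g. an ImageNet-scale dataset
    calls_per_small_read = 3         # list status + get file status + read
    total_calls = num_files * calls_per_small_read
    print(f"{total_calls:,} storage API calls")   # ~3,900,000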

Data on Kubernetes Community Resources