Cloud-Native Dataspaces: Experiences from the German Research Data Ecosystem

Data on Kubernetes and stateful applications have gained remarkable adoption across the community. But why stop there? Kubernetes and cloud-native tools can provide compelling core technologies for building sophisticated data ecosystems, from advanced metadata handling to workflows and events. Enter the realm of “dataspaces,” a transformative concept empowering organizations to seamlessly integrate and synchronize data sharing patterns for diverse existing data landscapes that can even extend across organizational boundaries. Our session gave practical examples of how Kubernetes and open source tools can be used to harmonize a heterogeneous data ecosystem. The resulting cloud-native dataspaces increase interoperability, reduce vendor lock-in, and create an overall boost in operational efficiency by enabling the use of modern DataOps practices.



Read the Transcript

Speaker 1: 00:00 Okay. Hello everyone. Sorry for the little technical difficulties. Today I want to talk to you a little bit about cloud-native data spaces and our experiences from the German research data ecosystem in building them. My name is Sebastian. I’m a cloud researcher and bioinformatician from Justus Liebig University in Giessen, Germany, and as a bioinformatician I’m dealing with quite a lot of data. Almost 15 years ago, Eric Schmidt made a famous statement: there were five exabytes of information created between the dawn of civilization and 2003, but that much information is now created every two days. And this was 15 years ago. Most people talk about big data, but some people don’t really know how to define or characterize it. Some people characterize it with the three Vs. The first V is velocity: data is generated at a very high speed.

01:08 Real-time data becomes even more important with machine learning and AI solutions, and high-throughput data is more common than ever. Then there’s variety; especially from a research perspective, there’s quite a lot of variety. Data can be structured, unstructured, or semi-structured, and all of this plays a very important role in dealing with data. And there is the elephant in the room, the volume of data. It’s estimated by the IDC that this year 120 to 160 zettabytes are generated globally, an increase of about 25% per year. In the field I’m from, where genomics data plays a major role, the NCBI estimates that data is approximately doubling every seven months, so it’s even worse in research. And then there are some additional requirements. The Data Act is another elephant in the room; it states that data sovereignty is quite an important topic, and data privacy and having fair access to data are also quite important.

02:24 The world is also increasingly interconnected, so data from different sources needs to be put together and analyzed together to make sense of certain cross-domain analyses. And so there are quite a few initiatives. I’m a research person, so there are some scientific initiatives. For example, there’s the European Open Science Cloud, which tries to enable research with a common cloud infrastructure funded by the European Union. There is a German research data initiative, NFDI, where certain consortia try to build structures and infrastructures to deal with data for a specific topic. And there are industry initiatives, especially from the European Union. There’s Gaia-X, which is quite prominent and tries to build a sovereign data infrastructure for the EU. And I’m working on a project that’s more or less an interconnection between the two. It’s called FAIR Data Spaces, and we try to build bridges between the science side and the industry side.

03:36 So what is a data space? Data spaces have been defined multiple times, with multiple phrasings. But the best definition currently available is the OpenDEI definition, which states that a data space is a decentralized infrastructure for trustworthy data sharing and exchange in data ecosystems, based on commonly agreed principles. There are three important points to take from this. The first one is decentralization. We don’t want to have one big pool where we throw in all the data from everyone and then analyze it. We want to keep the data where it’s generated. This has benefits but also drawbacks. When you store the data close to where it’s generated, the data producers have full control over access and can decide who and what gets access to that data, and under which conditions that access is granted.

04:42 This also enables a very heterogeneous ecosystem. You can have different file formats, different storage formats, different storage methods. One person uses S3, the other person uses a standard file system, and you can integrate existing infrastructures, which is otherwise quite hard when you have a very diverse infrastructure. And this also enables multi-vendor operations: you can have one participant on Google Cloud and another on AWS and integrate both of them. The main drawback is that there is a lot of overhead in synchronizing all these data sources across different platforms, and there’s also a lot of overhead regarding traffic. Ingress and egress are always a problem because you need to ship more of the data around between the different locations to make common analysis possible.

05:37 The second thing I talked about was trustworthy data sharing. To have a trustworthy system, you need a certain kind of governance: you need to agree, at least as common ground, on some principles that should be used by all participants of a data space. This can be very heterogeneous; there can be a common set for everyone, and sub data spaces or subsystems for certain domains that want additional requirements fulfilled. The second thing is authentication. You need to know who you’re dealing with; everyone needs to know who the participant they’re interacting with is. And the last thing, which is quite important, is sovereign authorization. The authorization part should be done on the data producer side, on the data owner side. They should decide who gets access, not a common cloud infrastructure that decides for them.

06:45 There are certainly a lot of principles available to choose from. In our case, the ones that are quite important from a scientific perspective, and also for certain governmental aspects, are the FAIR principles. There was a very famous paper around six to eight years ago in which Mark Wilkinson introduced the FAIR principles of data sharing. FAIR stands for findable, accessible, interoperable, and reusable. Findable means you need to know where the data is, or at least who to contact to get to the data. Accessible means there must be a way, using standards everyone agrees on, to access the data. It doesn’t necessarily mean that everyone gets access to all data. It doesn’t need to be open data. It could also be restricted or somewhat confined access. Then there’s interoperable: you should use systems that are interoperable with other systems, and use interfaces that are popular and heavily used, because otherwise no one will participate. This is the main challenge here. And then there’s reusable. When we think about scientific data, we want to reuse the data in 10, 20, 30 years, and maybe with a completely different research question. So we need to make the data available and describe it in ways that allow it to be accessed in 10, 20, or 30 years.

08:21 Now I want to talk a little bit about which cloud-native tools can help us build such systems. The status quo on our side, before we started building something, was something like this. We had different domains. We had some biodiversity domains that had environmental data and, for example, data about the locations of plants from certain endangered species. We had genetic data from certain sequencing facilities, and we also had some industry partners that had different data from all over the world, and they all used different standards. One used a file system, another used an object storage, the third one FTP, or whatever. And so accessing and sharing the data is quite a big challenge, because you need to be compatible with everyone. So we had a plan, and we tried to build something that makes this easier, especially for researchers and participants, not necessarily for us. First of all, we established governance between all participants and agreed on certain principles. What was agreed on was to have a certain limited set of metadata: a title, an author, a description of what the data set is about, some labels similar to Kubernetes labels, and some technical information. How big is it? Which file format is it? Where can I look for it? And all this kind of stuff. It can also give a broad overview of the conditions under which access can be made possible.
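
The limited metadata set described here can be pictured as a small record type. The following is a minimal Python sketch, not the project's actual schema: the class name `DatasetMetadata`, every field name, and the `matches_labels` helper are hypothetical, and the label matching only imitates the equality-based selectors known from Kubernetes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical minimal metadata record; all field names are illustrative."""

    title: str
    author: str
    description: str
    # Free-form key/value pairs, similar in spirit to Kubernetes labels.
    labels: dict = field(default_factory=dict)
    # Technical information: how big, which file format, where to look.
    size_bytes: int = 0
    file_format: str = ""
    location: str = ""
    # Broad, human-readable overview of how access can be obtained.
    access_conditions: str = ""


def matches_labels(record: DatasetMetadata, selector: dict) -> bool:
    """Return True if every key/value pair in the selector is set on the record."""
    return all(record.labels.get(key) == value for key, value in selector.items())
```

A search index over such records could then filter by a selector such as `{"domain": "biodiversity"}` without ever touching the underlying data.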

10:09 The next thing was that we agreed to choose S3 as our common interface language for exchanging data, because when we ran a poll and asked everyone, most people already knew how to deal with it, and we didn’t want to build our own clients and all this kind of stuff. We wanted a language that everyone can integrate very easily into their systems. And there are some standards for negotiation, for example for contracts. There is ODRL, the Open Digital Rights Language, from the semantic web community, in which you can express and agree on certain conditions and contracts. And there is the International Data Spaces Association, which has also already defined some principles and standards for what a data space can look like and how it can be built. From an architectural standpoint we chose, obviously, Kubernetes, not least because it’s very widely adopted: many people know how to deploy a Kubernetes cluster, and almost every institution has some contact with it when they do anything even slightly cloud related.

11:24 The next thing we chose: we wanted to have one database that at least stores the metadata across all the participants. We don’t really see the metadata as something restricted, so most metadata is publicly available and everyone can search it. Having access to the data is another topic. We have a central search index where everyone can search this metadata database across our locations and see where the data can be found. And lastly, there’s authentication. We wanted a system that everyone agrees on for authentication. When we looked at Gaia-X and at our research data infrastructure, OpenID Connect was the obvious choice because everyone is using it, so it was the easiest choice for us to integrate. We then expanded this with policy solutions, and here we chose either Open Policy Agent or the Common Expression Language, which is developed by Google, to have an easier interface for writing simple expressions that decide whether someone gets access to the data or not.
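
To illustrate the idea of a simple owner-side access expression, here is a minimal sketch in plain Python. It is neither Open Policy Agent nor CEL, only a stand-in for the kind of rule either could express. The `AccessPolicy` class, its attributes, and the `groups` claim are assumptions made for this example (`iss` is the standard OpenID Connect issuer claim), and the sketch assumes the token signature has already been verified elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy defined by the data owner; names are illustrative."""

    trusted_issuers: frozenset
    required_group: str


def is_allowed(policy: AccessPolicy, claims: dict) -> bool:
    """Evaluate already-verified token claims against the owner's policy."""
    # "iss" is the standard OIDC issuer claim.
    if claims.get("iss") not in policy.trusted_issuers:
        return False
    # "groups" is a common but non-standard claim; it is an assumption here.
    return policy.required_group in claims.get("groups", [])
```

Because the function runs where the data lives, the decision stays with the data owner rather than with a shared cloud service.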

12:39 Altogether, we call this a FAIR and domain-agnostic data space, because we want it to be not domain specific. There are lots of data spaces available for mobility, for health, and for automotive, and they all have certain standards agreed on for their specific domain. But having this open to a broader audience was quite important for us, because our partners are very diverse and we can’t really fit all of them into one schema. Our architecture looks something like this. Every domain deploys their own Kubernetes cluster. This cluster hosts a part of the decentralized NewSQL database, and the clusters exchange data via a common standard and build a multi-cluster service mesh via Istio. This way they can exchange the data, at least the metadata, over the Istio interface, and they can choose which parts they want to give access to.

13:47 This can be built multi-tenant. It’s quite hard to do so, but it can be done, and every domain connects their own data sets via connectors to the whole system. The backend where the data is stored can be quite diverse; it could still be a file system or an FTP server or whatever. So everyone has their own data, and when we want to make a transaction, it looks something like this. In domain C we have a quite prominent example. There is a construction company that wants to do a large construction project, and when you do this, at least in the EU, you need an assessment of whether there are endangered species at the place where you build, and this kind of stuff. So someone comes to us and says, okay, I want to have data about a certain endangered species,

14:42 for example, the plants that are stored in domain B. So they ask the system, and the system answers: okay, you need to ask domain B; they know more about plants and about specific endangered species of plants. Then an exchange with domain B starts. It begins with a request: can I have the data? Afterwards, most of the time, domain B answers with certain conditions that need to be fulfilled. For example, you can have access when you pay a certain amount of money, you can have access only for a limited time frame and only for your specific construction project, and you can’t share the data with anyone else. Data about native species is quite sensitive, because when you publish it to the open world, there will be no native species left after a while, because everyone goes looking for the data and for the species.

15:42 When the conditions are acceptable, the person can make a decision and say, okay, I want to have the data under these conditions we agreed on. Then a contract is made and formalized, which everyone can sign, and both sides can use the contract to get access to the data. This looks something like this. There is now another request to the data provider, that is, to the proxy component that stores or manages the data. The proxy component internally evaluates whether there is a contract in place. When there is a contract in place and all the conditions are met, it decides, okay, you can have access to the data, and the person receives the data. All of this is built into an event-driven architecture. I think everyone wants something like this, because when you upload data or put data into such a space, you want automated systems that do validation, transformation, evaluation, whatever.
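
The contract check performed by the proxy component can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation and not an ODRL model: the `Contract` fields and the `proxy_grants_access` function are hypothetical, and only three of the conditions mentioned in the example (the agreed consumer, the specific purpose, and a limited time frame) are modelled.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Contract:
    """Hypothetical formalized agreement between a provider and a consumer."""

    provider: str
    consumer: str
    dataset: str
    purpose: str            # e.g. one specific construction project
    valid_until: datetime   # access is limited to a time frame
    signed_by: frozenset    # parties that have signed so far


def proxy_grants_access(contract: Contract, requester: str, dataset: str,
                        purpose: str, now: datetime) -> bool:
    """Return True only if a fully signed contract covers this exact request."""
    fully_signed = {contract.provider, contract.consumer} <= contract.signed_by
    return (
        fully_signed
        and requester == contract.consumer   # no sharing with anyone else
        and dataset == contract.dataset
        and purpose == contract.purpose
        and now <= contract.valid_until
    )
```

Passing the current time in as an argument keeps the check deterministic and easy to test; a real proxy would also have to verify the signatures themselves.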

16:49 For this we use NATS, so NATS.io, as our message queue. Every action on every data set and every stored object triggers a message, and afterwards we can use these messages to trigger secondary workflows, validation processes, and all this kind of stuff. For this, we use Argo Events as our bus to integrate all the other storage methods. Then we have some workflow integrations. Via Argo Events, we get a lot of workflow integrations for free. We use Argo Workflows, for example, because it’s quite a good fit for Argo Events. But we also use some serverless stuff like OpenWhisk, or our own HPC infrastructure via Slurm.
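
The pattern of "every action triggers a message, and messages trigger secondary workflows" can be shown with a tiny in-process publish/subscribe bus. This sketch deliberately does not use the NATS client or Argo Events; `EventBus`, `validate_upload`, and the subject name `dataset.object.created` are all hypothetical and only mimic the shape of the idea.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process stand-in for a message queue; not a NATS client."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, subject: str, handler) -> None:
        """Register a handler that runs for every message on the subject."""
        self._handlers[subject].append(handler)

    def publish(self, subject: str, event: dict) -> list:
        """Deliver the event to all subscribers and collect their results."""
        return [handler(event) for handler in self._handlers.get(subject, [])]


def validate_upload(event: dict) -> str:
    """Hypothetical secondary workflow: reject empty uploads."""
    return "valid" if event.get("size_bytes", 0) > 0 else "rejected"
```

Subscribing `validate_upload` to a subject and publishing an upload event on it runs the validation without the uploader knowing the validator exists, which is the decoupling the event-driven design is after.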

17:45 All of what I’ve talked about is something like the base layer. What I haven’t talked about much is the cloud management layer and the interaction between the different participants: how they build a peer-to-peer network to exchange the data itself. They are all interconnected with each other and they share the data, and you can say, okay, I want to have three replicas of my data in different locations, when everyone agrees to it. This way we get a framework we call data orchestration, similar to Kubernetes, which of course is container orchestration, where we have data in different locations and can optimize this from different perspectives. All of this is used to build secondary products on top. So it is mostly not the user directly interacting with this; there are portals, user applications, and services for transformation, validation, mediation, and all this kind of stuff.

18:46 There are also secondary semantic databases that are specific to a certain domain, to annotate additional metadata for certain data sets. The benefit of this is that the whole infrastructure side is completely abstracted away. Most researchers just need to know how to deal with S3, not with Kubernetes. The layered approach allows everyone to have full control over their data, and the data is stored decentralized, more or less where it is generated. For multi-cloud operation this is quite beneficial, and the system itself can optimize for certain aspects like environmental impact or data privacy regulations. So you can only process the data when you are in the EU or in Germany or whatever, or you only process it in certain cloud environments because it’s cheaper to do so in the long run.
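
The data orchestration idea, choosing where replicas live under constraints such as region or cost, can be sketched as a small placement function. This is only an illustration: the `Location` type, its fields, the cost figures in any usage, and the simple "filter by region, then pick the cheapest" strategy are assumptions made for this example, not the scheduler the speaker's system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical storage location offered by one participant."""

    name: str
    region: str          # e.g. a country code
    cost_per_gb: float   # illustrative cost figure


def place_replicas(locations: list, replicas: int, allowed_regions: set) -> list:
    """Pick the cheapest locations that satisfy the region constraint."""
    eligible = [loc for loc in locations if loc.region in allowed_regions]
    if len(eligible) < replicas:
        raise ValueError("not enough eligible locations for the requested replicas")
    # Each location appears once, so the replicas end up in distinct places.
    return sorted(eligible, key=lambda loc: loc.cost_per_gb)[:replicas]
```

Swapping the sort key for, say, an estimated carbon footprint would optimize for environmental impact instead of cost while keeping the regulatory filter unchanged.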

19:45 This also enables reproducibility and reusability of the data itself, because the data is more or less structured with a basic set of metadata. There are also some challenges. I think the main challenge is the human part. Getting people involved is the greatest challenge for us, because building trust between participants is not that easy when no one wants to agree on certain principles. I think we made good progress there, but trust is quite a problem. Ensuring confidentiality and security is also a problem, because we need to rely on certain centralized infrastructures. Mediation between different metadata formats is also quite tricky, because everyone has their own ontology, and having multiple ontologies is quite a problem. The legal side is another thing. The legal implications are more or less country specific. We have some people in our project who are lawyers and deal with this, but it’s quite important, and tricky, to have a system that can be used all over the world.

21:05 Okay, let’s come to the outlook. Data spaces enable sovereign data exchange between participants. Cloud-native tools can be a good choice for this because they are quite heavily involved in and heavily integrated into existing systems. There needs to be future open source work to build the glue around this and to integrate these tools with each other. We are currently working on our open source solution where we try to build this. It’s called Aruna, and it’s heavily work in progress. It’s not really finished yet, but we are building it to make it easier to integrate different solutions. It’s an orchestrator and a data connector for integrating existing data solutions, similar to what MinIO did a few years ago and then unfortunately deprecated. If you want to give feedback, there is a QR code for that. There’s also a poster session tomorrow that you can attend, and we can have a chat about all this kind of stuff. And if you want to visit our open source project or want to participate, feel free. We are always happy to have participants and people who want to build such futuristic data ecosystems. Thank you.

22:31 Questions?

Speaker 2: 22:37 Thanks for the talk. Have you considered using open standards instead of proprietary protocols, such as AMQP 1.0 instead of NATS, or finding some replacement for S3, which I think is proprietary as well, although adopted by many projects?

Speaker 1: 22:57 Yes, we considered this. For our message queue, it was more or less an internal thing at first. As for S3, the protocol itself is not really open source, but there are so many open source libraries for it that we consider it to be open. The problem was that we needed a standard that everyone can use right now and with which everyone can participate right now. Data exchange standards are quite tricky when you want to cover the whole stack. But yes, we talked about this, and we can also evaluate other standards; this is just what we have agreed on. When other people build their own data space, they can agree on different standards. Another question?

Speaker 3: 24:01 Thank you for the talk. You said you use S3 or object stores as a standard. That’s a technical one. Have you had any plans to standardize at a higher level, so that you say, we use specialized data file formats or something like that? And if not, how do you ensure interoperability between the data spaces?

Speaker 1: 24:27 Yes. We are currently integrating that as well. There is the International Data Spaces Association standard for exchanging data between different data spaces already in place, and most Gaia-X data spaces are interoperable with this standard. As I’ve said, we use S3 because it’s widely available, everyone has a client for it, and we don’t want to build our own client. But yes, we are always thinking about new formats and data storage formats. I don’t know where,

Speaker 3: 25:05 I mean, you can store a CSV file in an S3 bucket as well as a Parquet file, or even a data lake or Apache Iceberg or something like that.

Speaker 1: 25:16 Yes. The problem is that our data inputs are very heterogeneous and don’t really fit these formats that well. There are pictures, there are videos, there’s all kinds of data, and we don’t want to be restricted by the data format itself. We wanted a format where everyone can put their data, whatever format they currently have in place. This was our focus, because we wanted to be as agnostic to the data as possible. But this obviously has drawbacks, and currently our solution is to have additional abstractions on top that label the data as a certain format or specification, and then ingest it into secondary systems that make it available via specialized engines like Parquet or something like that.

Speaker 3: 26:07 Okay. Thank you.
