Published January 20, 2022
Smart Cities represent complex and challenging environments that generate an overwhelming amount of data to ingest, transfer, prepare and store.
In this presentation – which is based on a fictitious but potential real world scenario in the city of London – we will see how you can leverage different data engineering patterns to create a fully automated edge-to-core data pipeline for interesting use cases which include license plate recognition and facial recognition.
This Smart Cities use case integrates computer vision, Apache Kafka, S3 object storage, data aggregation, and real-time and batch analysis in a fully cloud-native containerized environment.
Attendees will develop a better understanding of how to run AI/ML models in production, how to automate data engineering workflows, and how to build a resilient production-grade data foundation to support their business use cases.
0:01 Karan Singh
Hi, I am Karan Singh, senior architect at Red Hat. In this presentation, I will walk you through the Data Engineering Jumpstart Library, an end-to-end functional use case to jumpstart data engineering productivity on top of OpenShift and Kubernetes. The Data Engineering Jumpstart Library provides data engineers with sample working solutions to familiar industry problems using a modular pattern approach. This is all developed and deployed on Red Hat OpenShift. Data engineers get a GitHub repository with all the workshops and guides that can help you deploy a specific solution on OpenShift. The repository contains a functional demo, Python glue code that you can always change, and some OpenShift YAML files which will help you deploy the components and tools required for the demo to work. It also contains some sample data and some sample pre-trained machine learning models. For the enthusiasts out there who are interested in automation, we have also packed in some automated deployment scripts which will help you deploy these solutions really quickly. The intention is that you will get to do some data-intensive work on top of Kubernetes. These solutions are examples to help you deploy on it. It’s always better than a HelloWorld YAML file or container running on OpenShift. You will get to work with a lot of different tools using this Data Engineering Jumpstart Library. The overall vision for the Jumpstart Library, at least for this year, is to cover multiple industry verticals, identify some use cases, and show how we can solve them using this technology. For healthcare use cases we have identified X-ray diagnosis. For government use cases we have identified smart city and green city, and for industrial we have so far completed factory floor predictive maintenance. We are also working on some more industry vertical use cases like financial and oil and gas, but those are yet to be worked on.
Typically we start with a business problem: we identify the business problem and address it by building an end-to-end solution. For healthcare we have identified a few business problems, like how we can enable faster pneumonia diagnosis, how to focus practitioners’ time, and how to increase case throughput. These are some initial business problems we started with and developed the solution for. In the other industry verticals, like government smart city and green city, we have identified how we can reduce congestion and pollution in the city and how we can locate wanted vehicles. For industrial use cases we have identified business problems like how we can reduce factory downtime, how we can improve product quality, and how we can reduce maintenance costs. With these problems as our base, we built a solution on top of OpenShift using a lot of different tools.
While working on this variety of industry problems and use cases, we have noticed one thing: a very common set of patterns emerges across the use cases. These are modular, reusable patterns that we see in each solution. For example, triggering a Kafka event when an object is ingested into object storage, or triggering a serverless function as soon as the Kafka event is created. These are some examples of modular patterns that we have seen which are common across industry verticals, and we are pretty sure that as we keep adding more use cases, we will see more of these modular patterns coming out that we can deploy across different use cases. For the sake of understanding, you can consider these as Lego bricks. You can combine, mix and match these different modules to build up a solution that works for you. I’m going to describe these modules later in this presentation.
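As a rough illustration of the first pattern (an object landing in object storage produces an event that something downstream reacts to), here is a minimal Python sketch of the receiving side. It assumes the notification payload follows the S3 event notification layout (`Records`, `eventName`, `s3.bucket.name`, `s3.object.key`); the function name, bucket name, and object keys are hypothetical and not taken from the Jumpstart Library.

```python
import json


def extract_new_objects(notification_json):
    """Return (bucket, key) pairs for every object-created record.

    Assumes an S3-style bucket notification document; other event
    types (for example deletions) are ignored.
    """
    payload = json.loads(notification_json)
    new_objects = []
    for record in payload.get("Records", []):
        # Match on the substring so that optional prefixes in the
        # event name do not matter.
        if "ObjectCreated" not in record.get("eventName", ""):
            continue
        s3_info = record["s3"]
        new_objects.append((s3_info["bucket"]["name"], s3_info["object"]["key"]))
    return new_objects


# Hypothetical notification: one upload and one deletion.
sample_notification = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "camera-images"},
                "object": {"key": "station-a/img-0001.jpg"},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "camera-images"},
                "object": {"key": "station-a/img-0000.jpg"},
            },
        },
    ]
})

for bucket, key in extract_new_objects(sample_notification):
    print(f"new object: {bucket}/{key}")
```

In a real deployment the notification would arrive as the value of a Kafka message, and the loop body would hand the object reference to whatever function should process the new image.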
Next, we’ll go through a Smart city data pipeline demo, which represents an edge to core machine learning solution on top of OpenShift that uses Kafka and a lot of different tools.
Let’s understand the business requirements for smart city and green city. As I mentioned before, the requirements were how we can reduce congestion and pollution, locate wanted vehicles, and notify government officials in case there is a stolen vehicle. We can also charge an extra fee for a vehicle that does not meet emission standards. These are fictional scenarios that we have created and built a solution on top of. What you’re going to see here is that we have solved this using multiple open source technologies: Kafka, Ceph, Superset, Open Data Hub, and obviously Kubernetes, because it is all deployed on Kubernetes and OpenShift using Ansible, as well as Trino and others. I call this the Smart City Green City TechSoup. In the coming slides and in my demo you’re going to see how these multiple technologies work together, with all of the data on Kubernetes. Before we go into the demo, let’s understand part one of the solution design. This use case consists of ultra low emission zones, where we have multiple cameras installed within a certain boundary of a city. Here we have chosen London because London has an ultra low emission zone, and at each station we have multiple cameras capturing the live feed coming in from the traffic.
We are recognizing all of the cars that pass by these toll stations. As soon as a car passes the toll station, we capture its number plate, extract the string from it, append some metadata like timestamp and location to each event, and transmit that from the edge location (we have multiple edge locations) to the core data center for more computing and analytics. That’s part one of the solution.
The second part is how we can collect the data from the edge or multiple edge locations where we have deployed Kubernetes or OpenShift. While moving this data from the edge to the core, on the core we are doing some machine learning operations and retraining the model as the data is coming into the system to improve the efficiency of the model. Then we deploy back the upgraded and new version of the model onto multiple edge locations. That’s part two of the solution design.
Overall, it looks like this: there are multiple OpenShift or Kubernetes edge locations spread across the city of London, and we have multiple cameras watching for the cars passing by the toll stations. Here we see our first pattern, which is inferencing at the edge: how we can deploy a machine learning model on Kubernetes that will do inferencing at the edge to detect the license plate and recognize its number. When a new image enters the system, we run an LPR (License Plate Recognition) model which extracts the region of the image that contains the number plate and then passes this sub-image into an Optical Character Recognition model, which produces the license plate string. Once we have the string from the license plate, we add some more metadata like location and timestamp to it.
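To make the last step concrete, here is a minimal sketch of how the recognized plate string could be wrapped with location and timestamp metadata into a JSON event. The field names (`license_plate`, `station_id`, `timestamp`) and the function name are assumptions for illustration; the talk does not specify the actual schema.

```python
import json
from datetime import datetime, timezone


def build_plate_event(plate_text, station_id, captured_at=None):
    """Wrap an OCR result in a JSON event, encoded as UTF-8 bytes."""
    # OCR output may contain stray spaces and mixed case, so normalise it.
    plate = "".join(plate_text.split()).upper()
    if captured_at is None:
        captured_at = datetime.now(timezone.utc)
    event = {
        "license_plate": plate,
        "station_id": station_id,  # hypothetical location identifier
        "timestamp": captured_at.isoformat(),
    }
    return json.dumps(event).encode("utf-8")


example_event = build_plate_event(
    " ab12 cde ",
    "station-a",
    datetime(2022, 1, 20, 9, 30, tzinfo=timezone.utc),
)
print(example_event.decode("utf-8"))
```

Returning bytes keeps the function independent of any particular Kafka client: the result can be passed as the message value to whichever producer is running at the edge.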
Once we have the license plate string from the image, we have to move this data from the multiple edge locations to the core location. For that we are using a Kafka producer, which is deployed at the edge location and puts data onto a Kafka topic running locally on Kubernetes and OpenShift. Here we see another pattern, which is how we can efficiently move data from multiple edge locations to the core data center. We’re using a Kafka tool called Kafka MirrorMaker, which helps you move data from a source Kafka cluster to a destination Kafka cluster. In our case, we are moving the license plate number strings, as well as the additional metadata, in the form of JSON from the source Kafka cluster to the central Kafka cluster running on OpenShift. Once we have the data on the central Kafka cluster, we can do lots of interesting things with it. We can create our own custom consumers with some business logic, like “Hey, I want to detect some of the lost vehicles in real time and notify the government officials”. Or we can say “I want to store this data for long-term preservation in object storage”, where we can use a tool like Secor, an open source Kafka consumer that moves data from the Kafka cluster and stores it, in the format you’re looking for, in an S3 object storage bucket.
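The "custom consumer with some business logic" idea can be sketched independently of any Kafka client. The function below takes any iterable of raw message values and yields the events whose plate is on a wanted list; the `license_plate` field name and the sample plates are assumptions, and in a real consumer the iterable would be the message values read from the central Kafka topic.

```python
import json


def find_wanted(messages, wanted_plates):
    """Yield decoded events whose license plate is on the wanted list."""
    wanted = {plate.upper() for plate in wanted_plates}
    for raw in messages:
        try:
            event = json.loads(raw)
        except (ValueError, TypeError):
            # Skip malformed messages instead of stopping the consumer.
            continue
        if not isinstance(event, dict):
            continue
        if str(event.get("license_plate", "")).upper() in wanted:
            yield event


# Hypothetical message values as they might arrive from the topic.
stream = [
    b'{"license_plate": "AB12CDE", "station_id": "station-a"}',
    b"not json",
    b'{"license_plate": "XY99ZZZ", "station_id": "station-b"}',
]

for hit in find_wanted(stream, ["xy99zzz"]):
    print(f"wanted vehicle {hit['license_plate']} seen at {hit['station_id']}")
```

Because it is a generator, the function works on an unbounded stream: each match is emitted as soon as the corresponding message is read, which is what a real-time notification needs.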
Here we are using OpenShift Data Foundation, which uses Ceph to store all the S3 objects in a bucket that can later be used as a data lake, or tools like Spark and Presto can read data directly from the object storage. Finally, once we have collected all the data in Kafka, a database, or S3 object storage, we need to do some analytics and reporting, like fee calculation and real-time reporting. We are using Starburst Trino (formerly PrestoSQL), a distributed SQL query engine that can read data from multiple heterogeneous sources. With the same SQL query, it can read data from a PostgreSQL database and at the same time from S3 object storage like Ceph or ODF (OpenShift Data Foundation). Once this SQL engine runs some queries, we can use tools like Superset and Grafana to build a rich analytical dashboard for the business people. This is how we started from a business problem and decomposed it into multiple modular patterns, and we have implemented it using several open source technologies.
Let us now go into a demo where you can see all of this running. I will switch to my other terminal. This is the open source GitHub repository for the Data Engineering Jumpstart Library. I invite you all to take a look and try to use it. If you see any problems, please open some issues and I will work on them. It’s pretty simple: red-hat-data-services/jumpstart-library is the repository where you can find all these demos and the sample code base. I’ll switch to my OpenShift environment. This is a pre-configured OpenShift environment, and all the tools and services are deployed on this cluster. This is a project in OpenShift called Smart City, where you can see I’m running multiple Kafka clusters. I’m simulating edge and core in the same environment because it’s not real; we have not deployed it at tens of locations in the city of London. It’s a mock-up. We are running multiple Kafka clusters.
And then we are also running several services like Grafana, ODH, Secor, and Superset.
And here we are deploying multiple Kafka consumers, an image server, and generators. This is all a pre-configured environment. If you’re interested, you can quickly jump into the developer dashboard, where you can see some serious compute, data, containers, services, and other moving parts running on OpenShift.
This is the Smart City Ultra Low Emission London Dashboard, for people who are managing the traffic from the central location and want to see how this data looks. The data is coming in in real time: images are coming into the system for this hypothetical situation that we have created, as you can see from this panel, and there is object detection going on, which is detecting the number plate, the model, and the owner of the car (but this is all fake data). This is a panel where you can see the heat map and figure out things like “What is the most popular station in my city?”. Let’s say station A is getting most of the traffic. All these black boxes are simulating the edge locations, where OpenShift and Kubernetes are running, Kafka clusters are running, inferencing is happening at the edge, and they’re all sending data to the central location. Then there is another nice panel named Pipeline Ops-CPU, where you have a graph showing CPU usage and how it looks. So, we have edge locations and core locations, and we have Smart City inferencing happening at the edge as data comes in. Via Kafka, the data moves to the core Kafka cluster using MirrorMaker, and then from Kafka the data moves to Ceph object storage using Secor. Once the data is in object storage or in our database, we can use tools like Starburst Presto to analyze and query this data.
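To illustrate the kind of query this enables, the sketch below joins a vehicle registry with the plate events in a single SQL statement and computes fees per station. It uses Python's built-in `sqlite3` with two attached in-memory databases purely as a stand-in for two separate sources (for example PostgreSQL and S3 object storage); it is not Trino, and the table names, columns, and fee amounts are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two attached databases stand in for two independent data sources.
conn.execute("ATTACH DATABASE ':memory:' AS registry")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute(
    "CREATE TABLE registry.vehicles ("
    "license_plate TEXT PRIMARY KEY, meets_emission_standard INTEGER)"
)
conn.execute("CREATE TABLE lake.plate_events (license_plate TEXT, station_id TEXT)")

conn.executemany(
    "INSERT INTO registry.vehicles VALUES (?, ?)",
    [("AB12CDE", 1), ("XY99ZZZ", 0)],
)
conn.executemany(
    "INSERT INTO lake.plate_events VALUES (?, ?)",
    [("AB12CDE", "station-a"), ("XY99ZZZ", "station-a"), ("XY99ZZZ", "station-b")],
)

TOLL_FEE = 5.0        # made-up fee charged for every passage
POLLUTION_FEE = 12.5  # made-up extra fee for non-compliant vehicles

FEE_QUERY = """
SELECT e.station_id,
       COUNT(*) AS passages,
       COUNT(*) * :toll AS toll_fees,
       SUM(CASE WHEN v.meets_emission_standard = 0
                THEN :pollution ELSE 0 END) AS pollution_fees
FROM lake.plate_events AS e
JOIN registry.vehicles AS v ON v.license_plate = e.license_plate
GROUP BY e.station_id
ORDER BY e.station_id
"""

fee_rows = conn.execute(
    FEE_QUERY, {"toll": TOLL_FEE, "pollution": POLLUTION_FEE}
).fetchall()
for station_id, passages, toll_fees, pollution_fees in fee_rows:
    print(station_id, passages, toll_fees, pollution_fees)
```

In a federated engine the two table references would instead be qualified by catalog, so the same join shape can span a relational database and files in object storage without copying the data first.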
Let us move on to the second dashboard. This is an Apache Superset dashboard, which is a part of Open Data Hub. Open Data Hub is a pretty interesting project for all the data scientists and data engineers out there. Go take a look; there are a lot of tools and a lot of bindings available. We have deployed an Open Data Hub operator on OpenShift, which gives us the capability of using Superset, Grafana, Presto, and many other tools. This is a live reporting dashboard. Gone are the days when you had to run batch processing queries over the weekend, or maybe at month or year end, right? With Kubernetes, data on Kubernetes, and the way different tools work on Kubernetes, all of this is possible in real time. This is just a sample panel over the same data sources that we have collected. If I refresh this, we will see some numbers change. Yes, the numbers changed. We have a toll fee collection of £392,000, and the total pollution fee collection is £182,000. Again, all of these are made-up numbers. I have a generator which is producing this random data, but you get the idea, right? We are capturing the data, we are simulating this traffic, and we are calculating fees on the vehicles. The idea here is to show you that these tools are available and can help you make decisions based on data, and Kubernetes is of course the data plane for this. You can deploy these tools and make sure that your business is making intelligent, data-driven decisions, and Kubernetes is of course ready for this. You should be deploying some data-intensive workloads on Kubernetes. With the Data Engineering Jumpstart Library, we are trying to make sure that you have a variety of use cases, anything better than “Hello World” YAML files on OpenShift or Kubernetes. So please take a look at this Jumpstart Library project. You will find it interesting. Thank you so much for your time today. Cheers, have a nice day.