Build your own social media analytics with Apache Kafka

Apache Kafka is more than just a messaging broker. It has a rich ecosystem of different components. There are connectors for importing and exporting data, different stream processing libraries, schema registries, and a lot more. In this talk, Jakub Scholz, Senior Principal Software Engineer at Red Hat, shows how to use Kafka to read data from social networks such as Twitter, process it, and analyze it with machine learning, with everything running on top of Kubernetes.

Jakub Scholz  00:00

So my name is Jakub Scholz, and I work for Red Hat. What I work on most of the time is the CNCF project called Strimzi, which is all about running Apache Kafka on Kubernetes, and that's of course what I will use during this talk today as well. I hope all of you have heard about Apache Kafka, but I think it's quite important to understand that it's more than just a messaging broker. It's an event streaming platform, and it has a great ecosystem of different components and tools. Some of them are part of the Apache Kafka project itself, but others are third-party components, integrations, connectors, and so on, which all pack nicely with Kafka and can all be used together. So overall, Kafka can quite easily handle three different areas when it comes to working with data.

It can do the basic messaging job of delivering messages, but it can also store them, including as long-term storage. Especially if you use patterns such as event sourcing, you can store your events for years if you want.
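As a concrete illustration of the long-term storage point: retention in Kafka is configured per topic. Here is a minimal sketch, assuming a Strimzi-managed cluster called my-cluster (the KafkaTopic resource type comes up again later in the talk); the topic name and sizing are illustrative, not from the demo:

```yaml
# Hypothetical topic whose records are kept indefinitely, e.g. for event sourcing.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: events
  labels:
    # ties the topic to the Strimzi-managed cluster (assumed name)
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: -1      # never delete records based on age
    retention.bytes: -1   # never delete records based on partition size
```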

But it can also handle integration. It has a component called the Kafka Connect API, which focuses on integration with other systems: importing messages from other systems into Kafka, or exporting them from Kafka into other systems. And it also has its own stream processing library, called the Kafka Streams API. You can do stream processing there, including stateful operations, joins, and so on. So it's really quite a powerful library.

And here is what we will use these things for in my talk today. We will use the Kafka Connect part together with connectors from another Apache project, Apache Camel, which is quite well known as well and provides hundreds of different integrations. One way you can use Apache Camel is as connectors inside Kafka Connect. So we will use this connector to connect to the Twitter APIs and search Twitter for tweets. When we find tweets matching our criteria, we will take each tweet and pass it as a message to our Kafka broker. From there, we will pick it up with our stream processing application, which will read the tweets and do some sentiment analysis on them using the Deep Java Library. So we're basically applying machine learning to decide whether the sentiment of the messages is positive or negative. The application will identify these messages and automatically retweet them by sending them back to the Kafka broker, where again Kafka Connect will pick them up and export them back to the Twitter APIs, and you will be able to see them as tweets. And of course, because we are at Data on Kubernetes Day, all of this will be running on top of Kubernetes, and it will use the Strimzi operators.

So let's have a look at how it works and how we can get it running. If you want to take part in this demo, you can just tweet something and use the hashtag #BYOSMA, as in build your own social media analytics. If you use this hashtag, the sentiment analysis application should automatically pick it up and analyze it, and you might see your tweet in this demo.

So let's first switch to the command line and check what we have running already. As you can see, I already deployed everything to save time on downloading the container images and starting things. I have the Kafka cluster running here, I have the ZooKeeper cluster running here, here I have my Kafka Connect server as well, and here is the sentiment analysis application which will be analyzing the tweets. The important part here is the Strimzi operator: if you follow how operators and the operator pattern work, that's what's really running and managing all of these components. And to plug into the previous talk, you can of course install it using OLM and operatorhub.io as well.


Now, just to briefly introduce how the operator works: you basically create custom resources like these, where you specify the whole Kafka cluster, including the resources, Java configuration, listeners, and security. As you can see, this is just a small deployment running on my home cluster here. I can also configure authorization, authentication, storage, and observability in the form of tracing or Prometheus metrics, and when you deploy it, the operator takes care of the rest. Basically everything I'm showing here is done in a declarative way and can be done using things such as GitOps. You can do exactly the same for Kafka Connect. Because Kafka Connect connects to the Kafka broker as an external application, I first have to create this KafkaUser, which it can use to authenticate, and I can specify the authorization rules. Then I can specify the Kafka Connect deployment itself, where I declare which connector plugins should be loaded into my Connect deployment (remember the ecosystem thing and downloading third-party plugins), and the operator will automatically put together a new container image for me with these plugins and automatically deploy it.
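For reference, here is a trimmed sketch of what such resources can look like. These are not the exact manifests from the demo (those are in the repository linked at the end of the talk); the names, sizes, and the plugin artifact URL are illustrative:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls          # mutual TLS for client authentication
    authorization:
      type: simple
    storage:
      type: persistent-claim
      size: 100Gi
    # tracing / Prometheus metrics configuration would go here as well
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect
  annotations:
    # let KafkaConnector custom resources manage connectors in this cluster
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1
  bootstrapServers: my-cluster-kafka-bootstrap:9093
  tls:
    trustedCertificates:
      - secretName: my-cluster-cluster-ca-cert
        certificate: ca.crt
  authentication:
    type: tls
    certificateAndKey:
      # TLS client certificate created by the User Operator for the KafkaUser
      secretName: my-connect-user
      certificate: user.crt
      key: user.key
  build:
    output:
      type: docker
      image: registry.example.com/demo/connect-with-plugins:latest  # illustrative
    plugins:
      - name: camel-twitter-search-source
        artifacts:
          - type: tgz
            url: https://example.com/camel-twitter-search-source.tar.gz  # placeholder
```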


And then I can deploy the actual connectors as well. So I create a topic called Twitter inbox, and then I create this KafkaConnector, which will use the credentials to connect to the Twitter API and search Twitter for the hashtag I mentioned. Whenever a tweet with this hashtag is published, it will automatically pull it from the Twitter API and send it as a message into the Kafka broker. And I have a similar connector for outgoing messages as well, which tweets the messages it finds. The last piece I need to build my own social media analytics is the sentiment analysis, where again I create a KafkaUser, and then it is just a regular deployment. One thing which makes the Kafka Streams API a bit special is that it's not some complicated framework with workers and servers; it's really just a Java library which you can include in your application. In my case, I'm using the Quarkus framework, but you can use plain Java and so on, and you just build it and deploy it.
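Again for reference, here is a sketch of how the topic and the source connector could be declared. The connector class and configuration keys are assumptions based on the Camel Kafka Connector naming conventions, not necessarily the exact ones used in the demo:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: twitter-inbox        # illustrative topic name
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 3
  replicas: 3
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: twitter-search-source
  labels:
    # must match the name of the KafkaConnect cluster that runs it
    strimzi.io/cluster: my-connect
spec:
  # assumed class name, following the Camel Kafka Connector naming scheme
  class: org.apache.camel.kafkaconnector.twittersearch.CamelTwittersearchSourceConnector
  tasksMax: 1
  config:
    topics: twitter-inbox
    # illustrative keys: the search term plus the Twitter API credentials
    camel.source.path.keywords: "#BYOSMA"
    # consumerKey / consumerSecret / accessToken / accessTokenSecret
    # would be provided here, ideally pulled from a Kubernetes Secret
```

The outgoing direction works the same way with a sink connector class, and the sentiment analysis application itself is just an ordinary Kubernetes Deployment plus a KafkaUser for its credentials.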


And now that it's running, we can get back to the browser. This is my Twitter account, or one of my Twitter accounts, and I can tweet something. Let's try to tweet something nice: "Data on Kubernetes Day started well, the schedule is full of amazing talks." Hopefully that's positive enough for the sentiment analysis. Now I have to add the hashtag, build your own social media analytics (#BYOSMA), and I can tweet it. And now this Jakubobot account, that's the account of my bot, which I'm using to retweet things. If I press refresh here, you cannot see my tweet yet, but you can see all kinds of other tweets. "Ricard", I hope I pronounced it properly, tweeted here about it, and it was identified as negative. Oh, that's not nice. It looks like the machine learning library could use some more training, because I don't see anything negative in that. But you can see how it's retweeting these tweets, and then it catches up; so this one was positive, great. So we can see how it's catching up with the different tweets and how it's doing the social media analytics.

Now, there was not that much time to go in detail through the Java application doing the analysis and through all the YAML files and so on, but you can find all of them in the first link, which takes you to the GitHub repository with all the sources. And if you want, you can even try to do this yourself at home. Thanks for watching!