How Chick-fil-A runs Kubernetes at the Edge

Aug 28, 2023 by Paul Au

Brian Chambers, Chick-fil-A’s Chief Architect, shared insights into how the company strategically harnessed Kubernetes to tackle intricate challenges posed by high sales volumes and the need for real-time localized solutions. The company’s drive for technology innovation stemmed from improving kitchen operations. When faced with kitchen efficiency challenges, they embarked on an initiative to deploy technology solutions that enhanced real-time visibility and streamlined staff operations.

Brian details how Chick-fil-A uses container-based applications on K3s and relies on managed services for vital functionalities such as messaging, deployment, and authentication. With nearly 3,000 locations, the company demonstrates how a large-scale restaurant chain can creatively harness technology to address operational challenges, refine processes, and elevate customer experiences. It also underscores Kubernetes’ significance as a foundational technology for orchestrating complex architectures in diverse business environments.



Transcript

Hugh Lashbrooke (00:00):

Hello everyone. My name is Hugh Lashbrooke. I’m director of community here at Data on Kubernetes Community (DoKC). It’s great to have you here. This is the August Town Hall. We do these every month on the second-last Thursday of the month. We’ve got a meetup group, https://www.meetup.com/data-on-kubernetes-community, where we list all of these, which presumably you know about ’cause you’re here, but gotta mention it. The recordings of these are up online on our YouTube channel after the event. Not immediately after, since videos need to process, but they’ll be up online afterwards. So yeah, welcome. We’ll be hearing from Brian Chambers today, but I’ll introduce him in a second. First off, I just wanna thank our gold community sponsors, Google Cloud and Percona. If you go to dok.community/sponsors, you’ll see all of our sponsors there. But thank you to everyone who makes this community possible. It’s great to have companies like this on board. Another thing I’m excited to announce is that we recently launched the DoK Community Ambassadors program. We now have 13 ambassadors in the program. If you go to the URL there, dok.community/ambassadors, you can see everyone. These are the eight people who joined this past month.

(01:30):

And yeah, you can have a, you can have a look at them, find out more about them. Ambassadors are people who are gonna be involved in the community. You’ll see them more more active in Slack. introducing welcoming people, starting conversations, getting involved in discussions, that kind of thing. if you have any questions, you can reach out to any of them. Reach out to me always. and yeah, welcome to our eight new ambassadors this month. Please go to that and have a look and learn about, about all of them. And then the final announcement I wanna run through quickly is that we have just some upcoming events. Obviously, firstly, there’s next month’s town hall ’cause that’s happening every month. we’ll be hearing from Travis Nielsen about Rook and using the, you know, helping the community — Kubernetes storage community — thrive.

(02:15):

That’s gonna be very exciting. We’re hoping to add another speaker to that as well, also looking at storage and that kind of thing. So come along to that: same place, same Zoom link, in a month’s time. KubeCon is coming up in November in Chicago, and part of KubeCon is DoK Day. You can register for that now. The event is on November 6th. That’s the link there. It’s open for registration. Go get involved. It’s gonna be a huge event. KubeCon has a whole bunch of co-located events, all managed by the CNCF, and DoK Day is one of those co-located events, all in the same place. So yeah, have a look at that, register, and get there. That’s gonna be very cool.

(02:58):

And then this event is only in March next year, so it’s quite a ways away, but the CFP is open. The Southern California Linux Expo 2024, or SCALE, is coming up in March, but the CFP (call for papers) is open right now. You can apply to speak. That’s open till November 1st, so you can go to that URL to see that, go to the call for presentations, and apply there. They’re having a DoK track specifically. We’re working with them to make that happen. So it’s gonna be a track at the event focused on DoK. So yeah, when you apply, you can just say that’s the track you wanna speak at.

(03:41):

So, very exciting stuff. if you have any questions about that, please ask in Slack, you can, you’re welcome to ping me directly in Slack or just ask in the general channel and someone will answer. So, yeah, that’s, that, that is all the announcements I had, so thank you. I just wanna introduce our speaker for today. We have Brian Chambers here from Chick-fil-A. it is great to have you here. He is gonna be speaking on how they use Kubernetes at scale across the, across the network and organization, which is a massive network and massive scale. So, should be very exciting. I will let him introduce, introduce himself further. So I will hand over to, to Brian over to you. Thanks for joining us.

Brian Chambers (04:26):

Awesome. Well thank you for having me. great to virtually meet everyone. Hopefully you’ll have questions and we can get to know each other even better. So lemme get my screen sharing real quick, which is always the only obstacle we already know. It works, but it’s like, where’s the button when you need it, right? yeah. And, and don’t accidentally hit leave meeting or something like that. Instead that looks different. Now. Where is the share? anybody know, where’s the, where did it go?

Hugh Lashbrooke (05:00):

Share screen. It should be the screen on the bottom.

Brian Chambers (05:02):

It’s, it’s in green, which threw me off. Okay, I got it. share screen. And here we go. Hopefully this will work. hopefully you guys can see that. Okay. I’m gonna get a full screen.

Hugh Lashbrooke (05:14):

Yep.

Brian Chambers (05:15):

All good. Awesome.

Hugh Lashbrooke (05:16):

We got you there.

Brian Chambers (05:17):

Okay, cool. Well yeah, thanks again. so I’ll, I’ll start by saying all of you guys are probably deeper experts on doing things with data in Kubernetes than I am. So I expect while sharing our story, that I’ll probably get some great questions or learn some things from you guys. but want to tell a little bit about what we’ve done at Chick-fil-A with Kubernetes in an interesting way. A number of you guys may have read different articles we’ve published about this. I tried to tailor this to be a little bit different, so I’ll go fast over some stuff that’s contextual. if you have questions about it, feel free to interrupt and ask along the way. but really want to tell you a little bit about our architecture that we do across our restaurants, specifically about 3000 restaurants and sort of how like data and persistence do and, and don’t get implemented.

(06:07):

And maybe some interesting trade-offs that we’ve made that are somewhat unique to us, but can be interesting when you think about architecture, you know, and using Kubernetes in general. so, you know, my name my role just really quickly is I am the chief architect or leader of the architecture practice at Chick-fil-A. so we’ve got a team of, of folks who do like foundational shared technologies across our different groups our customers, our operators and team members our corporate staff and then some stuff in the business architecture domain as well. But we’ll focus more on the tech side for the presentation today. So if you happen to not know Chick-fil-A, I saw a, I can’t see the chat anymore, but I saw a comment that someone was eating Chick-fil-A during this, so approved.

(06:55):

That’s awesome. But if you don’t know us, we’ve got roughly 3,000 restaurants in the US and Canada. About 2,800 of those are run by us, meaning they’re not like a licensee at an airport or something. And so about 2,800 of them would have this infrastructure that we’re gonna talk about today. But we’ll just use the number 3,000 ’cause it’s easier. We’re run by independent owner-operators, with about 250,000 team members. A lot of sales. We’re really busy. We do more sales per footprint in six days a week than our competitors do in the full seven days that they’re usually open. And that matters not as a bragging point, but because it actually drives a lot of the challenges that we face at Chick-fil-A, where we feel like technology is something that can really help us effectively address them.

(07:39):

so another stat to that end is that a lot of our restaurants are doing like, way more volume than what the physical footprint was actually designed to handle initially. So we’ve found ways to keep growing, you know, beyond just increasing prices or something like that to grow with the volume and the demand that’s coming from our customers. And that, of course has also made us number three in, in terms of US restaurant chains behind McDonald’s and Starbucks. So, a little bit about us, just so you know, if you weren’t familiar with Chick-fil-A you know, go visit a restaurant if we are, we’re in your area. we’re not in New Zealand sorry about that, but US and Canada at the moment, and Puerto Rico as well.

(08:17):

All right. So I wanna talk a bit about edge computing today and edge computing, really just in short putting you know, the, the compute capability and the workload as close to the users as is necessary to create the experience that you want to be creating. but no closer. So we didn’t do this ’cause we think it’s cool. we much prefer to build everything that we can build in the cloud because there’s just, you know, way fewer constraints. And there’s all kinds of amazing managed services and different kinds of things that exist there that that don’t natively exist like in the back of a restaurant. but there were things because of capacity challenges and places we wanted to apply technology to help our operator and team members in the restaurants. And then to create better customer experiences that justified thinking about you know, a a footprint that could work systems and apps that could work even if the internet connection was down, which, you know, it happens.

(09:13):

Things that could operate with really low latency where needed, you know, think about things in an order pipeline or in producing food or things like that. So that’s kind of the why really quickly, and again, happy to field more questions about it. But we’ve written about it a bit and I’ve done a number of other talks about it, so I don’t want to drag people through it again if they’ve already heard the same story. So I wanted to tell you a little bit about how the architecture of these nodes works, just so you can kind of get an understanding. That’s the next couple slides. Essentially what we did is we put in some Intel NUCs, three of ’em per restaurant. And this is a picture I took a couple years ago, but it’s still the same thing that we have.

(09:49):

And just to give you a comparison in size, they’re about the size of a Chick-fil-A sandwich in terms of their total height. So they’re pretty small. We’re really space constrained in restaurants. We don’t have a server closet. We don’t have racks. We don’t have all the fancy stuff that you would see in a lot of businesses, even maybe other retail, like a Target or Walmart or something. We’re dealing with small spaces and offices and things like that. So we had to go with a small form factor. This is oversimplified, but we essentially have these three NUCs. Each of ’em looks identical and is identical across every store in the chain when they show up. And they basically have this stack to the right.

(10:31):

It’s the NUC. There’s a disk partition scheme, and there’s some cool stuff there that helps us do things like zero-touch provisioning and remotely wiping and resetting these back to their initial state. That’s kind of like a mirror of the way cloud-init works, if you’re familiar with that. Then we’ve got the OS, which is Ubuntu, and K3s ultimately on top of that. No virtualization, no hypervisor, none of those things. It’s basically K3s on bare metal with some little homegrown things that we did to make management easier. And then the applications live on top of that. So to give you another picture of what the architecture looks like, we’ve essentially got this edge K3s, and I’ll talk about that in a second. And then a series of services.

(11:13):

So think about the apps that live there. We’ve got essentially a bunch of platform applications that assist the developers who wanna build things for this in-restaurant platform. Those are things like local authentication services. We do a ton of MQTT, which I think we’ll talk more about in a minute, deployment, storage, et cetera. And then we’ve got applications that other teams can build. And I’ll show you kinda what that looks like in a moment. It’s also just good to be aware of what drove us to do this in the first place. I mentioned capacity and, you know, the challenge of growing restaurants beyond where they were. One of the things that we recognized as a business was that one way to scale and do better, to continue to maintain great customer service and quality while doing more volume, was to put more technology in place.

(12:02):

And so a lot of what we started with was actually trying to get more real-time visibility into what was happening in the restaurant, to get a better understanding of what was going on in the kitchen specifically. That’s where we started, ’cause that was one of our biggest capacity challenges, in addition to probably drive-through, but kitchen’s a big deal. So we started off doing this with really a bunch of what we would call smart kitchen use cases. And the very first things to come along were a lot of the fryers, grills, and holding cabinets, and then custom applications that were built consuming a lot of that data and creating things that made the team member’s job simpler. We have an article out about a thing that we call Automated Holding Assistant, which is a solution that basically looks at chicken pans as you check them in.

(12:50):

It’s got barcodes etched on it. There’s a camera that captures that. And essentially, you know, tracks that pan through its lifecycle to make sure that like, when it gets to the end of a timer you know, it gets discarded. it’s a, it’s a food quality issue for us. So things like that, that are really hard for people to remember to do at scale and kind of cumbersome. They’re manual, trying to put automation to make, you know, systems that are easier for our team members to use pretty low touch but that make sure that we’re doing the things that are really important to our business really well. so there’s a, that’s just one example of the type of application that might live on top of the stack that’s in the restaurants. And then of course, we’ve got like a big cloud control plane that lives in AWS in our case, that does a bunch of the stuff to coordinate deployments across those consolidate data that comes out you know, provide observability the more centralized authentication authorization stack, all those kinds of things.
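To make the pan-tracking idea concrete, here is a minimal sketch of a hold timer keyed on a pan's barcode. It is not Chick-fil-A's implementation: the `Pan` class, the barcode values, and the 20-minute hold limit are all hypothetical, chosen only to illustrate "track the pan through its lifecycle and flag it when the timer ends".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Pan:
    barcode: str           # hypothetical ID read from the etched barcode
    checked_in: datetime   # when the camera first captured the pan
    hold_limit: timedelta  # hypothetical maximum holding time

    def expires_at(self) -> datetime:
        return self.checked_in + self.hold_limit

    def must_discard(self, now: datetime) -> bool:
        # A pan is due for discard once its hold timer has run out.
        return now >= self.expires_at()


def pans_to_discard(pans, now):
    """Return the barcodes of pans whose hold timer has expired."""
    return [p.barcode for p in pans if p.must_discard(now)]


start = datetime(2023, 8, 24, 12, 0)
pans = [
    Pan("PAN-001", start, timedelta(minutes=20)),
    Pan("PAN-002", start + timedelta(minutes=15), timedelta(minutes=20)),
]
expired = pans_to_discard(pans, start + timedelta(minutes=25))
print(expired)
```

In a real deployment the check-in event would presumably arrive from the camera over the in-restaurant messaging layer rather than being constructed in code, but that wiring is outside what the talk describes.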

(13:42):

any, any questions so far about that? No, I’ll keep on cruising.

(13:51):

All right. I’ll keep going. So this just kind of underscores what I mentioned. You’ve got the stack, the red stuff being the things that exist under the control of the platform team, the one that’s rolling out and managing all the NUCs and the Kubernetes and K3s and all those kinds of things. And then, think of it as akin to a cloud provider: we’ve got other teams that leverage the platform itself to deploy container-based applications to the edge, and then also use those, not really primitives, I guess, but medium-level services like MQTT for messaging. So they don’t have to roll their own RabbitMQ or Kafka or something like that, and they don’t have to build their own deployment services.

(14:39):

We’ve got that covered. they don’t need to worry about auth, and they don’t need to worry about database either, which is what we’ll spend time talking about in just a second. So we give teams a lot of autonomy. They can sort of like pick and choose monitoring and observability tools programming languages, things like that. But we are opinionated about the things in red, which is you have to use this platform, you have to be container based on top of K3s and you have to use some of these certain services and not roll your own just because it’s a constrained environment. That’s essentially the way it works organizationally.

(15:14):

before I go to data, I’ll just mention maybe a really fast, like why Kubernetes? this is a, a Kubernetes related group. So this all started, we’ve been running this for five, almost five and a half years. so we are pretty early-ish in the Kubernetes lifecycle. And what we knew when we started doing this was we wanted intelligence in the restaurant to be centralized somewhere. Like we didn’t want to have like sales transactions from point of sale being sent into fryers, which we’re trying to like, run applications on fryer to like decide something about what they should do. And people were talking about that with trying to automate cooking fries. for example, we knew that that was <laugh> not gonna be a good, a good scalable supportable architecture. And so we knew that there needed to be a place where we could have people build an application that could do things consume data in restaurant, you know telemetry from kitchen equipment, events that team members are doing you know, downloading things from the cloud, like sales forecast that we compute like in aggregate across the chain, making adjustments to it, things like that.

(16:16):

So we knew we wanted to be able to do those kinds of things, and that needed to live somewhere. So that led us to this whole edge footprint in the first place. That’s kind of the why behind actually having edge compute. And then if we’re gonna run containers, we looked at all the options, Docker Swarm and HashiCorp Nomad and some other things that were around at the time. At our initial glance at Kubernetes, we were like, it seems too big and heavy for our footprint, for a restaurant, et cetera. But right around that time, I think this is 2017, maybe the KubeCon in Austin, it just got really clear: Kubernetes is where the industry is going. It’s where the open source community is gonna be, it’s where the momentum is gonna be. And we didn’t really wanna be off on some thing that left us writing all of our own stuff from scratch forever.

(17:09):

A great example might be Vector, which we use here. It’s awesome that we can just plug Vector into Kubernetes and get a lot of logging and metrics pipelining based on some awesome work that the community has done. So we wanted that. And then there was K3s that came along, which, for those who don’t know, is Kubernetes, but it strips out a lot of the cloud-related components. It initially just supported SQLite instead of etcd, and it’s a single binary. So it’s just a little bit simpler from a stand-up, get-it-running perspective. It was mostly built by Darren Shepherd, over at Rancher at the time, Acorn now.

(17:51):

So so all that together kind of got us comfortable with Kubernetes as a construct. And we kind of rolled with it and it’s been great. Haven’t really looked back or regretted that decision at all. but we did take a really simple approach to to using it, which was, we tried to not get really cute about everything that we could possibly do with Kubernetes, which is a lot. And it’s really easy to see your scope creep into a lot of things. we tried to keep it to the, the simple and basic stuff as much as possible, which is we want to be able to run applications across multiple nodes and have them get scheduled such that if the node they’re running on dies, you know, the, the cluster brings them back somewhere else. And then we do a couple other really very basic things with Kubernetes, but we try and just keep it as simple as possible because we’re talking about a, you know, a 3000 cluster footprint which if you get cute, I think gets really difficult to manage. so if there’s questions about that, lemme know. All right. So let’s talk a little bit about what types of data exist. So I kind of alluded to some

Hugh Lashbrooke (18:52):

Of this, Brian.

Brian Chambers (18:53):

Yeah, we have a question,

Hugh Lashbrooke (18:53):

Brian, just quickly. Yeah, I realized I forgot earlier that people attending can’t unmute themselves, but there’s a couple questions from the chat. (Oh, yeah, sure.) Before you move on to the next section, if I could drop those in. One was from Noll: you use Ubuntu 18, so what’s involved in upgrading at Chick-fil-A scale?

Brian Chambers (19:13):

Yeah, great question. Well, we picked 18.04 because it was the long-term support edition at the time. There were some things that were great about that decision. There’s also some extra junk <laugh> that made its way in by then that was a little bit of a pain to deal with. We won’t go into all of it at the moment, but happy to have a sidebar afterwards if interested. But upgrading is a great question. Generally speaking, there’s obviously patching that comes along, and that’s fairly straightforward to apply to stores. A full OS upgrade we actually haven’t had to do, but we’ve done similar things, like K3s upgrades. So, a full OS upgrade. The way that we laid this out, I’ll go back to the slide real quick. This one’s probably the best. So that partition scheme that I described: we basically slice up the disk that’s available on each of these nodes into a couple different partitions. We have a caching one that persists through any sort of wiping of the cluster. We can basically take these nodes remotely and have them reset themselves to the initial way that they showed up, which is the NUC with the base image that we have on one of the partitions in that partition scheme. So essentially we’re able to do some stuff with flipping between partitions and then dynamically providing a sort of init script to the node each time it starts up.

(20:46):

So we do some things with OverlayFS. All the K3s stuff, all the security patching, anything that gets layered onto that base partition that you start with is on an OverlayFS partition. And we’re able to do some little tricks with it where we can effectively confuse it, based on the absence of the key encryption key that it’s looking for, to make it think, oh, I need to reinitialize this. So we have some little tricks that we do from the cloud side where we can toggle on and off and say, hey, reset yourself to the initial state. And then when it checks back in, it picks up some instructions, essentially, that tell it what to do: how to initialize itself, whether to form a cluster, join a cluster, et cetera.

(21:25):

So we can change those instructions per node, per restaurant, dynamically, whenever we want. We obviously don’t get to a point where we have a bunch of unique ones across restaurants, but there might be two different versions of it at certain times when we’re doing some sort of rolling upgrade or something like that. So that’s the long answer. The short answer is some intentional partition layouts, some OverlayFS magic, and then kind of a cloud-init-type equivalent way to tell it what its instructions are when it boots up the first time. Cool.
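The reset flow described in this answer can be sketched as a small decision function. This is only an illustration under stated assumptions: the `BootInstructions` fields, the step names, and the idea that a single boolean stands in for "the key encryption key is present" are all hypothetical, and the real mechanism (partition flipping, OverlayFS, the cloud-side toggle) is not shown in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootInstructions:
    # Hypothetical payload a node might pick up from the cloud at check-in.
    k3s_version: str
    cluster_action: str  # "form" or "join"


def plan_boot(key_present: bool, instructions: Optional[BootInstructions]) -> list:
    """Return the ordered steps a node would take on startup."""
    if key_present:
        # Normal boot: keep everything layered on top of the base image.
        return ["mount existing overlay layer", "resume normal operation"]

    # Key is missing: the node concludes it must reinitialize from scratch.
    steps = ["discard overlay layer", "boot from base image partition"]
    if instructions is None:
        steps.append("wait for instructions from cloud")
        return steps
    steps.append(f"install k3s {instructions.k3s_version}")
    steps.append(f"{instructions.cluster_action} cluster")
    return steps


reset_plan = plan_boot(False, BootInstructions("v1.27", "join"))
print(reset_plan)
```

Because the instructions are data fetched at boot rather than baked into the image, two versions can coexist across the fleet during a rolling upgrade, which matches the behavior described above.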

Hugh Lashbrooke (22:01):

There was there was one other question from Roberto Miller. He asked, do you use, do you use an operator within K3s to manage the Postgres instances?

Brian Chambers (22:10):

And we’ll kind of come to like our super simple Postgres implementation in just a minute. So standby on that one. But no, we, we don’t the only off the top of my head, the only operator I can think of that we have is we have one that we built that when secrets get changed we’ll do pod restarts based off secret change. But that’s like, that’s pretty much it. We’ve tried to stay super simplistic. I’m sure I’m missing something. but in terms of like custom operators, we, we haven’t done a lot. All right. Let get back to data here. Data. Any others?
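The one operator mentioned here, restarting pods when a secret changes, boils down to comparing a fingerprint of each secret against the last one seen. The sketch below simulates that reconcile logic in plain Python with no Kubernetes client; the function names, the pod-to-secret mapping, and the choice to skip restarts the first time a secret is observed are assumptions for illustration, not details from the talk.

```python
import hashlib
import json


def secret_fingerprint(data: dict) -> str:
    # Hash a canonical JSON form so key order does not affect the result.
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def pods_to_restart(pods: dict, secrets: dict, last_seen: dict) -> list:
    """One reconcile pass.

    pods:      pod name -> list of secret names it consumes (hypothetical)
    secrets:   secret name -> current secret data
    last_seen: secret name -> fingerprint from the previous pass (updated in place)
    """
    changed = set()
    for name, data in secrets.items():
        fingerprint = secret_fingerprint(data)
        previous = last_seen.get(name)
        if previous is not None and previous != fingerprint:
            changed.add(name)
        last_seen[name] = fingerprint
    return sorted(pod for pod, used in pods.items() if changed & set(used))


pods = {"orders": ["db-creds"], "menu": ["api-key"]}
seen = {}
first = pods_to_restart(pods, {"db-creds": {"password": "a"}, "api-key": {"token": "x"}}, seen)
second = pods_to_restart(pods, {"db-creds": {"password": "b"}, "api-key": {"token": "x"}}, seen)
print(first, second)
```

A real operator would watch Secret objects through the Kubernetes API and trigger a rollout instead of returning a list, but the comparison at its core would look similar.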

Hugh Lashbrooke (22:45):

Those are the questions for the moment. Okay, I’ll bring the...

Brian Chambers (22:49):

I got the chat up now I was able to get it over the top here, so I should be able to see if there’s anything that comes in.

Hugh Lashbrooke (22:55):

Sweet.

Brian Chambers (22:56):

Okay. Awesome. So just to give you a flavor of what kind of data is involved here. So we talked about like kind of kitchen equipment telemetry. So just to make it a little bit more tangible, like for a fryer, for example, anytime it starts a cooking cycle or finishes, cooking starts a cleaning cycle, finishes cleaning and then as well as just like heartbeat events that come out all the time, like of all the different heating element temperatures, oil temperature, all these kinds of things. And you may be thinking like, what do you need that for? some of that is really useful for understanding like equipment performance and health for trying to look for, for potential maintenance issues. you know, quality issues, if like the heat the, the temperatures are too high or too low, that’s gonna impact, you know, cooking time which could be a food safety issue if it’s like on the low side it could impact quality if it’s, you know, something’s running too hot.
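A fryer heartbeat of the kind described here could be modeled as a small record plus a temperature check. Everything specific in this sketch is an assumption: the field names, the topic layout, and especially the 325/375 °F thresholds, which are placeholders and not real food-safety limits or values from the talk.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FryerHeartbeat:
    device_id: str     # hypothetical equipment identifier
    oil_temp_f: float  # oil temperature in degrees Fahrenheit


# Placeholder thresholds for illustration only.
LOW_F = 325.0
HIGH_F = 375.0


def classify(hb: FryerHeartbeat) -> str:
    if hb.oil_temp_f < LOW_F:
        return "too_low"   # could lengthen cook time: a potential food-safety issue
    if hb.oil_temp_f > HIGH_F:
        return "too_high"  # could hurt quality
    return "ok"


def to_message(hb: FryerHeartbeat):
    """Build a (topic, payload) pair suitable for a pub/sub broker."""
    topic = f"kitchen/fryer/{hb.device_id}/heartbeat"
    payload = json.dumps({**asdict(hb), "status": classify(hb)}, sort_keys=True)
    return topic, payload


topic, payload = to_message(FryerHeartbeat("fryer-2", 310.0))
print(topic, payload)
```

Keeping the classification next to the event shape makes it easy for a downstream consumer to filter for out-of-range readings without re-deriving the thresholds.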

(23:50):

So those are useful kinds of telemetry items to gather from our kitchen equipment. To be honest, most of our IoT data today is kitchen oriented. But it comes across fryers, grills, the holding cabinets, some refrigeration devices. There’s, I think, about 15 to 18 different discrete kitchen equipment components that are connected and sending out data through the system. So that’s the majority of the IoT stuff. The next class, I would say, is app-generated data, meaning an app that’s running on the edge. So it could be the Fraha, or sorry, AHA, Automated Holding Assistant, solution. There’s another one for fries called Fraha. But we would consider that to be an app that runs in restaurant at the edge.

(24:44):

It’s got sort of the intelligence externalized to the edge compute nodes, and it’s generating data about what’s happening in that system as well as consuming things over MQTT within the restaurant. So that’d be another bucket of data: things that those apps are creating, or perhaps even downloading from the cloud, like a forecast or something like that. We’ve got POS. POS is kind of in flux for us. For most of the time this solution has existed, we’ve been on a proprietary, or commercial off-the-shelf with changes made by us, point of sale system. And so we did some things to basically get finalized transactions, like poked a hole from the POS world over to this edge ecosystem.

(25:35):

But we’re actually actively redoing our entire point of sale infrastructure. And that’s gonna open up, over the coming months and years chain wide, everything from our POS ecosystem, which was really closed historically, to be able to flow through here, which means lots of other things you can do from an application perspective. So keystrokes, meaning every time somebody hits something on a register, which would represent imminent demand for that item: somebody wants it right now, they’re ordering it right now. That can be helpful, not to drive exactly what we’re gonna cook, but to adjust forecasts or things like that. Finalized transactions, speed of service metrics, different things like that. So there’s a lot of stuff in the POS ecosystem that’s really important.

(26:20):

That’ll be unlocked. It’s starting to be now. We have it in some stores, but it’s not at chain-wide scale yet. Another class would be all the system telemetry, so the stuff we use to operate and manage the platform, all the things in the red boxes on the slide we looked at a minute ago. And then of course we have customers of our platform building applications, and we need to give them metrics and logs and observability data about the solution as well. I just did a talk about that at Datadog DASH a week and a half ago, so that should probably be online if anybody’s interested in the observability story. We won’t go into that one much more from here. And then finally, I consider this to be incoming because we only have a few footprints of it at the moment.

(27:04):

But one of the things that we want is to do kind of what we’ve been able to do in the kitchen in other parts of the restaurant. So we’ve got a number of solutions we’re evaluating at the moment at you know, small scale, like less than a hundred stores, but operational telemetry data, meaning like business operations. So you know, like seeing for each car that comes into the lot and into the drive-through, like what’s the total time it takes us to serve them you know, so knowing they’re on the property, you know, this is the point at which they were first had their order taken, which isn’t always a fixed point for us. We have people out with tablets in the drive-through taking orders, so that could be dynamic. So being able to know when did they get their order taken and how long was it before they got to the window, and then how long before they could actually exit the property that would actually represent like the full, you know, drive-through customer experience.
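The drive-through measurement described here amounts to differencing four timestamps per car. The sketch below assumes those four events can be captured somehow (the talk mentions CV and LIDAR as candidates); the `DriveThruVisit` class, its field names, and the sample times are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DriveThruVisit:
    arrived: datetime         # car first detected on the property
    order_taken: datetime     # order taken, possibly by a team member with a tablet
    reached_window: datetime  # car reaches the pickup window
    exited: datetime          # car leaves the property

    def segments(self) -> dict:
        """Break the visit into segments, in seconds."""
        return {
            "wait_before_order_s": (self.order_taken - self.arrived).total_seconds(),
            "order_to_window_s": (self.reached_window - self.order_taken).total_seconds(),
            "window_to_exit_s": (self.exited - self.reached_window).total_seconds(),
            "total_s": (self.exited - self.arrived).total_seconds(),
        }


visit = DriveThruVisit(
    arrived=datetime(2023, 8, 24, 12, 0, 0),
    order_taken=datetime(2023, 8, 24, 12, 2, 0),
    reached_window=datetime(2023, 8, 24, 12, 6, 30),
    exited=datetime(2023, 8, 24, 12, 7, 15),
)
timing = visit.segments()
print(timing)
```

The `wait_before_order_s` segment is the part that a point-of-sale timestamp alone cannot see, since the register only knows when the order was first rung in.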

(27:55):

Whereas today, all we know is from the time somebody hits a button ringing their order in for the first time. They may have been on property for a minute, two minutes, five minutes at that point. We don’t know until they actually make it all the way out and leave. So things like that. And same idea indoors with tables, like knowing occupancy of tables so that we could maybe trigger to know when to go clean them or things like that. And, you know, time in queue in front of the registers, and on and on and on, right? So we’re looking at some CV and LIDAR related solutions, CV driving more compute and GPU capacity in stores than we would ideally like to have to deploy and figure out how to manage at scale like we have with this other stack.

(28:36):

And lidar being a little bit more lightweight. And also you know, non-invasive you can’t tell who a person is, so there’s no privacy concerns there. So we kinda like that, but those are some of the data elements that are inbound and that we’re working on actively. Oh, on the chat. Okay, cool. So talk a little bit about persistence, which we’ve made a lot of like simplifications in terms of how we think about this. So we just talked about the data flowing through the restaurant and kind of playing off, you know, data is the new oil. essentially I would say you could simplify this whole architecture for, in the stores to say like, this is a way to enable data flow through the restaurant in a, like a non-siloed manner that multiple people can tap into and consume when they need it.

(29:24):

And that’s really the way I would think of it. So it’s not a repository for long-term storage. It’s not an oil refinery where we’re doing all of the fine tuning and perfect, you know, data cleanup or anything like that. It’s really a, a means to ship things from in restaurant, ultimately out to the cloud, which we still try and do whenever we can based on connections being up and things like that. but essentially it’s a it’s that, and it’s being able to tap into it if you have something that you need to do in the store, which is, maybe breaks the metaphor a little bit, but I just imagine somebody coming in and siphoning off what they need you know, for their use case and, and building off to the side a little bit you know, in, in this picture.

(30:05):

So that’s the way I would think about this architecture. And that leads us to our persistence strategy, which I think you guys will find interesting. <laugh>. So three big things here, and I’m sure there’ll be questions, so feel free to drop ’em in the chat. Number one, persistence at the edge is best effort. So we do not do any service level agreements. We make no guarantees that it’s always gonna be there. If you use one of the databases that we have available, which is primarily the Postgres one, we’re gonna do best effort to, you know, keep it up within Kubernetes. But we made a trade-off to say, we think given the number of footprints and the potential issues that could happen with making a really highly available database that people come to trust, and where it breaks things significantly when it’s not there, we are gonna end up in a lot of trouble.

(30:57):

And so we went with this approach of persistence is best effort. It’s not, you know, highly available. We would even probably say it’s more of a recoverable architecture than an available architecture, as is our entire edge stack. So one of our management techniques for Kubernetes cluster problems or misbehaving items, as well as patching and upgrading, is, as I mentioned, wiping a node back to its initial state and reinitializing it and having it rejoin the cluster. That’s either at the whole restaurant level, if something is terribly wrong, so we could blow everything away and bring it back, or more likely it’s at a node level, and it’s able to be blown away, come back, and rejoin the cluster. But that’s a trade-off that we made. And it has served us well, actually, and the next point, or two points away, will explain how we mitigate that as well.

(31:49):

So we have a local database for convenience purposes. We actually have two local databases, which you’ll see in the picture in a moment. But Postgres is the one that any of our teams would use for whatever they need. We would say treat it as a tool. So a place to drop stuff temporarily, or if you’ve got something that SQL is really handy for, or, you know, if you want some degree of persistence outside of an app restart or something like that, it can be useful for that. But it’s not intended, again, to be the long-term repository for the application’s data, or the business’ data as a whole. It’s kind of a tool in the toolbox that you can use, you know, with your app.

(32:30):

And then what that means is that we actually encourage, and this is the reason for the iPhone picture I think it’s the most tangible way to talk about it. So when you get a new iPhone you know, you like turn it on, you connect your iCloud account and everything like magically rehydrates, right? Like your photos come back and your apps that you had downloaded come back, et cetera, et cetera. that’s essentially what we have encouraged development teams to do with our applications at the edge as well. So you may not always be online, so sometimes you may have a state that hasn’t been sent out to the cloud, and that’s okay. And in most cases, there’s enough resiliency in the architecture that it’s going to, you’re gonna be fine. Like you’re gonna get a connection back. It could even be a day later.

(33:12):

It’s very unlikely that you have a situation where, you know, you lose all of the nodes in the cluster and the internet connection was down. It’s probably more of a disaster, like a lightning strike or something, if that happens anyway. So we don’t really see a lot of those cases, but even if we did, most of the data that we’re dealing with in this ecosystem is important for making near-term automated decisions, not for long-term financial reporting or something like that. So generally speaking, we’re encouraging apps to send data out. So flow through, consume, use in restaurant, all great. When you have a chance, hydrate the state that you care about out to the cloud. And then if your app goes away, one of the things it should do when it starts up is see, is everything I need here? And if not, let’s suck things back down from the cloud.

(34:01):

And reinitialize. And that pattern has been fairly straightforward for teams to work with in the restaurant edge, and it has made managing a Kubernetes environment like this, at the number of instances that we have, much more manageable. Like there’s a lot more support vectors that we can take that we couldn’t take if we had to never lose anything at any store ever. I don’t think we would’ve succeeded if we did that with the team that we had, the size that we had, and the use cases that we’ve had. So I think it was a good trade-off, but definitely a trade-off for sure, and not one that probably everybody can make. So here’s a little picture of what that data flow in the store looks like. So the bottom would represent the edge.
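To make the "rehydrate on startup" pattern described above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `FakeCloud` stands in for whatever cloud-side service an app team would use, `EdgeApp` and its method names are hypothetical, and a JSON file stands in for local best-effort storage.

```python
import json
import tempfile
from pathlib import Path


class FakeCloud:
    """In-memory stand-in for a cloud service (hypothetical, not a real API)."""

    def __init__(self):
        self.saved = {}

    def push(self, app_id, state):
        self.saved[app_id] = dict(state)

    def pull(self, app_id):
        return dict(self.saved.get(app_id, {}))


class EdgeApp:
    """An app that treats local storage as best effort and the cloud as the backup."""

    def __init__(self, app_id, state_file, cloud):
        self.app_id = app_id
        self.state_file = Path(state_file)
        self.cloud = cloud
        self.state = {}

    def start(self):
        """Use local state if it survived; otherwise pull it back from the cloud."""
        if self.state_file.exists():
            self.state = json.loads(self.state_file.read_text())
            return "local"
        self.state = self.cloud.pull(self.app_id)
        self._save_local()
        return "rehydrated"

    def update(self, key, value, online=True):
        """Write locally first, then ship state out when a connection is available."""
        self.state[key] = value
        self._save_local()
        if online:
            self.cloud.push(self.app_id, self.state)

    def _save_local(self):
        self.state_file.write_text(json.dumps(self.state))


def demo():
    cloud = FakeCloud()
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "state.json"
        app = EdgeApp("example-app", path, cloud)
        app.start()
        app.update("batch_size", 12)

        path.unlink()  # simulate a node being wiped back to its initial state

        fresh = EdgeApp("example-app", path, cloud)
        return fresh.start(), fresh.state


if __name__ == "__main__":
    print(demo())
```

The design choice mirrored here is that losing the local file is a normal event rather than an error: the app's startup path always knows how to rebuild from the cloud copy.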

(34:45):

So we’ve got MongoDB, which is really only used to back Aedes, which is an MQTT broker that’s from the moscajs project, and Aedes doesn’t have a logo, so Mosca it is. But that’s our MQTT broker, which is really our primary data hub for the restaurant. So not a lot of REST API interaction inside the store or anything that you might typically see in a microservices architecture in the cloud. It’s more message oriented. And so most of our applications that we were talking about are really pub/subbing on MQTT, and then most of the REST stuff that they would do would probably be interacting with services that are deployed in the cloud outside of the restaurant. So that would generally be to pull things down, like a forecast or something like that, or to publish things out that are specifically related to their app, up to their control plane for the app across the whole restaurant footprint.
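For readers less familiar with the message-oriented style described above, the following Python sketch shows the basic idea of topic-based publish/subscribe with MQTT-style wildcards (`+` for one level, `#` for all remaining levels). It is an in-memory toy, not Aedes and not a real MQTT client; the `Broker` class and the topic names are made up for the example, and details such as `$`-prefixed system topics, QoS, and retained messages are left out.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT-style topic filter matches a concrete topic."""
    filter_parts = topic_filter.split("/")
    topic_parts = topic.split("/")
    for i, part in enumerate(filter_parts):
        if part == "#":
            # Multi-level wildcard: only valid as the last filter level.
            return i == len(filter_parts) - 1
        if i >= len(topic_parts):
            return False
        if part != "+" and part != topic_parts[i]:
            return False
    return len(filter_parts) == len(topic_parts)


class Broker:
    """Toy in-memory broker: delivers each message to every matching subscriber."""

    def __init__(self):
        self.subscriptions = []  # list of (topic_filter, callback)

    def subscribe(self, topic_filter, callback):
        self.subscriptions.append((topic_filter, callback))

    def publish(self, topic, payload):
        delivered = 0
        for topic_filter, callback in self.subscriptions:
            if topic_matches(topic_filter, topic):
                callback(topic, payload)
                delivered += 1
        return delivered


if __name__ == "__main__":
    broker = Broker()
    # Hypothetical topics: one consumer wants every fryer, another wants everything.
    broker.subscribe("kitchen/fryer/+/temp", lambda t, p: print("fryer app:", t, p))
    broker.subscribe("kitchen/#", lambda t, p: print("telemetry:", t, p))
    broker.publish("kitchen/fryer/1/temp", 350)
    broker.publish("kitchen/grill/2/temp", 400)
```

The publisher never needs to know who is listening, which is what lets multiple teams tap into the same data flow without point-to-point REST calls inside the store.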

(35:42):

So that’s pretty typical. Mongo’s basically there to back that. Since we have Postgres, we don’t make Mongo available to application teams. Again, these should be simple use cases; that’s what the Postgres one is there for, and Postgres we just talked about. So that’s what we have there. We hit Mosca. And then on very rare occasions, we’ve had reasons for people to write things to the local file system. And we’ve been playing with Longhorn. We aren’t at full scale across the entire chain, but some of the stores have Longhorn running. From a persistence perspective, we hit a couple little issues with it. I’m not expert enough to tell you exactly what they were, but we hit a couple issues with it, so we haven’t fully rolled it out yet.

(36:25):

But a bunch of stores have Longhorn doing file system replication, so a distributed file system effectively for, you know, a persistence store as well. So kind of, you know, S3-ish, but certainly not as <laugh> resilient as it would be. So that’s essentially a bit of the storage service and persistent storage as it relates to the restaurant edge. So, almost at the end of my slides. But some design principles that maybe I can just summarize here, that we’ve talked about implicitly throughout: highly recoverable architecture over highly available. Generally speaking, the way we do Kubernetes, the way we do databases, the way we tell people to build their apps are more about being able to recover and get back to a state that makes sense than they are.

(37:17):

Like, let’s make this the most highly available architecture that’s ever existed. HA is awesome in the cloud and very reasonable with, you know, multiple availability zones and regions and distributed database services and all these things that you can leverage. But that is very, very difficult to replicate in a quick service restaurant, even one, let alone like 3,000. So we made that decision to go that route, and that really applies to most of what we do. I mentioned this before, but we try and keep our K3s usage to the bare minimum. So as little Kube-ness as possible, as I like to put it, and, you know, make trade-offs and accept that this is a constrained environment. So we have to allow those constraints to sometimes get us to engineer for things that are important.

(38:08):

Like I would say our partition scheme and our OverlayFS stuff and our ability to wipe a device back to its initial state was a lot of engineering work that was around some of the constraints that existed in our environment, like not having IPMI or PXE boot or things like that, that would’ve been really awesome. So we really had to engineer around some of those constraints because being able to support this at scale was really important. But for other constraints, we had to say that we’ll accept that and we won’t offer maybe the same levels or the same guarantees that we would if this same architecture lived only in the cloud, for example. And that really, again, came back to the ability to operate at scale, which, you know, if we can’t do that, it doesn’t really matter how awesome all the other decisions were.

(38:48):

We’ve gotta be able to operate it at scale. So another trade-off there, but we’ve been happy with that one. And then, like I said, we have no persistence SLAs. It’s best effort. It usually is there, and that’s great, but app teams can’t depend on everything always being up and things that they wrote always being available. So it encourages a little bit different application design philosophy, and that’s been a journey as people learn to operate within it. But certainly none of these three are things that we have regrets over. It’s been empowering, and they’ve been good trade-offs to make the architecture supportable.

(39:32):

And that’s all I have in terms of slides. So I think questions would be the next thing, if we still have time. But if you’re interested in something that you think is tangential to this and you wanna learn more, you can find me in a bunch of these places. LinkedIn, I do a Substack thing, and then for Chick-fil-A specific stories, you can find a lot more about the context behind the edge architecture, that aha article that I mentioned, which is really interesting, like what one of the teams on top of the platform did. They were actually our first customer. And more; that’s all on the Chick-fil-A tech blog. That observability story is actually there as well. So you could check that out. And I think that’s it. Are there other questions? I don’t see any in the chat, but other questions I can answer?

Hugh Lashbrooke (40:14):

Awesome. One question just came in right now. I can read it out so it’s on the recording: how do you deal with the connectivity between the edge and the cloud? Do you use some VPN or some other tools such as scpa or something similar, or simply encryption or authentication over open internet?

Brian Chambers (40:31):

Yeah, the latter. So simply encryption and authentication over open internet. So we do a secret store, like a managed secret store, which is HashiCorp Vault, and it runs in the edge environment. And we do synchronization of secrets as part of our whole edge bootstrap process. So they have secrets there that they can get to access protected services. But then there’s also, I’ll try to do this without going too long. When I showed you, let me go back on the slides here, when I showed the, oh, let’s see here, yeah, auth services here. This is probably the best spot to look at. So essentially we have a, we call it go off. It’s a custom homegrown OAuth-compliant server. If you’re asking why would Chick-fil-A do that, why not just buy it, there were a couple things. Well, cost for one, in terms of the number of clients that we have. And then two, we wanted some of the flows that were not yet available in a lot of the commercial products, like the OAuth device code flow, which is like what you’d see when you wanna log into ESPN on your Apple TV or different things like that. You get that kind of asynchronous flow of enter a code, et cetera. We did that for some of our devices early so that they didn’t have to do anything on the actual device to log in.

(42:02):

So externalize some of that. So anyway, we have an OAuth server that lives on the cloud side, and then we have a local one. So the cloud one’s job is basically to onboard anything that needs permission. So that could be the Intel NUCs themselves, which end up with identities for various reasons. Any IoT device, fryer, grill, et cetera, those all onboard through OAuth as well. And then any of the applications that exist in this edge footprint also get their own identity. So we basically are minting a JWT with the permissions embedded in there. So it’ll be full permissions for services that can be accessed, both MQTT topics, pub and sub, at the edge, and so we enforce those in our broker, and also for cloud resources as well.
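The idea of a token that carries its own permissions, which a broker can check without calling back to the auth server, can be sketched as follows. This is a simplified illustration using only the Python standard library and HS256 (a shared-secret signature); the claim names `mqtt_pub` and `mqtt_sub`, the key, and the topic strings are all assumptions for the example. The talk does not describe the real claim layout or signing algorithm, and a production system would normally use a maintained JWT library, asymmetric keys, and expiry checks.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(claims: dict, key: bytes) -> str:
    """Create a compact HS256-signed JWT containing the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token: str, key: bytes) -> dict:
    """Check the signature and return the claims; raise ValueError if invalid."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        key, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))


def can_publish(claims: dict, topic: str) -> bool:
    """Broker-side check: is this exact topic listed in the token's publish grants?"""
    return topic in claims.get("mqtt_pub", [])


if __name__ == "__main__":
    key = b"demo-only-secret"  # hypothetical key, for illustration only
    token = mint_token(
        {
            "client": "grill-1",
            "mqtt_pub": ["kitchen/grill/1/temp"],
            "mqtt_sub": ["kitchen/grill/1/commands"],
        },
        key,
    )
    claims = verify_token(token, key)
    print(can_publish(claims, "kitchen/grill/1/temp"))
    print(can_publish(claims, "kitchen/fryer/1/temp"))
```

Because every grant travels inside the signed token, the enforcing component only needs the verification key, which is what makes the offline behavior discussed next possible.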

(42:49):

So things that they might wanna access on the cloud side. And the reason we put the full permissions in there is for this offline resilience thing. So if you can’t reach the cloud broker, but you need to keep operating in restaurant, the local auth service actually acts as a proxy first, and it just is a pass-through if the cloud auth server is available. But if it’s unavailable, that can actually re-sign the same permissions with a different OAuth provider in store and maintain continuity from an authentication perspective. So that’s how that works. But yes, we ultimately just use that facility to talk to protected resources on the cloud side. And we don’t do any sort of VPNs or anything. All traffic is outbound from the edge to the cloud. We never ever, ever have any case where there’s an inbound web request or message or anything else. It could be over the MQTT rails, but that’s a connection opened from the store out to the cloud. So that’s the way that ultimately works. Great question. Thanks.
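The pass-through-then-re-sign behavior described above might look roughly like the sketch below. It is a guess at the shape of the logic, not Chick-fil-A's implementation: the class names, the `iss` values `"cloud"` and `"store"`, and the assumption that the local service caches the last permissions it saw from the cloud are all invented for illustration. The token format is a simplified signed payload rather than a full JWT, and shared HMAC keys are used only to keep the example self-contained.

```python
import base64
import hashlib
import hmac
import json


def sign(claims, issuer, key):
    """Return a simple 'payload.mac' token naming the issuer that signed it."""
    body = dict(claims, iss=issuer)
    payload = base64.urlsafe_b64encode(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).decode("ascii")
    mac = hmac.new(key, payload.encode("ascii"), hashlib.sha256).hexdigest()
    return payload + "." + mac


def verify(token, keys_by_issuer):
    """Verify a token against the key of whichever trusted issuer it names."""
    payload, mac = token.rsplit(".", 1)
    body = json.loads(base64.urlsafe_b64decode(payload))
    key = keys_by_issuer[body["iss"]]  # KeyError means an untrusted issuer
    expected = hmac.new(key, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, mac):
        raise ValueError("bad signature")
    return body


class CloudAuth:
    """Stand-in for the cloud-side auth server."""

    def __init__(self, key, grants):
        self.key = key
        self.grants = grants
        self.online = True

    def issue(self, client_id):
        if not self.online:
            raise ConnectionError("cloud auth server unreachable")
        return sign({"client": client_id, "perms": self.grants[client_id]},
                    "cloud", self.key)


class LocalAuthProxy:
    """Pass through to the cloud when possible; re-sign cached permissions if not."""

    def __init__(self, cloud, cloud_key, local_key):
        self.cloud = cloud
        self.cloud_key = cloud_key
        self.local_key = local_key
        self.cache = {}  # client_id -> last permissions seen from the cloud

    def issue(self, client_id):
        try:
            token = self.cloud.issue(client_id)
        except ConnectionError:
            # Offline: same permissions, different (in-store) signer.
            return sign(self.cache[client_id], "store", self.local_key)
        body = verify(token, {"cloud": self.cloud_key})
        self.cache[client_id] = {"client": body["client"], "perms": body["perms"]}
        return token


if __name__ == "__main__":
    cloud = CloudAuth(b"cloud-key", {"grill-1": ["kitchen/grill/1/temp"]})
    proxy = LocalAuthProxy(cloud, b"cloud-key", b"store-key")
    trusted = {"cloud": b"cloud-key", "store": b"store-key"}
    print(verify(proxy.issue("grill-1"), trusted)["iss"])
    cloud.online = False
    print(verify(proxy.issue("grill-1"), trusted)["iss"])
```

In this sketch, in-store consumers trust both signers, so a client keeps working with identical permissions whether its token came from the cloud or from the local fallback.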

Hugh Lashbrooke (43:53):

Cool.

Brian Chambers (43:53):

See a bunch of questions. I’m just gonna go back and read <laugh>.

Hugh Lashbrooke (43:56):

There’s quite a few. What we might do, we’ll take that one from Mark, and then maybe the rest we can answer in a follow-up where we’ll do a recap blog post. Maybe if you can just type up some answers to these, we can pop them in there.

Brian Chambers (44:13):

Yeah, I can type answers to these. We’re out of time, no worries. Yeah. So team size and experience. So we led this out of our architecture team. We had three of us at first plus, let’s see, a couple over the next handful of months. So getting to go live and launching across the chain was a really small team, only like five or six people total. Definitely not enough <laugh>. I’ve learned a lot and talked a lot about appropriate platform team size, et cetera. What was good is that we did have some really brilliant people. I’m not talking about myself. Some really brilliant people who had a really good generalist skillset. So I’m thinking of one person in particular who did a ton of the work around everything from, like, the Intel NUC base to figuring out the image and working through all the challenges that we faced there, figuring out the partition scheme, figuring out all the interesting OverlayFS stuff, the cloud services that interact with that to make that possible.

(45:25):

And like, we agonized over what if something goes wrong, anything goes wrong, and we brick all of the devices across the chain. We’ve got, you know, 7,500 plus bricks out in the field that we’ve gotta figure out how to bring back, how to re-image. Nothing’s working that’s depending on them until that gets fixed. Like we lost a lot of sleep, I think, over that. And that’s never happened because of the rigor that was put into that piece of the architecture. So we had somebody with that skillset who was able to go all the way from, like, they’d done their own Linux distributions in the past. They could go read about OverlayFS and be like, oh, I can do some cool stuff with this that’ll actually take care of a lot of problems. The same person learned from the Ubuntu knit pattern and was like, huh, I think we can use something like that.

(46:14):

So there’s a skillset like that, all the way up to good software engineering. We had another person, you know, 20 years of experience doing all kinds of emerging tech related stuff. So a broad-based ability to learn new stuff like Kubernetes quickly. I was there. I don’t know what I brought to it, but something. And then a couple other engineers who were able to be, you know, executors on some of the pieces. In addition to everything we’ve talked about, there was also a software development kit that we give to our partners so that they can onboard to this platform and to abstract some of the services, which in hindsight, I would say, is not a recommendation. It’s actually probably an anti-pattern to abstract things like OAuth and MQTT with something proprietary, because people struggled with it and we were the only line of support. But skill sets like that made this possible. As it’s grown, it’s gotten a bigger team to support it ongoing. And that’s been a little bit more of, not as many deep generalist skillsets, but some people who have platform engineering backgrounds or, you know, good systems engineering, like Golang experience and things like that. So that’s kind of the rough skillset I would say that we’ve seen across the team over time.

Hugh Lashbrooke (47:31):

Cool. Yeah, there’s a ton of questions here, which is fantastic. We are running short on time, and we do have a little quiz at the end where Diogenese will spin it up and we can win a Run DOK T-shirt. It’s like Run DMC, but Run DOK. Cool. So I think let’s jump into that. And then these questions here, we’ll save these. And Brian, we always do a recap post of these town halls afterwards. Brian, if you could maybe just type up some answers to those, we’ll include them in the recap post and then we’ll share that around. Yep. People can catch up there. Cool. So if you do have any other questions, please carry on, drop ’em into the chat and we’ll see how we can answer. Cool. So let’s jump into the DoK quiz that we do each time, different questions each month. So you can scan that QR code with your phone that’s on your screen there, or go to menti.com and enter that eight-digit code 6 4 9 0 7 1 5 8. It’s just a multiple choice quiz, very quick and easy. Yeah, we’ll give people some time to join that. Scanning the QR code’s probably the easiest way to do it, but you can go to the URL as well.

(48:49):

I see there’s a few likes coming in, which I assume means those people have joined. Cool. <laugh>. Thanks for all the questions, everyone. It’s great to see engagement, and thanks so much, Brian. It’s been really informative, really interesting. Yeah, you’re building some very cool stuff there, and it’s great to get all this info. Cool. Shall we get the quiz going? <inaudible>, you wanna get to the next slide? Cool. Oh, three questions today. It looks like we have 14 players in there. Great. So there’ll be a time limit on the questions. The faster you answer, the more points you get, and we’ll see. In the 2022 DoK report, what percentage of revenue did most organizations attribute to running data on Kubernetes? Obviously you have to have read the report for this, I suppose <laugh>. But yeah, pop answers in there. The quicker you are, the more points you get, assuming it’s correct, of course.

(49:51):

Cool. Well, three people got it right, 11 to 20%. Let’s see, that’s the leaderboard. Earthman up on top. Cool. Whoever gets top score here can just send me your postal address so we can get a t-shirt out to you. Okay, next question. What percentage of data workloads are most organizations found to be running on Kubernetes? Is that less than 10%, 10 to 24%, 25 to 49%, 50 to 74%, or 75 to a hundred? Oh, no one got it right. I think the questions are maybe a bit too difficult this month. <laugh> Leaderboard, no changes there. And I think there’s one more question, question three of three.

(50:58):

Last question. What did most organizations report in terms of productivity while running DoK? No more or less, 10% increase, 50% increase, two times increase, or over two times more productive. This report is available on dok.community. There’s a link to it. It’s a great industry report from 2022. 50% increase. Correct. Four people got that right. Let’s see the final leaderboard then. And what does this mean? Snuggles by a long way. 1,676 points. Nice work, Snuggles. I’m not sure what your real name is, unless your real name is Snuggles, but if it is, that’s pretty cool. That’s a very cool name. If you could just pop me a message with your name and postal address and a phone number, ’cause that’s good to include in that. You can just find me on the DoK Slack or [email protected].

(52:02):

You can drop me an email there, or Slack, whichever’s easier for you, Jeff. We’ll get a t-shirt to send off to you. Thanks so much, everyone. The recording of this will be up on YouTube a little bit later, later today, later this week kind of thing. And there’ll be a recap post with answers to questions from Brian on our website. I see Brian has been answering questions in the chat, so that’s great. We can use that, and then if there’s any others, that’s that. But yeah, thank you. Thank you everyone for joining. I really appreciate it. There seems to be a lot of conversation going on in the chat. It might be worth just keeping this Zoom open for a bit for that <laugh>. I’m happy to do that. We can just, you know, keep it muted, and if people are still chatting, I’m happy to leave that. So maybe we’ll do that for a few minutes until that’s over. But yeah, thanks everyone, and thank you so much, Brian. It’s great to have you. And yeah, remember next month’s town hall, go to the meetup group and you can register on there. See you all again. Appreciate it. Cool. We’ll leave this Zoom open for a bit if people wanna answer questions there.
