Sourcegraph is a universal code search engine that allows developers to search across their entire codebases. It can be deployed on-prem or used directly from the cloud. The service currently indexes over 1 million open source repositories and provides code search across them. Sourcegraph is planning on indexing every open source repository with more than 1 star on GitHub, GitLab, and other code hosts. To achieve this scale, it relies on Kubernetes to scale several stateful services.

These services range from commonplace databases like Postgres and Redis to bespoke stateful services written in Go. This talk describes these services and the tradeoffs made in their design to aid practitioners in developing their stateful services. This talk also discusses the failure scenarios the Sourcegraph team has encountered. Examples will include long PVC mounting time causing downtime in seemingly unrelated services, and the on-call culture Sourcegraph has adopted due to many novel services. Lastly, this talk discusses plans to make our existing stateful services more resilient and highly available with information from previous downtime incidents.

If you want to gain an understanding of the challenges of running stateful data on Kubernetes – this talk is for you. Newcomers will understand possible failure cases and how to remedy them. Practitioners will gain knowledge to help them design stateful services on top of Kubernetes and ideas for making those services highly available.

This talk was given by Sourcegraph’s DevOps Engineer Dax MacDonald as part of DoK Day at KubeCon NA 2021, watch it below. You can access the other talks here.

Bart Farrell  00:00

Our next speaker Dax MacDonald is joining us from Sourcegraph. You heard Melissa talking a lot about the differences between stateless and stateful. So in this case of Sourcegraph, they’re running all different kinds of stateful services. And Dax is going to give us a little bit more info directly so we can turn it over to you now. Welcome, Dax 

Dax McDonald  00:26

Let me give you this great overview of how we run stateful data at Sourceraph.

Bart Farrell  00:33

Just a reminder, folks, you can leave comments and questions in the YouTube chat. But also, if you really want to continue the conversation afterward with the speakers jump in our Slack. And that’s where you can have more direct interaction. 

Dax McDonald  00:47

I’m Dax McDonald. I’m gonna be talking today about how we use stateful services at Sourcegraph index, the world of open source. I previously worked at Rancher Labs on multi-cluster management. Now, Sourcegraph, we’re focused on bringing code search to every developer. And we’re hiring- if this sounds interesting, please reach out to me after the talk. Here’s an overview for today- I’ll be talking about why Sourcegraph needs so much stateful data. And I’m gonna be focusing on two particular copaths in our product: indexed search and non-indexed search. I’ll be providing a high-level overview of stateful data in Kubernetes. I’ll talk about some of the problems that we’ve encountered, and our attempted solutions. 


So first of all, I’ll go here, if this demo works I’ll be able to demo some of the awesome things that we can do with Sourcegraph. So this is kind of an example of Sourcegraph- searching for this thing across my entire code base, I can search across the entire world open source code. And there’s some features in here that might be more familiar, that may even be familiar to those who use a powerful IDE. So these are features that we have just right in the browser. These are pretty cool. I always love to do things that we can do with our product. Bringing these awesome code search abilities to tons of people. This is an idea of what you can do with Sourcegraph. But now we’re on and in comparison to our on-prem product, we’re actually indexing over 1 million open source repositories. We have a goal to index every single repository on the web with more than one star by the end of the year. Right now it looks like 240 terabytes of repository data that we manage, and 90 terabytes of index data. 


The Sourcegraph architecture, we’re just going to be talking about a slice of it. It’s a large product with quite a few different subservices, we have your normal collection of Postgres databases, utilize Redis for caching, and we actually have an S3 compatible object storage layer. We have several bespoke stateful services, which we’ll be going to be focusing on today. And we also have a large amount of caching that depends on the emptyDirs, a type of ephemeral volumes in Kubernetes will be discussing a little bit more. A kind of unique thing about Sourcegraph, we’re really storage bound. High speed storage is what makes your searches faster. That’s what we’re always looking at, the ways to get higher performance out of the underlying storage mediums we use. 


Here’s the focus on index search. Index search uses a trigram based search application underneath it called Zoekt from Google. It creates these indexes over time. This is where you get those millisecond-level searches that are really high speed. By default, you’re indexing the primary branch, then this is another theme of this talk, because I’ll be talking about how these services scale. This is scalable with human oversight, is how I would like to describe it. By that we mean when you increase the number of index servers, the number you need to consciously make sure that you’re there, making sure that there’s not a large rebalance event that causes your data at the shards to cause downtime for these search services. In comparison to some of our other services we’ll be talking about, like searcher, which is unindexed search. This one is what we use for anything that’s not outside the default branch, or when you’re searching across multiple revisions. At Sourcegraph, you can search across every single branch and repository. search across a series of commits. For that we utilize Searcher. That’s gonna go out essentially requests a zip file of your repository, a certain revision that a branch, be that a commit, and then caches it ephemerally locally on disk. Then, in comparison with index server, this service, you could scale up to hundreds of replicas as long as you have that cache space available. And in the emptyDir ephemeral volumes will be discussed a bit later. But that’s kind of an example of how this is a service that we can scale easily and quickly and don’t really need to spend that time carefully making sure that we’re not going to accidentally induce it arebalance event because we use this ephemeral storage here. 


And then Gitserver, this is our internal mirroring service. This goes out and clones and mirrors code from various code hosts, be it GitHub, GitLab, Bitbucket, it’s actually directly running Git commands. When Searcher asks for a certain revision, it’s using Git show, when we’re asking for certain contents of the commit, that’s actually running those Git commands. Again, this service is not HA, it’s and this is kind of an important thing I’ll talk about later as we’ve moved to Kubernetes. These are issues that come up during rollout- a single repository lives on a specific Git server. So if we have hundreds of good servers, the service isn’t implicitly HA, or highly available. So when we scale up, we need to be considerate of the fact that it might induce downtime due to rebalancing. 


This is just a small slice of the architecture of Sourcegraph. We have several other stateful subservices within architecture. And the full architecture diagram is available in our handbook, which you can find if you search Google for Sourcegraph handbook. As I mentioned earlier in the talk, I’ll be discussing stateful data in Kubernetes. Stateful applications, in contrast to stateless applications, we have persistence requirements. A lot of times when you hear about persistent data in Kubernetes, you’re gonna hear the concept of volumes, that’s key to know that we’re discussing persistent data, or stateful data. In this part I want to focus on persistent volumes versus ephemeral volumes, and how we use them at Sourcegraph. I just wanted to provide a quick overview of a dynamic persistent volume claim provisioning process, because I think it’s really interesting to kind of know how this works. We have a deployment here, with a volume mount, that mounts some data at the path at slash data. Then we’re gonna have a persistent volume claim that actually goes out and finds the requisition for that data, 200 gigabytes from the default storage class with the rewrite once access mode. And then we have the actual storage class, which in this case, is being backed by the Persistent disk CSI Driver from Google. Assume that this is a GKE based storage class. There are some primers here about what kind of storage you want- balanced, that can also be SSD. And it’s kind of a visual representation of that process of we have a deployment, we have a persistent volume claim, and that storage class. 


When we apply that deployment yaml, we have a pod that’s going to be created. That pod is going to also be applying that persistent volume claim at that same time. And that persistent volume claim with that referencing a storage class is going to trigger that provisioner to go and reach out to something external to the cluster, or whatever service you’re using to create that volume. In this case, it’s going to be in a cloud provider API. We’re actually going to be asking Google Cloud to go allocate that physical disk for us. An important thing to call out here Is that that’s not instantaneous for people who are familiar with stateless applications. Usually there’s not this whole process outside of the cluster where you’re waiting for this persistent disk to spin up. And that’s important to call out because there actually isn’t. Kubernetes has a service level objective of five seconds for stateless pods to spin up. That’s not true for stateful pods, we know there’s a dependency on that external volume creation service. So we can’t actually know for sure how quickly we can spin up these stateful applications, which is something that I feel like trips up a lot of people when they first enter into this. 


Another thing to call out here is some of you might have noticed on the deployment yaml- on the replicas, there is a section here that describes replicas, there’s one replica. That’s going to be an issue actually, for a stateless application having one, not using a StatefulSet here, it’s not a big deal. But when you have a stateful application, when you increase that replicas to two, we’re liable to have some subtle bug cases. If we’re lucky, we’ll actually see that with that pod we’re gonna get a volume multi-attach error because we’ll have two pods that are trying to reference one persistent volume claim. And those are just things that for the more experienced ones out there, it should be obvious, it should be the statefulset from the beginning. But these are things that I think trip up newbies as they’re coming into learning about stateful data. It’s like, why can’t I just increment the value of replicas. You have to be more thoughtful when you’re working with persistent data. 


We’ll talk a little bit about ephemeral volumes. The ephemeral volume that we use most at Sourcegraph is emptyDir(empty directory). Essentially using the root node of the kubelet for storage, the boot disk. That means there’s no waiting for that storage provider, your stateful applications spin up quite a bit faster now. They can act more stateless, in a way they get some higher spin up times. Even now as a ram disk if you need real a ton of IOPS performance. But there are also some cons with that. Small boot disks can fill up, you get pod evictions. And you don’t really understand why you’re if you don’t spend the time to truly understand how empty directories work, your pods get evicted for strange reasons. The other thing too is they always start empty. So you’re always starting with the cold cache. As we were looking into using an empty directory, we saw that hostPath was a possible solution for us because of some of the caching needs that we had. The fact that we run trusted workloads, because hostPath has a lot of interesting security risks, we spent a lot of time looking at that, and found that hostPaths allowed us to get some nice benefits in a similar way to how others use empty directory volumes.


For just a high level overview we saw there’s quite a few challenges in using stateful applications. Just simply because they break assumptions from earlier experiences. A lot of us have seen those demos of fast scaling applications. Those aren’t really true once you start getting to the stateful world, you need to be considerate of how long it takes for your persistent volumes to be created. How long your storage provisioner takes to actually go create those. I think a new perspective that Sourcegraph has is that we are unveiling a public SAAS right now. We have a history of on-premise deployments, so we’ve worked heavily with customers to basically deploy Sourcegraph on their infrastructure. For cloud SAAS, some things that we’ve found is we had a small operations team. We leveraged databases as a service heavily. We have multiple Postgres databases, we took those out of Kubernetes, we put them into Cloud SQL. We have S3 compatible object storage, and that was something where we use GCP buckets for that. And we also are evaluating, doing the same thing for our Redis service. Things that don’t have to be in Kubernetes, since we have small operations it makes sense for us to kind of move those out. We also have and this allows us to focus on our bespoke services where we have these really high storage requirements. For example, GKE specifically that we leverage is local SSDs. These are NVMe drives that are physically attached to the virtual machines, and this allows us access to extremely high performance disks. This is why we also looked at that hostPath mounting, because you need to use hostPaths to access local SSDs. We found it gave us a ton of performance. We had an architecture that already needs a lot of caching. 


Another thing to think of is cloning, so many repositories daily, are sinking. So much data from these external services want to be nice and polite, they don’t want to exceed their API rate limits. So we spend a lot of time heavily caching GitHub, GitLab responses, and various proxy services. And then we also prioritize thinking. Some of the continuing issues that we’re having today is that our engineering team really wants continuous deployments as they should. One of the challenges on the operations side from doing that is because we have these stateful applications that weren’t developed with a mind to continuous deployments, they were meant to be incremental, one version at a time. Most notably, index server and Git server suffer downtime when there’s a rollout because a single repository lives on a single Git server, a single index lives on a single index server. That’s one challenge that we faced. One thing we’ve been looking into heavily is obviously adding application level high availability. Engineering is working very hard on that. 


From the operations side, though, we’re always trying to look into solutions for that. We briefly considered blue green deployments where we duplicate all the data. But as Sourcegraph grew, that just became so expensive that it didn’t make a ton of sense. It took so long that it didn’t make a ton of sense for us at the time. Our current approach is kind of a partial fix. I’ve always had a high number of shards, so that when we do these new these continuous rollouts, and these new versions that are happening multiple times a day, there’s blips of downtime, but only for a certain number of repositories. And not a large swath- only like one out of all a small portion, the application suffers downtime. Another thing that is a technical debt that we’re paying for now is database migrations. In the past, we had applications that would have a major version that would perform our database migrations, and they were part of the product. Now we’re doing continuous deployments, looking to split that out, and make that not a part of the application anymore. 


One of the challenges in on-prem deployments too, we have any cloud compatibility that we’re looking for all those things. We talked about S3, compatibility, Postgres managed services. Sometimes we have customers deploy to data centers, they don’t have access to those things. That meant that we ship for S3 compatible storage, and we ship our own Postgres databases. Another aspect that we’ve come off to in the past was, we have customers who weren’t comfortable using operators when we suggested they were thinking about adding an operator product we asked- How do you feel about that? We had customer pushback, and that’s why we’re able to do it. And then last thing was, there’s just a lot of variation we see in how people use Kubernetes, how they deploy Kubernetes using GitOps using Helm. That’s always been a struggle when you’re trying to deliver an on-prem product. Thank you, everyone.