our-continuous-deployments-setup

Things we thought we'd need, but haven't

We never needed to change the maximum number of deploys a service could move forward. Three has been fine. That's because we can deploy in two minutes.

Design considerations

Services deploy independently

At one point it wasn't clear whether we should pick a single build and roll it out to individual services, or whether each service should move forward at its own pace. It eventually became clear that each service should be able to move forward at its own pace. This allows for deployment pipelines where prerequisite services can be deployed first.
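
To make that concrete, here is a minimal sketch of the kind of prerequisite ordering this enables, assuming a hypothetical pipeline description and a `deploy` callback; neither reflects our actual tooling:

```python
# Hypothetical pipeline description: each service lists the services that must
# be deployed before it. Names and structure are illustrative only.
PIPELINE = {
    "auth": [],        # no prerequisites
    "api": ["auth"],   # api waits for auth
    "web": ["api"],    # web waits for api
}

def deploy_in_order(pipeline, deploy):
    """Deploy each service once its prerequisites have gone out.

    Assumes the dependency graph is acyclic; `deploy` is whatever function
    performs a single service's deploy.
    """
    deployed = set()
    while len(deployed) < len(pipeline):
        for service, prereqs in pipeline.items():
            if service not in deployed and all(p in deployed for p in prereqs):
                deploy(service)
                deployed.add(service)
```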

System visibility

Providing deep insight into what the system observed, and the decisions it made, was critically important to people becoming more confident with it. We designed that into the system from the start, but made the mistake of initially only exposing it in the logs. That was just enough of an impediment that people would not check, and would speculate instead. Integrating that visibility into our web interface, right alongside all the other information, made a big difference.
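
As a hedged illustration of that difference, here is a sketch of recording decisions somewhere a web interface can query rather than only writing them to the logs; the table layout and function name are invented, not our actual schema:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical store that a web interface could read from; shown only to
# contrast "queryable record" with "buried in the logs".
db = sqlite3.connect("decisions.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS decisions"
    " (service TEXT, decided_at TEXT, action TEXT, reasons TEXT)"
)

def record_decision(service, action, reasons):
    """Persist what the system observed and decided so a UI can show it."""
    db.execute(
        "INSERT INTO decisions VALUES (?, ?, ?, ?)",
        (service, datetime.now(timezone.utc).isoformat(), action, json.dumps(reasons)),
    )
    db.commit()
```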

Migrating the org from continuous delivery to continuous deployments

We were already doing continuous delivery, but the deploys were being done by hand.

talk-what-i-wish-i-had-known-before-scaling-uber#migration-mandates

At 40:15: mandates to migrate are bad. Rather, the new system should be so much better that people want to get onto it.

 

Continuous deployments proved to be one of those tools. We were very cautious rolling it out, but most teams clamored for it, particularly those with people who had experience with it at other orgs.

Many of the teams opted in very quickly, but not all did. This led to some confusion about which services were being deployed automatically. It makes sense to be as consistent as possible about how services get deployed.

Stability

Several deploy outages have led us to create out-of-band checks on the deploys. The system quickly became fully autonomous and people rarely check the dashboards, so problems won't be spotted by someone watching.

It really helps to have one person who tends to the overall system and checks on stalled deploys, in addition to each domain owning its own deploys and getting emails when they stall.

Our job system runs these checks over and over again.

The out-of-band checks need to include both smart checks, which have a deep understanding of how the system works and double-check that viable builds actually get deployed, and dumb checks, like whether any service has been deployed by autobot within the last hour. These checks, in concert with occasional developer commentary, are effective.
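
As an illustration of the "dumb" end of that spectrum, here is a minimal sketch; `recent_autobot_deploys` and `alert` are hypothetical stand-ins for the real job and alerting hooks:

```python
from datetime import datetime, timedelta, timezone

def check_autobot_heartbeat(recent_autobot_deploys, alert):
    """Dumb out-of-band check: if autobot has not deployed anything in the
    last hour, something is probably stuck, so raise an alert.

    `recent_autobot_deploys` is assumed to return a list of deploy
    timestamps; `alert` is whatever notification hook the job system offers.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
    if not any(ts > cutoff for ts in recent_autobot_deploys()):
        alert("autobot has not deployed any service in the last hour")
```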

How we approached the development

Try to drive any complexity into the decision-maker. I had the benefit of being able to expose that complexity in the decisions presented to users of the web interface.

Development was driven by high-level, real-world scenarios. We link from the textual description of those scenarios directly to the test code that exercises them.

Once a decision is made, the system simply does a deploy as a normal user, albeit one we call autobot. This is helpful because the same sanity checks apply.

We can pause individual services, relying on the same locks that block developers from doing deploys. Again, the fact that something was not deployed because of a lock is recorded in the decision-making.

Deploys are also paused for degradation in the load balancer, and for any number of other good reasons.
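
Putting the last few points together, here is a minimal sketch of a decision-maker that keeps its reasons attached to each decision; the names (`Decision`, `decide`, the fields on `service`) are hypothetical, not our actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    """What was decided for one service, with the reasons kept alongside so
    a web interface can display them verbatim."""
    action: str                       # "deploy" or "hold"
    build: int | None = None
    reasons: list[str] = field(default_factory=list)

def decide(service):
    """Decide whether a service should be deployed right now.

    `service` is assumed to expose the fields referenced below; the checks
    mirror the ones described above (locks, load balancer health, a newer
    viable build), though the real system has more.
    """
    reasons = []
    if service.lock_held:
        reasons.append("deploy lock is held (the same lock developers use)")
    if service.load_balancer_degraded:
        reasons.append("load balancer is degraded")
    if reasons:
        return Decision("hold", reasons=reasons)
    if service.newest_viable_build > service.current_build:
        return Decision("deploy", build=service.newest_viable_build,
                        reasons=["a newer viable build is available"])
    return Decision("hold", reasons=["already on the newest viable build"])
```

Keeping the reasons on the decision itself is what lets the web interface answer "why didn't my service go out?" without anyone having to read logs.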

Special considerations

We are still extracting services from our previous monolith, so the monolith's code is still deployed to 10 services. This means a change to that code causes deploys to 10 services nearly concurrently. So far this hasn't been a problem, but you can see how, if a bad build escaped detection, it would be deployed to a large number of our services.

Continuous deployment can be used as a lever to get good practices in place. For example, we only continuously deploy services that run on Docker. We have some legacy services that continue to run on EC2 instances, and we do not automatically deploy those.
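
The gate can be as simple as an eligibility check; a sketch with an invented `runtime` field:

```python
def eligible_for_continuous_deploys(service):
    """Only Docker-based services are auto-deployed; legacy services on
    plain instances stay on the manual process."""
    return service.runtime == "docker"
```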

One of the things we worked hard on was ensuring that we could fall back to a manual process. I still think that is important, but I was surprised by how little interest people had in falling back to the manual process after adoption. They didn't want automation. They wanted full autonomy. In other words, they just didn't want to think about deploys.

Surprising Problems

Deploys not going out turned out to be a surprising source of problems. We thought that pausing continuous deploys to get a person to review some questionable state was reasonable. However, things quickly became fire-and-forget, and the expectation became full autonomy.

What we didn't do

After a bunch of discussions, we decided not to have deploy schedules. Schedules add organizational complexity: some domains or services would have different schedules from others, which would make the system harder to reason about generally. The simpler thing to reason about is: if you push to master, it goes live.

I found it helpful to send occasional emails reinforcing the continuous deploy behavior: if you put something on master, it is deployed, whether on a weekend or a holiday, it doesn't matter.

Setting Expectations