Launch Coordination Checklist
This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:
Architecture
- Architecture sketch, types of servers, types of requests from clients
- Programmatic client requests
Machines and datacenters
- Machines and bandwidth, datacenters, N+2 redundancy, network QoS
- New domain names, DNS load balancing
Volume estimates, capacity, and performance
- HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out (see the capacity sketch after this list)
- Load test, end-to-end test, capacity per datacenter at max latency
- Impact on other services we care most about
- Storage capacity
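Capacity and bandwidth estimates like those above usually start as back-of-envelope arithmetic before any load test. A minimal sketch of that arithmetic; the QPS, response size, and datacenter count are purely illustrative assumptions, not figures from any real launch:

```python
# Back-of-envelope launch capacity estimate (illustrative numbers only).

steady_qps = 5_000           # expected steady-state queries per second
launch_spike_factor = 3      # launch "spike": assume 3x steady state at peak
growth_6mo_factor = 2        # expected traffic growth six months out
avg_response_bytes = 20_000  # average HTTP response payload

peak_qps = steady_qps * launch_spike_factor * growth_6mo_factor
peak_gbps = peak_qps * avg_response_bytes * 8 / 1e9

# N+2 redundancy: peak load must still fit with the two largest failure
# domains offline (e.g. one planned maintenance plus one unplanned outage).
datacenters = 5
qps_per_datacenter = peak_qps / (datacenters - 2)

print(f"peak QPS: {peak_qps}, peak bandwidth: {peak_gbps:.1f} Gbps")
print(f"each datacenter must handle {qps_per_datacenter:.0f} QPS at max latency")
```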
System reliability and failover
- What happens when:
  - Machine dies, rack fails, or cluster goes offline
  - Network fails between two datacenters
- For each type of server that talks to other servers (its backends):
  - How to detect when backends die, and what to do when they die
  - How to terminate or restart without affecting clients or users
  - Load balancing, rate-limiting, timeout, retry and error handling behavior (see the retry sketch after this list)
- Data backup/restore, disaster recovery
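The backend-handling items above tend to reduce to a few defensive patterns: a per-attempt timeout, a bounded number of retries, and backoff so retries don't amplify an outage. A minimal sketch, assuming a hypothetical call_backend function supplied by the caller:

```python
import random
import time

class BackendError(Exception):
    """Raised when a backend call fails or times out."""

def call_with_retries(call_backend, request, *,
                      attempts=3, timeout_s=0.5, base_backoff_s=0.1):
    """Call a backend with a per-attempt timeout, bounded retries,
    and jittered exponential backoff between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call_backend(request, timeout=timeout_s)
        except BackendError as e:
            last_error = e
            if attempt + 1 < attempts:
                # Jitter avoids synchronized retry storms across clients.
                sleep_s = base_backoff_s * (2 ** attempt) * random.uniform(0.5, 1.5)
                time.sleep(sleep_s)
    # Surface the failure so the caller can degrade gracefully rather than
    # retrying indefinitely and adding load to an already-unhealthy backend.
    raise last_error
```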
Monitoring and server management
- Monitoring internal state, monitoring end-to-end behavior, managing alerts (see the prober sketch after this list)
- Monitoring the monitoring
- Financially important alerts and logs
- Tips for running servers within cluster environment
- Don’t crash mail servers by sending yourself email alerts in your own server code
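Monitoring internal state alone can miss user-visible failures, so it is usually paired with a black-box prober, plus a heartbeat so a silently dead prober is itself noticed ("monitoring the monitoring"). A minimal sketch; the probe URL and the report metrics hook are hypothetical:

```python
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"   # hypothetical end-to-end probe target

def probe_once(timeout_s=5.0):
    """Issue one end-to-end request and return (success, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=timeout_s) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    return ok, time.monotonic() - start

def run_prober(report, interval_s=60):
    """Probe forever; 'report' is a hypothetical metrics hook.
    The heartbeat metric lets a separate alert fire if the prober
    itself stops running -- monitoring the monitoring."""
    while True:
        ok, latency = probe_once()
        report("probe_success", int(ok))
        report("probe_latency_seconds", latency)
        report("prober_heartbeat_timestamp", time.time())
        time.sleep(interval_s)
```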
Security
- Security design review, security code audit, spam risk, authentication, SSL
- Prelaunch visibility/access control, various types of blacklists
Automation and manual tasks
- Methods and change control to update servers, data, and configs
- Release process, repeatable builds, canaries under live traffic, staged rollouts (see the rollout sketch after this list)
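A staged rollout gates each traffic increase on canary health under live traffic. A minimal sketch; the stage fractions, error threshold, soak time, and the set_traffic_fraction/canary_error_rate hooks are all illustrative assumptions:

```python
import time

# Illustrative stages and abort threshold -- not a real release policy.
ROLLOUT_STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]   # fraction of traffic on the new build
MAX_CANARY_ERROR_RATE = 0.001

def staged_rollout(set_traffic_fraction, canary_error_rate, soak_seconds=3600):
    """Push a release through increasing traffic fractions, watching the
    canary at each stage and rolling back on elevated errors.

    set_traffic_fraction and canary_error_rate are hypothetical hooks into
    the serving infrastructure and the monitoring system."""
    for fraction in ROLLOUT_STAGES:
        set_traffic_fraction(fraction)
        time.sleep(soak_seconds)              # let the canary soak under live traffic
        if canary_error_rate() > MAX_CANARY_ERROR_RATE:
            set_traffic_fraction(0.0)         # roll back to the previous release
            raise RuntimeError(f"rollout aborted at {fraction:.0%} traffic")
```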
Growth issues
- Spare capacity, 10x growth, growth alerts
- Scalability bottlenecks, linear scaling, scaling with hardware, changes needed
- Caching, data sharding/resharding (see the sharding sketch after this list)
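Resharding is far cheaper when keys are placed with consistent hashing, since adding or removing a shard moves only a small fraction of the keyspace. A minimal sketch; the shard names and replica count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to shards so that adding or removing a shard
    only moves roughly 1/N of the keys (useful for resharding)."""

    def __init__(self, shards, replicas=100):
        self._ring = []                      # sorted list of (hash, shard)
        for shard in shards:
            for i in range(replicas):       # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{shard}-{i}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key):
        """Return the shard owning the first ring position at or after hash(key)."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
print(ring.shard_for("user:42"))
```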
External dependencies
- Third-party systems, monitoring, networking, traffic volume, launch spikes
- Graceful degradation, how to avoid accidentally overrunning third-party services (see the rate-limiting sketch after this list)
- Playing nice with syndicated partners, mail systems, services within Google
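One common way to avoid overrunning a third-party service is a client-side token bucket capped at the rate agreed with the partner, independent of whatever limits they enforce. A minimal sketch with illustrative numbers:

```python
import time

class TokenBucket:
    """Client-side rate limiter so outbound traffic (including launch spikes)
    never exceeds an agreed request rate to a third-party service."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may be sent now, refilling tokens first."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should degrade gracefully, not queue forever

# Illustrative limit: 50 QPS agreed with the partner, bursts up to 100.
limiter = TokenBucket(rate_per_s=50, burst=100)
```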
Schedule and rollout planning
- Hard deadlines, external events, Mondays or Fridays
- Standard operating procedures for this service, for other services
Example Production Meeting Minutes
Attendees: agoogler, clarac, docbrown, jennifer, martym
Announcements:
- Major outage (#465), blew through error budget
Previous Action Item Review
- Certify Goat Teleporter for use with cattle (bug 1011101)
  - Nonlinearities in mass acceleration now predictable, should be able to target accurately in a few days.
Outage Review
- New Sonnet (outage 465)
  - 1.21B queries lost due to cascading failure after interaction between latent bug (leaked file descriptor on searches with no results) + not having new sonnet in corpus + unprecedented & unexpected traffic volume
  - File descriptor leak bug fixed (bug 5554825) and deployed to prod
  - Looking into using flux capacitor for load balancing (bug 5554823) and using load shedding (bug 5554826) to prevent recurrence
  - Annihilated availability error budget; pushes to prod frozen for 1 month unless docbrown can obtain exception on grounds that event was bizarre & unforeseeable (but consensus is that exception is unlikely)
Paging Events
- AnnotationConsistencyTooEventual: paged 5 times this week, likely due to cross-regional replication delay between Bigtables.
  - Investigation still ongoing, see bug 4821600
  - No fix expected soon, will raise acceptable consistency threshold to reduce unactionable alerts
Nonpaging Events
- None
Monitoring Changes and/or Silences
- AnnotationConsistencyTooEventual, acceptable delay threshold raised from 60s to 180s, see bug 4821600; TODO(martym).
Planned Production Changes
- USA-1 cluster going offline for maintenance between 2015-10-29 and 2015-11-02.
  - No response required, traffic will automatically route to other clusters in region.
Resources
- Borrowed resources to respond to sonnet++ incident, will spin down additional server instances and return resources next week
- Utilization at 60% of CPU, 75% RAM, 44% disk (up from 40%, 70%, 40% last week)
Key Service Metrics
- OK 99ile latency: 88 ms < 100 ms SLO target [trailing 30 days]
- BAD availability: 86.95% < 99.99% SLO target [trailing 30 days] (see the budget arithmetic after this list)
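For context on why the error budget was annihilated: treating availability as a fraction of the trailing 30-day window purely for illustration, a 99.99% SLO allows only a few minutes of unavailability, while 86.95% implies thousands of minutes:

```python
# Error budget arithmetic for a 30-day window (illustrative, time-based view).
window_minutes = 30 * 24 * 60                        # 43,200 minutes

slo = 0.9999
budget_minutes = window_minutes * (1 - slo)          # ~4.3 minutes allowed

measured = 0.8695
bad_minutes = window_minutes * (1 - measured)        # ~5,638 minutes of unavailability

print(f"budget: {budget_minutes:.1f} min, consumed: {bad_minutes:.0f} min")
print(f"budget consumed roughly {bad_minutes / budget_minutes:.0f}x over")
```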
Discussion / Project Updates
- Project Molière launching in two weeks.
New Action Items
