Launch Coordination Checklist

This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:

Architecture

  • Architecture sketch, types of servers, types of requests from clients
  • Programmatic client requests

Machines and datacenters

  • Machines and bandwidth, datacenters, N+2 redundancy, network QoS
  • New domain names, DNS load balancing

Volume estimates, capacity, and performance

  • HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out
  • Load test, end-to-end test, capacity per datacenter at max latency
  • Impact on other services we care most about
  • Storage capacity

System reliability and failover

  • What happens when:
    • Machine dies, rack fails, or cluster goes offline
    • Network fails between two datacenters
  • For each type of server that talks to other servers (its backends):
    • How to detect when backends die, and what to do when they die
    • How to terminate or restart without affecting clients or users
    • Load balancing, rate limiting, timeouts, retries, and error-handling behavior
  • Data backup/restore, disaster recovery
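
The retry and error-handling item above can be sketched as a bounded retry loop with exponential backoff and jitter. This is one common approach, not something the checklist prescribes; `call_with_retries`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call a backend, retrying ConnectionError a bounded number of times.

    Hypothetical helper: the delay cap doubles per attempt and each sleep is
    scaled by a random factor in [0, 1) so clients do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Exception | None = None
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError as err:
            last_error = err
            if attempt == max_attempts - 1:
                break  # out of attempts: give up instead of sleeping again
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(cap * rng())
    assert last_error is not None
    raise last_error
```

Bounding the number of attempts matters as much as the backoff: unbounded retries from many clients can keep an already overloaded backend from recovering.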

Monitoring and server management

  • Monitoring internal state, monitoring end-to-end behavior, managing alerts
  • Monitoring the monitoring
  • Financially important alerts and logs
  • Tips for running servers within a cluster environment
  • Don’t crash mail servers by sending yourself email alerts in your own server code
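
One way to keep server-side alerting from flooding a mail system, as the last item warns, is to throttle alerts per key so that a tight error loop produces one message per interval rather than one per failure. The sketch below is a minimal in-process version; the class name `AlertThrottle`, its fields, and the five-minute default are hypothetical.

```python
import time
from typing import Callable, Dict


class AlertThrottle:
    """Allow at most one alert per key per interval (hypothetical helper)."""

    def __init__(self, min_interval_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_sent: Dict[str, float] = {}
        # Count of suppressed alerts per key, so the next message can say
        # how many occurrences were dropped in the meantime.
        self.suppressed: Dict[str, int] = {}

    def should_send(self, alert_key: str) -> bool:
        now = self._clock()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self._min_interval_s:
            self.suppressed[alert_key] = self.suppressed.get(alert_key, 0) + 1
            return False
        self._last_sent[alert_key] = now
        return True
```

A caller would check `should_send("disk-full")` before emitting the email and skip the send when it returns `False`.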

Security

  • Security design review, security code audit, spam risk, authentication, SSL
  • Prelaunch visibility/access control, various types of blacklists

Automation and manual tasks

  • Methods and change control to update servers, data, and configs
  • Release process, repeatable builds, canaries under live traffic, staged rollouts
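
The canary and staged-rollout items above can be sketched as a loop that widens exposure one stage at a time and stops at the first unhealthy stage. The function `staged_rollout` and the `deploy`, `healthy`, and `rollback` callbacks are hypothetical stand-ins for whatever release tooling and health signals a service actually has.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Push a release through increasing traffic fractions.

    Returns True if every stage passed its health check. On the first
    failing stage, rolls back and returns False without going further.
    """
    for fraction in stages:
        deploy(fraction)      # e.g. 0.01 is the canary under live traffic
        if not healthy():     # e.g. compare error rate against the old release
            rollback()
            return False
    return True
```

For example, `staged_rollout([0.01, 0.1, 0.5, 1.0], deploy, healthy, rollback)` would canary on 1% of traffic before widening. A real rollout would also wait between stages long enough for problems to surface.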

Growth issues

  • Spare capacity, 10x growth, growth alerts
  • Scalability bottlenecks, linear scaling, scaling with hardware, changes needed
  • Caching, data sharding/resharding
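
To make the sharding/resharding item concrete, the sketch below assigns keys to shards by hashing modulo the shard count and then measures how many keys would change shards if the count changed. It is meant only to show why resharding deserves planning before launch; `shard_for` and `fraction_moved` are hypothetical names, and hash-mod-N is just one simple scheme.

```python
import hashlib
from typing import Iterable


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical helper)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys: Iterable[str], old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    key_list = list(keys)
    if not key_list:
        return 0.0
    moved = sum(
        1 for k in key_list
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(key_list)


keys = [f"user-{i}" for i in range(1000)]
print(fraction_moved(keys, 4, 5))
```

With plain modulo sharding, most keys move when a single shard is added. Schemes such as consistent hashing exist to reduce that movement.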

External dependencies

  • Third-party systems, monitoring, networking, traffic volume, launch spikes
  • Graceful degradation, how to avoid accidentally overrunning third-party services
  • Playing nice with syndicated partners, mail systems, services within Google
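
A client-side rate limit is one way to avoid accidentally overrunning a third-party service during a launch spike. The token-bucket sketch below is a minimal single-threaded version; the class name `TokenBucket` and its parameters are hypothetical, and the actual limits would come from whatever the third party has agreed to.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `burst` calls at once, refilled at `rate_per_s` per second."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate_per_s <= 0 or burst < 1:
            raise ValueError("rate_per_s must be > 0 and burst >= 1")
        self._rate = rate_per_s
        self._capacity = float(burst)
        self._tokens = float(burst)   # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self._clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

When `try_acquire()` returns `False`, the caller can degrade gracefully (serve cached data, queue the request, or drop the optional feature) instead of sending the call. This version is not thread-safe; a shared limiter would need a lock.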

Schedule and rollout planning

  • Hard deadlines, external events, Mondays or Fridays
  • Standard operating procedures for this service, for other services