Launch Coordination Checklist

This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:

Architecture

  • Architecture sketch, types of servers, types of requests from clients

  • Programmatic client requests

Machines and datacenters

  • Machines and bandwidth, datacenters, N+2 redundancy, network QoS

  • New domain names, DNS load balancing
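
As a rough illustration of the N+2 redundancy item above: N is the number of machines needed to carry peak load, and two more are provisioned so that one planned and one unplanned outage can overlap without dropping traffic. The sketch below is only arithmetic. The function name and its parameters are hypothetical and are not part of the original checklist.

```python
import math


def machines_for_n_plus_2(peak_qps: float, qps_per_machine: float, spares: int = 2) -> int:
    """Return how many machines to provision: N for peak load, plus spares.

    All names here are illustrative assumptions, not an established tool.
    """
    if peak_qps < 0 or qps_per_machine <= 0 or spares < 0:
        raise ValueError("peak_qps and spares must be >= 0, qps_per_machine > 0")
    # N: the smallest whole number of machines that can serve peak load.
    n = math.ceil(peak_qps / qps_per_machine)
    # N+2: add the spare machines on top of N.
    return n + spares
```

For example, a service that peaks at 1000 requests per second on machines that each handle 300 needs N = 4, so this sketch would suggest provisioning 6.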

Volume estimates, capacity, and performance

  • HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out

  • Load test, end-to-end test, capacity per datacenter at max latency

  • Impact on other services we care most about

  • Storage capacity

System reliability and failover

  • What happens when:

    • Machine dies, rack fails, or cluster goes offline

    • Network fails between two datacenters

  • For each type of server that talks to other servers (its backends):

    • How to detect when backends die, and what to do when they die

    • How to terminate or restart without affecting clients or users

    • Load balancing, rate-limiting, timeout, retry and error handling behavior

  • Data backup/restore, disaster recovery
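
The retry and error handling item in this section can be made concrete with a small sketch. The code below shows one common approach, bounded retries with exponential backoff and full jitter. It is not the behavior the checklist prescribes. `TransientError` and `call_with_retries` are hypothetical names introduced for illustration, and the `sleep` and `rng` parameters are injectable only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for backend errors that are safe to retry."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on TransientError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            # Give up after the last attempt so a dead backend is not
            # hammered indefinitely.
            if attempt == max_attempts - 1:
                raise
            # The backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random fraction of the ceiling so many
            # clients do not retry in lockstep.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

Capping the number of attempts matters as much as the backoff itself: unbounded retries from every client can keep an overloaded backend from recovering.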

Monitoring and server management

  • Monitoring internal state, monitoring end-to-end behavior, managing alerts

  • Monitoring the monitoring

  • Financially important alerts and logs

  • Tips for running servers within a cluster environment

  • Don’t crash mail servers by sending yourself email alerts in your own server code

Security

  • Security design review, security code audit, spam risk, authentication, SSL

  • Prelaunch visibility/access control, various types of blacklists

Automation and manual tasks

  • Methods and change control to update servers, data, and configs

  • Release process, repeatable builds, canaries under live traffic, staged rollouts
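
One way to picture the staged rollout item above is a deterministic percentage gate: each server is hashed into a fixed bucket from 0 to 99, and a release reaches a server once the rollout percentage exceeds its bucket. This is a minimal sketch under that assumption, not a description of any particular release system. `rollout_bucket` and `in_rollout` are hypothetical names.

```python
import hashlib


def rollout_bucket(server_id: str) -> int:
    """Map a server ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then reduce to 0..99.
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(server_id: str, percent: int) -> bool:
    """Return True if this server should run the new release at this stage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # A server's bucket never changes, so raising percent only adds servers;
    # it never removes one that already received the release.
    return rollout_bucket(server_id) < percent
```

Starting at a small percentage gives a canary under live traffic, and the same gate then widens in stages without reshuffling which servers already run the new version.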

Growth issues

  • Spare capacity, 10x growth, growth alerts

  • Scalability bottlenecks, linear scaling, scaling with hardware, changes needed

  • Caching, data sharding/resharding
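
The spare capacity, 10x growth, and growth alert items above lend themselves to simple arithmetic checks. The sketch below assumes compound monthly growth, which is an assumption made for illustration. The function names and the 0.8 alert threshold are likewise hypothetical.

```python
import math


def utilization_alert(load: float, capacity: float, threshold: float = 0.8) -> bool:
    """Return True when load has reached the given fraction of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be > 0")
    return load / capacity >= threshold


def survives_10x(load: float, capacity: float) -> bool:
    """Return True if current capacity could absorb ten times today's load."""
    return load * 10 <= capacity


def months_until_full(load: float, capacity: float, monthly_growth: float) -> float:
    """Estimate months until load reaches capacity under compound growth.

    monthly_growth is a fraction, e.g. 0.10 for 10% per month.
    """
    if load <= 0 or capacity <= 0 or monthly_growth <= 0:
        raise ValueError("load, capacity, and monthly_growth must be > 0")
    if load >= capacity:
        return 0.0
    # Solve load * (1 + g) ** m == capacity for m.
    return math.log(capacity / load) / math.log(1.0 + monthly_growth)
```

A service at 100 units of load with 800 units of capacity, doubling every month, would run out of headroom in about three months by this estimate.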

External dependencies

  • Third-party systems, monitoring, networking, traffic volume, launch spikes

  • Graceful degradation, how to avoid accidentally overrunning third-party services

  • Playing nice with syndicated partners, mail systems, services within Google
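
To avoid accidentally overrunning a third-party service, outbound calls are often gated by a client-side rate limiter. The token bucket below is one common technique, sketched under assumed parameters. `TokenBucket` is a hypothetical class, not an API from the original checklist, and the `clock` argument is injectable only so the example can be tested deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `burst` calls at once, refilled at `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if rate_per_sec <= 0 or burst < 1:
            raise ValueError("rate_per_sec must be > 0 and burst >= 1")
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False to signal 'back off'."""
        now = self.clock()
        # Refill for the time elapsed since the last call, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `try_acquire()` before each request to the external system and, when it returns False, either queue the request or degrade gracefully instead of sending it.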

Schedule and rollout planning

  • Hard deadlines, external events, Mondays or Fridays

  • Standard operating procedures for this service, for other services