Launch Coordination Checklist

This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:

Architecture

  • Architecture sketch, types of servers, types of requests from clients

  • Programmatic client requests

Machines and datacenters

  • Machines and bandwidth, datacenters, N+2 redundancy, network QoS (see the note after this list)

  • New domain names, DNS load balancing
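
A reasoning step the checklist leaves implicit: N+2 provisioning means enough capacity to keep serving peak load with one instance down for planned maintenance and another lost to unplanned failure. With illustrative numbers:

    peak load   = 3 clusters' worth of capacity
    provisioned = 3 + 2 = 5 clusters
                  (1 down for planned maintenance + 1 unplanned failure, 3 still serving)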

Volume estimates, capacity, and performance

  • HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out

  • Load test, end-to-end test, capacity per datacenter at max latency

  • Impact on other services we care most about

  • Storage capacity

System reliability and failover

  • What happens when:

    • Machine dies, rack fails, or cluster goes offline

    • Network fails between two datacenters

  • For each type of server that talks to other servers (its backends):

    • How to detect when backends die, and what to do when they die

    • How to terminate or restart without affecting clients or users

    • Load balancing, rate-limiting, timeout, retry, and error handling behavior (see the sketch after this list)

  • Data backup/restore, disaster recovery
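
A minimal Go sketch of the timeout, retry, and error-handling behavior this list asks about. callBackend is a hypothetical stand-in for a real RPC stub, and the deadlines and attempt counts are illustrative; a production client would add retry budgets and circuit breaking on top of the per-attempt timeouts and jittered exponential backoff shown here.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    // callBackend is a hypothetical stand-in for a real RPC stub to one backend.
    func callBackend(ctx context.Context) error {
        if rand.Intn(3) == 0 { // simulate a transient failure
            return errors.New("backend unavailable")
        }
        return nil
    }

    // callWithRetry bounds each attempt with a per-attempt timeout and backs off
    // exponentially with jitter so retries from many clients don't synchronize.
    func callWithRetry(ctx context.Context, attempts int) error {
        backoff := 50 * time.Millisecond
        var err error
        for i := 0; i < attempts; i++ {
            attemptCtx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
            err = callBackend(attemptCtx)
            cancel()
            if err == nil {
                return nil
            }
            select {
            case <-ctx.Done():
                return ctx.Err() // overall deadline hit; stop retrying
            case <-time.After(backoff + time.Duration(rand.Int63n(int64(backoff)))):
                backoff *= 2
            }
        }
        return fmt.Errorf("all %d attempts failed: %w", attempts, err)
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        fmt.Println(callWithRetry(ctx, 3))
    }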

Monitoring and server management

  • Monitoring internal state, monitoring end-to-end behavior, managing alerts (see the sketch after this list)

  • Monitoring the monitoring

  • Financially important alerts and logs

  • Tips for running servers within a cluster environment

  • Don’t crash mail servers by sending yourself email alerts in your own server code
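
A minimal sketch of exporting internal state for monitoring, using Go's standard expvar and net/http packages. The metric name, port, and /healthz endpoint are illustrative assumptions: counters appear at /debug/vars for a monitoring system to scrape, and /healthz gives black-box probers a cheap end-to-end check of the serving path.

    package main

    import (
        "expvar"
        "fmt"
        "log"
        "net/http"
    )

    // requests is exported automatically at /debug/vars, so external monitoring
    // can scrape internal state instead of inferring it from logs.
    var requests = expvar.NewInt("requests_total")

    func handler(w http.ResponseWriter, r *http.Request) {
        requests.Add(1)
        fmt.Fprintln(w, "ok")
    }

    // healthz gives black-box probers a cheap end-to-end check of the serving path.
    func healthz(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintln(w, "healthy")
    }

    func main() {
        http.HandleFunc("/", handler)
        http.HandleFunc("/healthz", healthz)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }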

Security

  • Security design review, security code audit, spam risk, authentication, SSL

  • Prelaunch visibility/access control, various types of blacklists

Automation and manual tasks

  • Methods and change control to update servers, data, and configs

  • Release process, repeatable builds, canaries under live traffic, staged rollouts (see the sketch below)
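
One way a staged rollout can select instances, sketched in Go under the assumption that each instance has a stable ID: hashing the ID makes the selection deterministic, so the same instances stay in the canary as the rollout percentage grows.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // inRollout deterministically assigns an instance to a rollout stage by
    // hashing its ID, so the same instances stay in the canary as the
    // percentage grows from 1% toward 100%.
    func inRollout(instanceID string, percent uint32) bool {
        h := fnv.New32a()
        h.Write([]byte(instanceID))
        return h.Sum32()%100 < percent
    }

    func main() {
        for _, id := range []string{"server-001", "server-002", "server-003"} {
            fmt.Printf("%s: canary(5%%)=%v staged(50%%)=%v\n",
                id, inRollout(id, 5), inRollout(id, 50))
        }
    }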

Growth issues

  • Spare capacity, 10x growth, growth alerts

  • Scalability bottlenecks, linear scaling, scaling with hardware, changes needed

  • Caching, data sharding/resharding (see the sketch below)
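
A minimal Go sketch of hash-based data sharding. Mod-N placement, shown here, remaps most keys whenever the shard count changes, which is exactly the resharding cost this list flags; consistent hashing is the usual technique for limiting that movement.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // shardFor maps a key to one of n shards. With mod-N placement, changing n
    // remaps most keys, which is the resharding cost the checklist flags.
    func shardFor(key string, n uint32) uint32 {
        h := fnv.New32a()
        h.Write([]byte(key))
        return h.Sum32() % n
    }

    func main() {
        for _, key := range []string{"user:42", "user:43", "user:44"} {
            fmt.Printf("%s -> shard %d of 8, shard %d of 16\n",
                key, shardFor(key, 8), shardFor(key, 16))
        }
    }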

External dependencies

  • Third-party systems, monitoring, networking, traffic volume, launch spikes

  • Graceful degradation, how to avoid accidentally overrunning third-party services (see the sketch after this list)

  • Playing nice with syndicated partners, mail systems, services within Google
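
A minimal Go sketch of client-side rate limiting, one way to avoid accidentally overrunning a third-party service. The rates and capacities are illustrative; a caller that is refused should degrade gracefully (for example, serve cached or partial results) rather than queue indefinitely.

    package main

    import (
        "fmt"
        "time"
    )

    // bucket caps outbound calls to a third-party service: at most `capacity`
    // calls in a burst, refilled at ratePerSec tokens per second.
    type bucket struct {
        tokens chan struct{}
    }

    func newBucket(ratePerSec, capacity int) *bucket {
        b := &bucket{tokens: make(chan struct{}, capacity)}
        for i := 0; i < capacity; i++ {
            b.tokens <- struct{}{}
        }
        go func() {
            for range time.Tick(time.Second / time.Duration(ratePerSec)) {
                select {
                case b.tokens <- struct{}{}:
                default: // bucket already full; drop the refill
                }
            }
        }()
        return b
    }

    // allow reports whether a call may proceed now. Callers that get false
    // should degrade gracefully rather than queue indefinitely.
    func (b *bucket) allow() bool {
        select {
        case <-b.tokens:
            return true
        default:
            return false
        }
    }

    func main() {
        b := newBucket(5, 2)
        for i := 0; i < 4; i++ {
            fmt.Println("call", i, "allowed:", b.allow())
        }
    }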

Schedule and rollout planning

  • Hard deadlines, external events, Mondays or Fridays

  • Standard operating procedures for this service, for other services

Example Production Meeting Minutes

Date: 2015-10-23

Attendees: agoogler, clarac, docbrown, jennifer, martym

Announcements:

  • Major outage (#465), blew through error budget

Previous Action Item Review

  • Certify Goat Teleporter for use with cattle (bug 1011101)

    • Nonlinearities in mass acceleration now predictable, should be able to target accurately in a few days.

Outage Review

  • New Sonnet (outage 465)

    • 1.21B queries lost due to cascading failure after interaction between latent bug (leaked file descriptor on searches with no results) + not having new sonnet in corpus + unprecedented & unexpected traffic volume

    • File descriptor leak bug fixed (bug 5554825) and deployed to prod

    • Looking into using flux capacitor for load balancing (bug 5554823) and using load shedding (bug 5554826) to prevent recurrence

    • Annihilated availability error budget; pushes to prod frozen for 1 month unless docbrown can obtain exception on grounds that event was bizarre & unforeseeable (but consensus is that exception is unlikely)

Paging Events

  • AnnotationConsistencyTooEventual: paged 5 times this week, likely due to cross-regional replication delay between Bigtables.

    • Investigation still ongoing, see bug 4821600

    • No fix expected soon, will raise acceptable consistency threshold to reduce unactionable alerts

Nonpaging Events

  • None

Monitoring Changes and/or Silences

  • AnnotationConsistencyTooEventual, acceptable delay threshold raised from 60s to 180s, see bug 4821600; TODO(martym).

Planned Production Changes

  • USA-1 cluster going offline for maintenance between 2015-10-29 and 2015-11-02.

    • No response required, traffic will automatically route to other clusters in region.

Resources

  • Borrowed resources to respond to sonnet++ incident, will spin down additional server instances and return resources next week

  • Utilization at 60% of CPU, 75% RAM, 44% disk (up from 40%, 70%, 40% last week)

Key Service Metrics

  • OK 99ile latency: 88 ms < 100 ms SLO target [trailing 30 days]

  • BAD availability: 86.95% < 99.99% SLO target [trailing 30 days]
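
The arithmetic behind the push freeze, implied by the minutes rather than stated in them: a 99.99% availability SLO leaves a 30-day error budget of 0.01% of requests, while 86.95% measured availability means roughly 13.05% of requests failed.

    error budget  = 100% - 99.99% = 0.01% of requests
    actual errors = 100% - 86.95% = 13.05% of requests
    overrun       = 13.05% / 0.01% ≈ 1,305× the 30-day budget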

Discussion / Project Updates

  • Project Molière launching in two weeks.

New Action Items

  • TODO(martym): Raise AnnotationConsistencyTooEventual threshold.

  • TODO(docbrown): Return instance count to normal and return resources.