Launch Coordination Checklist
This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:
Architecture
- Architecture sketch, types of servers, types of requests from clients
- Programmatic client requests
Machines and datacenters
- Machines and bandwidth, datacenters, N+2 redundancy, network QoS
- New domain names, DNS load balancing
Volume estimates, capacity, and performance
- HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out (see the capacity sketch after this list)
- Load test, end-to-end test, capacity per datacenter at max latency
- Impact on other services we care most about
- Storage capacity
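Capacity and bandwidth estimates like those above usually start as back-of-envelope arithmetic before any load test. A minimal sketch of that arithmetic; the QPS, response size, and datacenter count are purely illustrative assumptions, not figures from any real launch:

```python
# Back-of-envelope launch capacity estimate (illustrative numbers only).

steady_qps = 5_000           # expected steady-state queries per second
launch_spike_factor = 3      # launch "spike": assume 3x steady state at peak
growth_6mo_factor = 2        # expected traffic growth six months out
avg_response_bytes = 20_000  # average HTTP response payload

peak_qps = steady_qps * launch_spike_factor * growth_6mo_factor
peak_gbps = peak_qps * avg_response_bytes * 8 / 1e9

# N+2 redundancy: peak load must still fit with the two largest failure
# domains offline (e.g. one planned maintenance plus one unplanned outage).
datacenters = 5
qps_per_datacenter = peak_qps / (datacenters - 2)

print(f"peak QPS: {peak_qps}, peak bandwidth: {peak_gbps:.1f} Gbps")
print(f"each datacenter must handle {qps_per_datacenter:.0f} QPS at max latency")
```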
System reliability and failover
- What happens when:
  - Machine dies, rack fails, or cluster goes offline
  - Network fails between two datacenters
- For each type of server that talks to other servers (its backends):
  - How to detect when backends die, and what to do when they die
  - How to terminate or restart without affecting clients or users
  - Load balancing, rate-limiting, timeout, retry and error handling behavior (see the retry sketch after this list)
- Data backup/restore, disaster recovery
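The backend-handling items above tend to reduce to a few defensive patterns: a per-attempt timeout, a bounded number of retries, and backoff so retries don't amplify an outage. A minimal sketch, assuming a hypothetical call_backend function supplied by the caller:

```python
import random
import time

class BackendError(Exception):
    """Raised when a backend call fails or times out."""

def call_with_retries(call_backend, request, *,
                      attempts=3, timeout_s=0.5, base_backoff_s=0.1):
    """Call a backend with a per-attempt timeout, bounded retries,
    and jittered exponential backoff between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call_backend(request, timeout=timeout_s)
        except BackendError as e:
            last_error = e
            if attempt + 1 < attempts:
                # Jitter avoids synchronized retry storms across clients.
                sleep_s = base_backoff_s * (2 ** attempt) * random.uniform(0.5, 1.5)
                time.sleep(sleep_s)
    # Surface the failure so the caller can degrade gracefully rather than
    # retrying indefinitely and adding load to an already-unhealthy backend.
    raise last_error
```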
Monitoring and server management
- Monitoring internal state, monitoring end-to-end behavior, managing alerts (see the prober sketch after this list)
- Monitoring the monitoring
- Financially important alerts and logs
- Tips for running servers within cluster environment
- Don’t crash mail servers by sending yourself email alerts in your own server code
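Monitoring internal state alone can miss user-visible failures, so it is usually paired with a black-box prober, plus a heartbeat so a silently dead prober is itself noticed ("monitoring the monitoring"). A minimal sketch; the probe URL and the report metrics hook are hypothetical:

```python
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"   # hypothetical end-to-end probe target

def probe_once(timeout_s=5.0):
    """Issue one end-to-end request and return (success, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=timeout_s) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    return ok, time.monotonic() - start

def run_prober(report, interval_s=60):
    """Probe forever; 'report' is a hypothetical metrics hook.
    The heartbeat metric lets a separate alert fire if the prober
    itself stops running -- monitoring the monitoring."""
    while True:
        ok, latency = probe_once()
        report("probe_success", int(ok))
        report("probe_latency_seconds", latency)
        report("prober_heartbeat_timestamp", time.time())
        time.sleep(interval_s)
```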
Security
- Security design review, security code audit, spam risk, authentication, SSL
- Prelaunch visibility/access control, various types of blacklists
Automation and manual tasks
- Methods and change control to update servers, data, and configs
- Release process, repeatable builds, canaries under live traffic, staged rollouts (see the rollout sketch after this list)
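A staged rollout gates each traffic increase on canary health under live traffic. A minimal sketch; the stage fractions, error threshold, soak time, and the set_traffic_fraction/canary_error_rate hooks are all illustrative assumptions:

```python
import time

# Illustrative stages and abort threshold -- not a real release policy.
ROLLOUT_STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]   # fraction of traffic on the new build
MAX_CANARY_ERROR_RATE = 0.001

def staged_rollout(set_traffic_fraction, canary_error_rate, soak_seconds=3600):
    """Push a release through increasing traffic fractions, watching the
    canary at each stage and rolling back on elevated errors.

    set_traffic_fraction and canary_error_rate are hypothetical hooks into
    the serving infrastructure and the monitoring system."""
    for fraction in ROLLOUT_STAGES:
        set_traffic_fraction(fraction)
        time.sleep(soak_seconds)              # let the canary soak under live traffic
        if canary_error_rate() > MAX_CANARY_ERROR_RATE:
            set_traffic_fraction(0.0)         # roll back to the previous release
            raise RuntimeError(f"rollout aborted at {fraction:.0%} traffic")
```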
Growth issues
- Spare capacity, 10x growth, growth alerts
- Scalability bottlenecks, linear scaling, scaling with hardware, changes needed
- Caching, data sharding/resharding (see the sharding sketch after this list)
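Resharding is far cheaper when keys are placed with consistent hashing, since adding or removing a shard moves only a small fraction of the keyspace. A minimal sketch; the shard names and replica count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to shards so that adding or removing a shard
    only moves roughly 1/N of the keys (useful for resharding)."""

    def __init__(self, shards, replicas=100):
        self._ring = []                      # sorted list of (hash, shard)
        for shard in shards:
            for i in range(replicas):       # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{shard}-{i}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key):
        """Return the shard owning the first ring position at or after hash(key)."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
print(ring.shard_for("user:42"))
```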
External dependencies
- Third-party systems, monitoring, networking, traffic volume, launch spikes
- Graceful degradation, how to avoid accidentally overrunning third-party services (see the rate-limiting sketch after this list)
- Playing nice with syndicated partners, mail systems, services within Google
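One common way to avoid overrunning a third-party service is a client-side token bucket capped at the rate agreed with the partner, independent of whatever limits they enforce. A minimal sketch with illustrative numbers:

```python
import time

class TokenBucket:
    """Client-side rate limiter so outbound traffic (including launch spikes)
    never exceeds an agreed request rate to a third-party service."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may be sent now, refilling tokens first."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should degrade gracefully, not queue forever

# Illustrative limit: 50 QPS agreed with the partner, bursts up to 100.
limiter = TokenBucket(rate_per_s=50, burst=100)
```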
Schedule and rollout planning
- Hard deadlines, external events, Mondays or Fridays
- Standard operating procedures for this service, for other services
Example Production Meeting Minutes
Attendees: agoogler, clarac, docbrown, jennifer, martym
Announcements:
- Major outage (#465), blew through error budget
Previous Action Item Review
- Certify Goat Teleporter for use with cattle (bug 1011101)
  - Nonlinearities in mass acceleration now predictable, should be able to target accurately in a few days.
Outage Review
- New Sonnet (outage 465)
  - 1.21B queries lost due to cascading failure after interaction between latent bug (leaked file descriptor on searches with no results) + not having new sonnet in corpus + unprecedented & unexpected traffic volume
  - File descriptor leak bug fixed (bug 5554825) and deployed to prod
  - Looking into using flux capacitor for load balancing (bug 5554823) and using load shedding (bug 5554826) to prevent recurrence
  - Annihilated availability error budget; pushes to prod frozen for 1 month unless docbrown can obtain exception on grounds that event was bizarre & unforeseeable (but consensus is that exception is unlikely)
Paging Events
- AnnotationConsistencyTooEventual: paged 5 times this week, likely due to cross-regional replication delay between Bigtables.
  - Investigation still ongoing, see bug 4821600
  - No fix expected soon, will raise acceptable consistency threshold to reduce unactionable alerts
Nonpaging Events
- None
Monitoring Changes and/or Silences
- AnnotationConsistencyTooEventual, acceptable delay threshold raised from 60s to 180s, see bug 4821600; TODO(martym).
Planned Production Changes
- USA-1 cluster going offline for maintenance between 2015-10-29 and 2015-11-02.
  - No response required, traffic will automatically route to other clusters in region.
Resources
- Borrowed resources to respond to sonnet++ incident, will spin down additional server instances and return resources next week
- Utilization at 60% of CPU, 75% RAM, 44% disk (up from 40%, 70%, 40% last week)
Key Service Metrics
- OK 99ile latency: 88 ms < 100 ms SLO target [trailing 30 days]
- BAD availability: 86.95% < 99.99% SLO target [trailing 30 days] (see the budget arithmetic after this list)
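For context on why the error budget was annihilated: treating availability as a fraction of the trailing 30-day window purely for illustration, a 99.99% SLO allows only a few minutes of unavailability, while 86.95% implies thousands of minutes:

```python
# Error budget arithmetic for a 30-day window (illustrative, time-based view).
window_minutes = 30 * 24 * 60                        # 43,200 minutes

slo = 0.9999
budget_minutes = window_minutes * (1 - slo)          # ~4.3 minutes allowed

measured = 0.8695
bad_minutes = window_minutes * (1 - measured)        # ~5,638 minutes of unavailability

print(f"budget: {budget_minutes:.1f} min, consumed: {bad_minutes:.0f} min")
print(f"budget consumed roughly {bad_minutes / budget_minutes:.0f}x over")
```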
Discussion / Project Updates
- Project Molière launching in two weeks.
New Action Items
