What is SRE?
SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and AppEngine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.
Our job is a combination not found elsewhere in the industry. Like traditional operations groups, we keep important, revenue-critical systems up and running despite hurricanes, bandwidth outages, and configuration errors. Unlike traditional operations groups, we view software as the primary tool through which our systems are managed, maintained, and minded; to that end, we have the source-level access and moral authority required to fix, extend, and scale code to keep it working, harden it against the vagaries of the Internet, and develop our own planet-scale platforms. We hire people from both systems and software backgrounds, and an informed mix is even better.
Just as what we do is unique, where we do it is unique too. At Google, we have the good fortune to have developed many large systems, ranging from planet-spanning databases to near real-time scalable data warehousing to fault-tolerant datastream joining. In SRE, we flip from the fine-grained detail of disk driver IO scheduling to the big picture of continental-level service capacity, across a range of systems and a user population measured in billions. We own those products in production. We drive reliability and performance across massive scale by mastering the full depth of the stack. We literally do learn something new every day – usually surprising things – and (for algorithm fans) there isn’t a small N anywhere in our job.
"Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized."
Marc Alvidrez, Site Reliability Engineer, Mountain View
"Fundamentally, it's what happens when you ask a software engineer to design an operations function."
Ben Treynor Sloss, Vice President, Google Engineering, Founder of Google SRE
"Here’s what you do when someone breaks something or finds something very difficult to debug: You say thank you. Thank you for finding this edge case. Thank you for highlighting this overcomplicated part of our system. Thank you for pointing out this gap in our docs. And then you go make it so nobody can break it the same way again."
Tanya Reilly, Site Reliability Engineer, New York City
"SREs engineer services, instead of binaries. This is a shift in perspective that exploits unusual skills and creativity. SREs are specialists in making changes safely."
John T. Reese, Site Reliability Engineer, San Francisco
Hear from our SREs
Hear four veteran Googlers describe their experiences as SREs: how their backgrounds led them to their current roles, what their day-to-day work looks like, and how they've seen the core questions SRE tackles (stability vs. agility, operational work vs. software engineering, proactive vs. reactive work) play out.
Interested in joining SRE?
Google strives to cultivate an inclusive workplace. We believe diversity of perspectives and ideas leads to better discussions, decisions, and outcomes for everyone.
Visit SRE careers page