SRE Classroom: Distributed PubSub

Introduction

SRE Classroom: Distributed PubSub is a workshop developed by Google’s Site Reliability Engineering group. The workshop has two goals: (1) to introduce participants to the principles of non-abstract large systems design (NALSD), and (2) to give them hands-on experience applying these principles to the design and evaluation of such systems. We consider NALSD a concept fundamental to SRE, and understanding its principles provides a basis for meaningful conversations about the design and operation of large software systems.

In the first, theoretical part of the workshop, participants learn about foundational large system design principles and concepts. Topics include correctness, reliability, performance, and different styles of inter-system communication, among others. We introduce the problem requirements in detail and walk through the first parts of an example solution.

The practical part of this workshop asks participants to apply the principles they have learned to develop a Publish-Subscribe system that meets certain performance and correctness requirements and Service Level Objectives (SLOs).

The workshop concludes with a detailed example solution, as well as a discussion of the system’s inputs and SLOs.

Target Audience

This workshop includes technical content, and its primary audience is software developers and site reliability engineers. We have also welcomed folks in various other roles to this workshop, including product managers and senior engineering managers.

The hands-on work is well-suited to groups of five, and the workshop scales well from 1 to 20 groups, or as many as a hundred participants!

Workshop Materials

This presentation is the backbone of the workshop. It contains the training content that prepares participants for the practical exercises. Detailed speaker notes make it possible for presenters to deliver the workshop with minimal preparation. We also provide a Presenter Guide with additional tips and guidance for leading the workshop.

The Participant Handout contains additional details about the exercise. The Latency Numbers Everyone Should Know handout contains reference numbers that are useful for back-of-the-envelope calculations. The NALSD Workbook contains reference material that is useful both during the workshop and more generally when applying the NALSD approach to solving system design problems.

The Facilitator Guide contains tips and guidance for facilitators of the workshop. Facilitators should read it ahead of time to prepare for making the workshop an awesome experience for everyone involved. The breakout template can be used to set up breakout groups during the hands-on portion of the workshop. Either the facilitators or the presenter can do this preparation step, so be sure to coordinate and make a game plan ahead of time!

Additional Resources

We aim to develop durable SRE Classroom materials for folks learning about NALSD. If you find this useful, tell us what you want to see in future exercises. Please use the issue tracker to send us your thoughts and suggestions. Alternatively, send us a tweet at @googlesre.

Want to learn more about NALSD? Here are some additional resources to explore:

Licensing

The workshop documents above are released under the Creative Commons CC-BY-4.0 license for anyone to use and reuse, as long as Google is credited as the original author. If you want to suggest improvements, have any problems with the content, or just want to ask a question, please create a bug in our issue tracker component.