20 Essential Books for Site Reliability Engineers

Published February 26, 2020

Peter Murray

Site Reliability Engineering (SRE) continues to evolve its practices and expand its presence across industries. Whether you’re a seasoned SRE or just starting out in the field, the Catchpoint team has compiled this must-read list of site reliability engineering books. The list covers classics and new releases on topics such as SRE implementation, systems thinking, post-incident review, SLA management, and more!

Accelerate: Building & Scaling High Performing Technology Organizations

By Nicole Forsgren, PhD, Jez Humble, & Gene Kim

The promise of BizDevOps has been to cultivate an organization where technology drives business value, but the performance of software delivery teams has been missing from the conversation. The publication of Accelerate fills this gap. It is a ground-breaking presentation of four years of research into, and statistical analysis of, the capabilities and practices most important to the development and delivery of software products.

A Seat at the Table: IT Leadership in the Age of Agility

By Mark Schwartz

Whether IT departments report to CIOs, CISOs, CDOs, or even CTOs (to name only a few), there’s no denying the state of flux that IT leadership finds itself in today. In A Seat at the Table, Schwartz, an experienced CIO, lays out a blueprint for what IT leadership should be: a value creation engine. Part field guide, part manifesto, the book is a call to action for IT professionals: becoming an Agile IT leader requires the courage to cast off traditional thinking and a willingness to fail, a tenet that we know is near and dear to some of today’s most successful SRE teams.

Continuous Delivery: Reliable Software Released through Build, Test, & Deployment Automation

By Jez Humble & David Farley

Your latest software release was designed, developed, and deployed, but it’s not getting in front of your target audience. Figuring out what to do in this scenario, or ideally avoiding this scenario altogether, is the aim of Continuous Delivery. Humble and Farley set out strategic principles and tactical practices that enable continuous, incremental software delivery. For SREs looking to reduce toil, the chapter on automated acceptance testing is a must-read.

Data Visualisation: A Handbook for Data Driven Design

By Andy Kirk

In Data Visualisation, Kirk provides a handy resource for deciding which visualizations to use for deeper data analysis, post-mortem reviews, SLM negotiations, external presentations, and more.

Foundations of Service Level Management

By Rick Sturm, Wayne Morris & Mary Jander

The rapid rise of SaaS and IT-as-a-service means SLMs and SLAs are even more important now than when Foundations of Service Level Management was released in 2000. The authors present strategies for developing and enforcing SLAs with third-party vendors and service providers. They also provide pertinent insight for us now by showing how vendors and providers can optimize their own practices.

High Performance Web Sites: Essential Knowledge for Front-End Engineers

By Steve Souders

When Souders published the now-classic High Performance Web Sites in 2007, he shocked many web developers by claiming that the client side accounts for 80% of the time it takes a web page to load. To reduce response and page load times, Souders presents 14 specific rules for optimizing website performance. Many of these rules still hold true, but SREs looking for even more tips on improving site and application performance should also check out Souders’ sequel, Even Faster Web Sites (2009).

Inspired: How to Create Tech Products Customers Love

By Marty Cagan

In the age of customer experience, companies compete on delivering the best experience possible, which has raised the bar on what customers expect from digital interactions. Originally published ten years ago, Inspired is arguably more relevant today in its recently released second edition, in which Cagan provides insight into assembling customer-centric teams and designing, developing, and delivering products that exceed market demand and business objectives.

Platform Revolution (2017)

By Geoffrey G. Parker, Marshall W. Van Alstyne, & Sangeet Paul Choudary

From SaaS to IaaS and now to XaaS, we are inundated with tech acronyms heralding digital disruptions and transformations, but we have few insights into the mechanisms and behaviors driving these business model changes. In Platform Revolution, the authors take a deep dive into the Platform-as-a-Service phenomenon. They examine the historical context, operational tactics, and economic impact of the emergence of PaaS organizations and their effect on our interactions with technology. A must-read for our brave new world.

Post-Incident Reviews: Learning from Failure for Improved Incident Response

By Jason Hand

While IT environments have drastically changed, the same cannot be said of post-incident reviews. The Post-Incident Reviews report addresses the shortcomings of traditional post-incident review techniques, like root cause analysis, when it comes to understanding problems and preventing them from recurring in complex, distributed IT systems.

Practical Reliability Engineering, 5th Edition

By Patrick D. T. O’Connor & Andre Kleyner

Practical Reliability Engineering presents high-level reliability theory concepts alongside practical real-world applications and industry best practices. This comprehensive approach to reliability will appeal to a wide range of engineering professionals, but SREs will find chapters on software reliability, analyzing reliability data, and maintainability, maintenance, and availability especially insightful.

Principles of Network and System Administration

By Mark Burgess

Released 15 years ago, this foundational text introduces overarching principles and operational tactics for establishing, configuring, and maintaining computer systems and networks. This is a must-have resource for your library whether you’re a seasoned or novice SRE.

Seeking SRE: Conversations About Running Production Systems at Scale

By David N. Blank-Edelman

After the success of Site Reliability Engineering: How Google Runs Production Systems (2016), demand for more SRE content accelerated, especially on nurturing SRE practices at non-tech organizations. Seeking SRE meets this need with essays from nearly 40 SREs and tech professionals following SRE practices. We’re recommending this book, however, because the contributors focus on humans, not technology, in presenting what SREs can do for people.

Site Reliability Engineering: How Google Runs Production Systems (2016)

By Betsy Beyer, Chris Jones, Jennifer Petoff & Niall R. Murphy

When it comes to essential SRE reading, there’s no better place to start than with this 2016 collection of essays. Each chapter is rooted in the personal experiences of industry experts involved in putting business-IT into practice at Google. “The most impressive thing of all about this book is its very existence,” observe the editors, who then go on to remind us that “[i]mplementations are ephemeral, but documented reasoning is priceless.” We couldn’t agree more.

The Field Guide to Understanding ‘Human Error’, 3rd Edition

By Sidney Dekker

While embracing failure is a core tenet among SREs, it can be much more difficult to bring risk-averse business leaders around to realizing the long-term value of failure. The Field Guide to Understanding ‘Human Error’, now in its third edition, stages an intervention into how organizations perceive ‘human error’ problems. The Field Guide moves the conversation on ‘human error’ forward, rethinking accidents, post-mortems, and our safety systems.

The Human Side of Postmortems

By Dave Zwieback

As reported in the 2019 SRE Report, stress levels of SREs are at an all-time high. And yet, few how-tos or guides on running postmortems address how stress and other human factors can contribute to and even prolong an outage. The Human Side of Postmortems makes the case for why SRE and DevOps teams need both a technical and a human postmortem to mitigate stress-induced mistakes during an outage.

The Lean Product Playbook: How to Innovate with Minimum Viable Products & Rapid Customer Feedback

By Dan Olsen

This practical, no-nonsense guide is a great go-to resource for small, or even one-person, SRE teams looking to improve or adopt lean thinking workflows. With step-by-step instructions and processes, The Lean Product Playbook can help SRE teams establish themselves as integral partners in accomplishing organizational objectives in any industry.

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2

By Thomas A. Limoncelli, Strata R. Chalup & Christina J. Hogan

As more and more organizations migrate to “the cloud,” what can DevOps/SRE principles and practices do to help redefine and reposition Information Technology departments? The authors of this volume provide case studies on operating and running systems at industry giants like Netflix, Etsy, and Amazon while highlighting why distributed systems require a fundamentally different approach to system administration, one that may not be offered by your cloud services provider.

The Site Reliability Workbook: Practical Ways to Implement SRE

By Betsy Beyer, Niall R. Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne

The highly anticipated sequel to Site Reliability Engineering (2016) expands upon its predecessor with a hands-on focus that presents concrete examples of SRE in action. “The purpose of this second SRE book is (a) to add more implementation detail to the principles outlined in the first volume,” the editors explain. But for us the second reason is key: “(b) to dispel the idea that SRE is implementable only at ‘Google scale’ or in ‘Google Culture.’”

The Systems Bible: The Beginner’s Guide to Systems Large and Small, 3rd Edition

By John Gall

This systems engineering treatise expands on Gall’s field-defining insights into system failures, which hold that failure is an intrinsic feature of systems. For SREs, The Systems Bible offers 40 chapters on the benefits of conceptualizing systems premised on failure when it comes to measuring, optimizing, and managing systems both big and small.

Thinking, Fast and Slow

By Daniel Kahneman

In Thinking, Fast and Slow, Kahneman presents two systems, one slow, one fast, that drive the way we think, and then examines how these systems guide our professional and personal choices. We recommend the discussion on the inside versus outside view and the problems that arise when teams, or entire organizations, extrapolate and forecast based only on an inside view that fails to account for “unknown unknowns.” The outside view, or what we call the end user experience, provides the baseline needed when making predictions and long-term investments.
