Book Description
Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.
Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.
Some of the 97 things you should know:
"Test Your Disaster Plan" - Tanya Reilly
"Integrating Empathy into SRE Tools" - Daniella Niyonkuru
"The Best Advice I Can Give to Teams" - Nicole Forsgren
"Where to SRE" - Fatema Boxwala
"Facing That First Page" - Andrew Louis
"I Have an Error Budget - Now What?" - Alex Hidalgo
"Get Your Work Recognized: Write a Brag Document" - Julia Evans and Karla Burnett
This open book is licensed under an Open Publication License (OPL). You can download the 97 Things Every SRE Should Know ebook for free in PDF format (3.4 MB).
Table of Contents
Part I
New to SRE
Chapter 1
Site Reliability Engineering in Six Words
Chapter 2
Do We Know Why We Really Want Reliability?
Chapter 3
Building Self-Regulating Processes
Chapter 4
Four Engineers of an SRE Seder
Chapter 5
The Reliability Stack
Chapter 6
Infrastructure: It's Where the Power Is
Chapter 7
Thinking About Resilience
Chapter 8
Observability in the Development Cycle
Chapter 9
There Is No Magic
Chapter 10
How Wikipedia Is Served to You
Chapter 11
Why You Should Understand (a Little) About TCP
Chapter 12
The Importance of a Management Interface
Chapter 13
When It Comes to Storage, Think Distributed
Chapter 14
The Role of Cardinality
Chapter 15
Security Is like an Onion
Chapter 16
Use Your Words
Chapter 17
Where to SRE
Chapter 18
Dear Future Team
Chapter 19
Sustainability and Burnout
Chapter 20
Don't Take Advice from Graybeards
Chapter 21
Facing That First Page
Part II
Zero to One
Chapter 22
SRE, at Any Size, Is Cultural
Chapter 23
Everyone Is an SRE in a Small Organization
Chapter 24
Auditing Your Environment for Improvements
Chapter 25
With Incident Response, Start Small
Chapter 26
Solo SRE: Effecting Large-Scale Change as a Single Individual
Chapter 27
Design Goals for SLO Measurement
Chapter 28
I Have an Error Budget - Now What?
Chapter 29
How to Change Things
Chapter 30
Methodological Debugging
Chapter 31
How Startups Can Build an SRE Mindset
Chapter 32
Bootstrapping SRE in Enterprises
Chapter 33
It's Okay Not to Know, and It's Okay to Be Wrong
Chapter 34
Storytelling Is a Superpower
Chapter 35
Get Your Work Recognized: Write a Brag Document
Part III
One to Ten
Chapter 36
Making Work Visible
Chapter 37
An Overlooked Engineering Skill
Chapter 38
Unpacking the On-Call Divide
Chapter 39
The Maestros of Incident Response
Chapter 40
Effortless Incident Management
Chapter 41
If You're Doing Runbooks, Do Them Well
Chapter 42
Why I Hate Our Playbooks
Chapter 43
What Machines Do Well
Chapter 44
Integrating Empathy into SRE Tools
Chapter 45
Using ChatOps to Implement Empathy
Chapter 46
Move Fast to Unbreak Things
Chapter 47
You Don't Know for Sure Until It Runs in Production
Chapter 48
Sometimes the Fix Is the Problem
Chapter 49
Legendary
Chapter 50
Metrics Are Not SLIs (The Measure Everything Trap)
Chapter 51
When SLOs Attack: Pathological SLOs and How to Fix Them
Chapter 52
Holistic Approach to Product Reliability
Chapter 53
In Search of the Lost Time
Chapter 54
Unexpected Lessons from Office Hours
Chapter 55
Building Tools for Internal Customers that They Actually Want to Use
Chapter 56
It's About the Individuals and Interactions
Chapter 57
The Human Baseline in SRE
Chapter 58
Remotely Productive or Productively Remote
Chapter 59
Of Margins and Individuals
Chapter 60
The Importance of Margins in Systems
Chapter 61
Fewer Spreadsheets, More Napkins
Chapter 62
Sneaking in Your DevOps Deliciously
Chapter 63
Effecting SRE Cultural Changes in Enterprises
Chapter 64
To All the SREs I've Loved
Chapter 65
Complex: The Most Overloaded Word in Technology
Part IV
Ten to Hundred
Chapter 66
The Best Advice I Can Give to Teams
Chapter 67
Create Your Supporting Artifacts
Chapter 68
The Order of Operations for Getting SLO Buy-In
Chapter 69
Heroes Are Necessary, but Hero Culture Is Not
Chapter 70
On-Call Rotations that People Want to Join
Chapter 71
Study of Human Factors and Team Culture to Improve Pager Fatigue
Chapter 72
Optimize for MTTBTB (Mean Time to Back to Bed)
Chapter 73
Mitigating and Preventing Cascading Failures
Chapter 74
On-Call Health: The Metric You Could Be Measuring
Chapter 75
Helping Leaders Prioritize On-Call Health
Chapter 76
The SRE as a Diplomat
Chapter 77
The Forward-Deployed SRE
Chapter 78
Test Your Disaster Plan
Chapter 79
Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
Chapter 80
The Power of Uniformity
Chapter 81
Bytes per User Value
Chapter 82
Make Your Engineering Blog a Priority
Chapter 83
Don't Let Anyone Run Code in Your Context
Chapter 84
Trading Places: SRE and Product
Chapter 85
You See Teams, I See Product
Chapter 86
The Performance Emergency Fund
Chapter 87
Important but Not Urgent: Roadmaps for SREs
Part V
The Future of SRE
Chapter 88
That 50% Thing
Chapter 89
Following the Path of Safety-Critical Systems
Chapter 90
Applicable and Achievable Static Analysis
Chapter 91
The Importance of Formal Specification
Chapter 92
Risk and Rot in Sociotechnical Systems
Chapter 93
SRE in Crisis
Chapter 94
Expected Risk Limitations
Chapter 95
Beyond Local Risk: Accounting for Angry Birds
Chapter 96
A Word from Software Safety Nerds
Chapter 97
Incidents: A Window into Gaps
Chapter 98
The Third Age of SRE