The SRE Reading List: Books and Resources That Actually Help You Keep Things Running

A small disclaimer before we begin. I am writing this in April 2026, and I am fully aware that some version of a young engineer in 2031 will stumble on this post and react the way a bored schoolkid does on a museum field trip: “Lol, they used to draw bison on cave walls.” Yes, I am about to recommend books — physical, dead-tree, thousand-page books — as the best way to learn about systems. In an age where an AI agent can explain TCP congestion control in thirty seconds. I know how this looks. But here is my bet: the agent learned it from these same books, and if you want to actually understand what it is telling you, you will too. So, cave walls it is.

There is a well-known tradition at Google: when you get invited to an SRE interview, the recruiter sends you an email with a reading list. The books on that list are not interview tricks or leetcode shortcuts — they are the foundational texts that working SREs rely on every day. TCP/IP, Linux internals, systems programming, operating systems.

I have been building monitoring and infrastructure tools for over a decade, and I can tell you from experience: the difference between teams that maintain 99.99% uptime and teams that struggle at 99% almost always comes down to how well they understand what happens below the application layer. When your service is slow and Grafana shows nothing wrong, can you check whether TCP retransmits are eating your latency? When a deploy causes intermittent 502s, do you know whether it is a socket exhaustion problem or a DNS TTL issue? This kind of knowledge does not come from frameworks or cloud dashboards. It comes from books.
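To make the first of those questions concrete: on Linux the kernel keeps system-wide TCP counters in /proc/net/snmp, and comparing two snapshots gives a rough retransmit ratio. The following is a minimal sketch, not a monitoring tool. The function names are my own, and the ratio is only an approximation of what is happening on any single connection.

```python
from pathlib import Path
import time


def parse_tcp_counters(snmp_text: str) -> dict[str, int]:
    """Turn the two 'Tcp:' lines of /proc/net/snmp into a name -> value dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) != 2:
        raise ValueError("expected one Tcp header line and one Tcp value line")
    names, values = rows
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(before: dict[str, int], after: dict[str, int]) -> float:
    """Retransmitted segments per segment sent between two snapshots (rough)."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


def sample_live(interval_seconds: float = 5.0) -> float:
    """Linux only: take two snapshots of /proc/net/snmp and compare them."""
    path = Path("/proc/net/snmp")
    before = parse_tcp_counters(path.read_text())
    time.sleep(interval_seconds)
    after = parse_tcp_counters(path.read_text())
    return retransmit_ratio(before, after)
```

Calling sample_live() on a Linux host returns a fraction such as 0.02 for two percent. Whether that number is alarming depends on the path and the workload, which is exactly the judgment the books below help you build.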

Here is the reading list — both the industry-standard recommendations and the resources I personally used to build the foundation that eventually led me to create Oack.

The Google SRE canon

Google publishes three books on SRE practices, all available free at sre.google/books:

Site Reliability Engineering: How Google Runs Production Systems (2016) — The original SRE book. Covers error budgets, SLOs, toil reduction, release engineering, and on-call practices. If you read one book on this list, make it this one. It will reshape how you think about reliability as an engineering problem, not an ops burden.
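The error budget idea reduces to simple arithmetic, and it helps to see the numbers once. This is a small sketch with function names of my own choosing and an assumed 30-day window. The book discusses how to pick SLOs and windows in far more depth than this.

```python
def allowed_downtime_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of full outage that fit inside an availability SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no requests or no budget to spend")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"{slo:.2%} over 30 days: {allowed_downtime_minutes(slo):.1f} minutes")
```

Under these assumptions, 99% leaves a little over seven hours of downtime per 30 days, while 99.99% leaves a little over four minutes. That gap is why the two kinds of teams operate so differently.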

The Site Reliability Workbook (2018) — The practical companion. Where the first book explains the philosophy, this one shows how to implement it. SLO workshops, alerting strategies, incident response walkthroughs.

Building Secure and Reliable Systems (2020) — Where security meets reliability. Covers design patterns for systems that need to be both secure and available. Particularly relevant if you are building anything that handles user data or financial transactions.

Networking: where uptime lives or dies

Most outages that are hard to diagnose live in the network layer. The application returns 200 OK, the database is healthy, but users in a specific region experience 3-second page loads. Understanding TCP/IP is not optional for an SRE — it is the difference between “I don’t know why it’s slow” and “the congestion window is collapsing because of packet loss on the path to the Frankfurt PoP.”
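To see why a little packet loss hurts so much on a long path, the well-known approximation from Mathis et al. is a useful back-of-the-envelope tool: steady-state throughput of a Reno-style TCP flow is roughly (MSS / RTT) * (C / sqrt(p)), with C about 1.22. The sketch below assumes that model. Modern congestion control algorithms such as CUBIC or BBR behave differently, so treat the result as an order-of-magnitude estimate, not a prediction.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_seconds: float, loss_rate: float,
                          c: float = math.sqrt(3 / 2)) -> float:
    """Approximate upper bound on one TCP flow's throughput, in bits per second.

    Uses the Mathis et al. model for Reno-style congestion avoidance:
    throughput ~= (MSS / RTT) * (C / sqrt(loss_rate)).
    """
    if rtt_seconds <= 0 or not 0.0 < loss_rate <= 1.0:
        raise ValueError("need rtt_seconds > 0 and 0 < loss_rate <= 1")
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Illustrative inputs only: 1460-byte MSS, 100 ms RTT, varying loss.
    for loss in (0.0001, 0.001, 0.01):
        mbps = mathis_throughput_bps(1460, 0.100, loss) / 1e6
        print(f"loss {loss:.2%}: about {mbps:.1f} Mbit/s per flow")
```

The shape matters more than the exact figures: throughput falls with the square root of the loss rate and linearly with RTT, so the same loss that is invisible inside a datacenter can cripple a transcontinental connection.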

The TCP/IP Guide by Charles M. Kozierok — This is my personal recommendation for learning TCP/IP from scratch. Yes, the website looks like it was designed in 1997 and has not been updated since. The CSS predates CSS. I am fairly sure the hit counter at the bottom is real. But here is the thing: every time I open it, I get a warm wave of nostalgia for the internet of my youth — when websites were made by people who cared more about content than gradients. And the content here is genuinely excellent. Kozierok explains every protocol from Ethernet framing to TCP congestion control with clarity that most textbooks cannot match. If you can get past the aesthetic, you will learn more about TCP/IP from this site than from most university courses.

TCP/IP Illustrated, Volume 1 by W. Richard Stevens — The classic. Stevens had a rare gift: he could take something as dry as packet headers and make you want to keep reading. The illustrations are not decorative — they are pedagogical. Every diagram teaches you something. This book is the reason a generation of engineers actually understands what happens when you type curl and hit enter.

Systems programming: the books that changed me

If the networking books teach you what happens on the wire, these books teach you what happens inside the machine. Processes, file descriptors, signals, sockets, memory mapping — the building blocks that every production system is made of.

Advanced Programming in the UNIX Environment by W. Richard Stevens and Stephen A. Rago (3rd Edition) — I want to say something warm about this book because it deserves it. Stevens wrote the original edition, and after his passing in 1999, Rago carried the work forward with enormous care and respect. The result is a book that feels both timeless and practical. Every chapter is dense with knowledge that you will use for the rest of your career. File I/O, process control, signals, threads, IPC — every fundamental of UNIX systems programming, explained with precision and depth. If you write software that runs on Linux in production, you owe it to yourself to read this cover to cover at least once.

UNIX Network Programming, Volume 1 by W. Richard Stevens: the companion piece. Where Advanced Programming in the UNIX Environment teaches you how the OS works, this book teaches you how to talk to the network from your code. No, it does not cover modern Linux primitives like io_uring, but that is not the point. The book walks you through building a network application using every I/O model the OS provides, from the simplest blocking sockets to select, poll, non-blocking I/O, signal-driven I/O, and the event-driven patterns that led to epoll. By the time you finish, you will have a visceral understanding of why coroutines exist and what problem they solve, which, incidentally, is the real answer to that interview question about coroutines that trips up so many candidates. Stevens passed away far too young, but the clarity and generosity of his writing live on in every engineer who learned from these pages. Few authors have done more for the systems programming community.
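The jump from "one blocking socket per thread" to "one loop watching many sockets" is easier to feel in code. The book does this in C. The sketch below uses Python's standard selectors module instead (it picks epoll on Linux when available), with local socket pairs standing in for real client connections. The function name and the uppercase-echo behavior are purely illustrative.

```python
import selectors
import socket


def echo_upper_with_selector(messages: list[bytes]) -> list[bytes]:
    """Serve several connections from a single thread via I/O multiplexing."""
    sel = selectors.DefaultSelector()
    clients = []
    for msg in messages:
        client, server = socket.socketpair()   # stand-in for an accepted connection
        server.setblocking(False)              # the loop must never block on one peer
        sel.register(server, selectors.EVENT_READ)
        client.sendall(msg)
        clients.append(client)

    pending = len(messages)
    while pending:
        events = sel.select(timeout=1.0)       # sleep until some socket is readable
        if not events:
            raise TimeoutError("no socket became readable")
        for key, _mask in events:
            conn = key.fileobj
            data = conn.recv(4096)
            conn.sendall(data.upper())         # small reply, fits the socket buffer
            sel.unregister(conn)
            conn.close()
            pending -= 1

    replies = [client.recv(4096) for client in clients]
    for client in clients:
        client.close()
    sel.close()
    return replies
```

A coroutine runtime is, in essence, this loop plus bookkeeping that lets each connection's logic be written as if it were blocking. Once you have written the loop by hand, that interview question mostly answers itself.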

Operating systems

Operating Systems: Three Easy Pieces by Remzi and Andrea Arpaci-Dusseau — Free online, and genuinely enjoyable to read. Covers virtualization, concurrency, and persistence in a way that makes you understand why things work the way they do, not just how. If you never took an OS course or want a refresher, start here.

Performance analysis

Brendan Gregg’s perf page — Not a book, but an essential resource. Brendan Gregg (formerly Netflix, now Intel) is the person who popularized flame graphs and made Linux performance analysis accessible to mortals. His perf reference page is the single best starting point for understanding CPU profiling, tracing, and event counting on Linux. Bookmark it. You will come back to it every time you need to figure out why a process is burning CPU or why context switches are through the roof.

Gregg also wrote Systems Performance: Enterprise and the Cloud — if you want the full book treatment of performance methodology, this is it.

Algorithms and interview prep

If you are specifically preparing for a Google SRE interview, the recruiter email also recommends:

Introduction to Algorithms (CLRS) by Cormen, Leiserson, Rivest, Stein — The standard algorithms textbook. Heavy, thorough, and worth having on your shelf even if you only reference specific chapters.

Cracking the Coding Interview by Gayle Laakmann McDowell — For the coding interview portion specifically. The SRE interview at Google includes coding rounds alongside systems design and troubleshooting, so this is relevant.

Why this matters for uptime

You might be wondering: why is a monitoring company writing about systems programming books? Because monitoring is only as useful as the person reading the alerts.

When Oack tells you that TCP retransmits spiked to 12% on your API server in Singapore, that information is actionable only if you understand what retransmits mean and where to look next. When the probe waterfall shows 800ms in TLS handshake time, you need to know whether that is a certificate chain issue, a CDN misconfiguration, or just physics (your server is in Virginia and your user is in Tokyo).
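The "just physics" part can be estimated before you start blaming certificates. Here is a rough sketch under stated assumptions: light in fiber travels at about 200,000 km/s, the Virginia to Tokyo path is taken as a round 11,000 km (real cable routes are longer), and a full TLS 1.3 handshake costs one round trip where TLS 1.2 costs two. The function names are mine.

```python
SPEED_IN_FIBER_KM_PER_S = 200_000  # roughly two thirds of c; an approximation


def min_rtt_ms(path_km: float) -> float:
    """Lower bound on round-trip time imposed by propagation delay alone."""
    return 2 * path_km / SPEED_IN_FIBER_KM_PER_S * 1000


def tls_handshake_floor_ms(path_km: float, tls_round_trips: int = 1) -> float:
    """Lower bound on TLS handshake time: round trips times the minimum RTT.

    tls_round_trips: 1 for a full TLS 1.3 handshake, 2 for a full TLS 1.2 one.
    """
    return tls_round_trips * min_rtt_ms(path_km)


if __name__ == "__main__":
    distance_km = 11_000  # assumed round figure for Virginia to Tokyo
    print(f"minimum RTT: {min_rtt_ms(distance_km):.0f} ms")
    print(f"TLS 1.3 floor: {tls_handshake_floor_ms(distance_km, 1):.0f} ms")
    print(f"TLS 1.2 floor: {tls_handshake_floor_ms(distance_km, 2):.0f} ms")
```

The result is a floor, not an estimate: measured RTTs on that route are noticeably higher than the propagation minimum. Comparing the floor with the number in the waterfall tells you how much of the handshake time is distance and how much is left to explain.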

We built TCP-level telemetry, probe waterfalls, and Cloudflare enrichment into Oack because we believe monitoring should give you the data you need to diagnose, not just detect. But the data is only half the story. The other half is the knowledge to interpret it.

Read the books. Understand the stack. The uptime will follow.