99 Percent Visible: DevOps Reliability

A series of tech talks on building reliable software applications

Now available for on-demand viewing

The Anatomy of Three Incidents

The best response to a system outage is not "What did you do?", but "What did we learn?" This session will walk through three system-wide outages at Google, at Stitch Fix, and at WeWork. In each case, many things went right and a few went wrong, and after the blameless postmortems, we ended up learning a lot and making substantial improvements to our systems. Looking back, each incident was a seminal event that changed the focus and trajectory of engineering at its organization. You will enjoy a few war stories from the trenches, and leave with a set of actionable suggestions for dealing with customers, engineering teams, and upper management.

Presented by:

Randy Shoup
VP of Engineering and Chief Architect, eBay

The DevOps Toolkit: Catalog, Patterns and Blueprints

Author Viktor Farcic has cataloged the fundamentals of the most essential DevOps tools and practices, such as GitOps, Kubernetes, Containers as a Service (CaaS), and CI/CD.

See Also

Kat Cosgrove
Developer Advocate, JFrog

Learning to Learn by Teaching

I'm a Developer Advocate. That means that ultimately, my job is to teach people things. Over the last year and some change, I've given dozens of talks and workshops about DevOps, the majority of them educational in nature. The way I approach developer education has changed pretty radically over that period of time – more difficult for me in some ways, but better for my audience in every way. The assumptions I make are different now, and the way I communicate has changed, too.

They say you don't truly understand something until you can teach it to someone else, so in the spirit of that, let me teach you what I've learned in my time teaching developers. Everyone learns differently, but everyone also has something they can teach you.

View on-demand ›

Viktor Farcic

Blog: DevOps Hacks "The DevOps Catalog, Patterns, And Blueprints"

Viktor Farcic is a Principal DevOps Architect at Codefresh, a member of the Google Developer Experts and Docker Captains groups, and a published author.

Read more ›

Cindy Sridharan

Achieving Zero Downtime

Users expect new features – but they also expect uninterrupted service. Zero downtime deployments are a way to achieve both of these goals. But what does "zero downtime" mean and how can we deliver it? I'll cover a number of different strategies like red/black (blue/green) deployments, rolling deployments, and canaries and whether these truly provide "zero downtime" deployments. I'll talk about how different kinds of infrastructure can help, including service discovery, proxies, and the kernel itself. Finally, I'll also talk about the role of observability in zero downtime deployments – what to measure, what to alert on, and how to debug them.

View on-demand ›

Luis Mineiro

Are We All on the Same Page? Let's Fix That.

The industry has defined it as good practice to have as few alerts as possible, by alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused.

Organizations with complex distributed systems that span dozens of teams can have a hard time following this practice without burning out the teams that own the client-facing services. A typical solution is to alert on every layer of the distributed system. This approach almost always leads to an excessive number of alerts and results in alert fatigue.

Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable root cause, paging the respective team instead of the alert owner.
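The idea of paging the team closest to the problem can be illustrated with a minimal Python sketch. It is not the Adaptive Paging implementation: the Span fields, the team attribute, and the single "follow the failing child" heuristic are assumptions made for illustration. Only the boolean error flag mirrors the error tag from OpenTracing's semantic conventions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    team: str                 # hypothetical ownership attribute
    error: bool               # mirrors OpenTracing's boolean `error` tag


def team_to_page(spans: List[Span], alert_owner: str) -> str:
    """Return the team owning the deepest failing span on the error path.

    Assumed heuristic: start at the root span and keep descending into
    the first child that is also marked as an error. Fall back to the
    alert owner when the trace carries no error at the root.
    """
    children: Dict[str, List[Span]] = {}
    root: Optional[Span] = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)

    if root is None or not root.error:
        return alert_owner

    current = root
    while True:
        failing = [c for c in children.get(current.span_id, []) if c.error]
        if not failing:
            return current.team
        current = failing[0]


# Hypothetical trace: checkout -> payments -> storage all fail, catalog is healthy.
trace = [
    Span("1", None, "checkout", True),
    Span("2", "1", "payments", True),
    Span("3", "2", "storage", True),
    Span("4", "1", "catalog", False),
]
print(team_to_page(trace, alert_owner="checkout"))
```

In this made-up trace the alert fires on the checkout service, but the sketch pages the storage team, because its span is the deepest one marked as failing.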

View on-demand ›

Alex Hidalgo
Site Reliability Engineer

I Have an SLO. Now What?

It’s 2020: There is a plethora of material available about measuring SLIs and setting SLO targets. But now that you have this data, what are you actually supposed to do with it? The classic example of “Ship features when you have error budget; focus on reliability when you don’t” is antiquated, too simple, and ignores all of the amazing discussions and decisions you can have with your SLO data. Let’s talk about how you can use SLOs to actually make people happier, from your customers, to your engineers, to your business.
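For readers new to the term, the error budget mentioned above is simple arithmetic on an SLO target. The Python sketch below assumes a request-based SLI (good requests divided by total requests); the function name and the sample numbers are hypothetical and are not taken from the talk.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    slo_target is a ratio such as 0.999. The budget is the number of
    failures the target tolerates; a negative result means the budget
    has been overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


# Hypothetical window: 1,000,000 requests against a 99.9% target allows
# about 1,000 failures, so 400 failures leaves roughly 60% of the budget.
print(round(error_budget_remaining(0.999, 1_000_000, 400), 3))
```

A number like this is only the starting point; the abstract's argument is that the more interesting work lies in the decisions made with it.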

View on-demand ›

On-Call Me Maybe

Distributed Tracing in Practice: Author Round Table

The authors of Distributed Tracing in Practice (O'Reilly, 2020) come together for a chat on distributed tracing and their experiences writing the book.

View on-demand ›

Software applications are an increasingly important part of business, our economy, and society in general.

At the same time, the complexity that stems from microservices, multi-cloud, and third-party services has left these systems all too prone to failure.


Whether you're an SRE, a DevOps engineer, or just working on an application where reliability matters, join us for a series of talks to see how others are tackling these problems.

Upcoming topics include incident management, measuring and communicating about reliability (including SLIs and SLOs), testing in production, automation, building a culture of reliability, and eliminating toil throughout your organization.

© 2021 Lightstep, Inc.