A series of tech talks on building reliable software applications
Whether you're an SRE, a DevOps engineer, or just working on an application where reliability matters, join us for a series of talks to see how others are tackling problems.
Upcoming topics include incident management, measuring and communicating about reliability (including SLIs and SLOs), testing in production, automation, building a culture of reliability, and eliminating toil throughout your organization.
February 16, 2021 - 9am PT | 12pm ET
Like many other companies in the DevOps sphere, we realized early on that compliance can be a serious obstacle to the progress of our sales cycle. Having long-standing experience with security, but none at all with compliance, we set out to become SOC 2 compliant in our software development process.
We quickly learned that there was very little public documentation on how to become SOC 2 compliant. In this session, I will share how we built our SOC 2 procedures around agile software development and DevOps patterns such as CI/CD and GitOps. Although it typically takes about a year to achieve SOC 2 compliance, we managed to get certified in less than six months.
During this process, we came to an important conclusion, one which I hope you will arrive at as well by the end of this session. You will learn how agile processes and DevOps can address and outperform traditional methods for managing security and compliance. This talk will empower you to tailor your enterprise compliance needs to your desired software development process. In short, software-oriented organizations can have their cake and eat it too.
Users expect new features, but they also expect uninterrupted service. Zero downtime deployments are a way to achieve both of these goals. But what does "zero downtime" mean, and how can we deliver it? I'll cover a number of different strategies, such as red/black (blue/green) deployments, rolling deployments, and canaries, and whether they truly provide "zero downtime" deployments. I'll talk about how different kinds of infrastructure can help, including service discovery, proxies, and the kernel itself. Finally, I'll talk about the role of observability in zero downtime deployments: what to measure, what to alert on, and how to debug them.
The industry has defined it as good practice to have as few alerts as possible, by alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused.
Organizations with complex distributed systems that span dozens of teams can have a hard time following this practice without burning out the teams that own the client-facing services. A typical solution is to alert on every layer of the distributed system. This approach almost always leads to an excessive number of alerts and results in alert fatigue.
Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable root cause, paging the respective team instead of the alert owner.
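To make the idea concrete, here is a minimal sketch of one such heuristic. It is an illustration under stated assumptions, not the talk's actual implementation: the `Span` class, the `owners` mapping, and the function names are all hypothetical. The only thing taken from OpenTracing's semantic conventions is the boolean `error` tag. The heuristic starts at the errored root span (the service whose alert fired) and follows errored child spans down to the deepest one, then pages the team that owns that service, falling back to the alert owner.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span; real tracers expose richer objects."""
    span_id: str
    parent_id: Optional[str]
    service: str
    tags: Dict[str, object] = field(default_factory=dict)


def failed(span: Span) -> bool:
    # OpenTracing's semantic conventions mark failed spans with error=true.
    return span.tags.get("error") is True


def probable_root_cause(spans: List[Span]) -> Optional[Span]:
    """Follow errored child spans from the root; return the deepest one."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    errored_roots = [s for s in children.get(None, []) if failed(s)]
    if not errored_roots:
        return None

    current = errored_roots[0]
    while True:
        errored_children = [c for c in children.get(current.span_id, []) if failed(c)]
        if not errored_children:
            return current
        # Assumption: take the first errored child; a real handler would
        # need a policy for several failing dependencies.
        current = errored_children[0]


def team_to_page(spans: List[Span], owners: Dict[str, str], alert_owner: str) -> str:
    """Pick the team closest to the problem, else the alert owner."""
    culprit = probable_root_cause(spans)
    if culprit is None:
        return alert_owner
    return owners.get(culprit.service, alert_owner)
```

With a trace where `frontend` calls `checkout`, which calls a failing `payments-db`, this sketch would page the owner of `payments-db` rather than the frontend team whose alert fired.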
Site Reliability Engineer
It’s 2020: There is a plethora of data available about measuring SLIs and setting SLO targets. But now that you have this data, what are you actually supposed to do with it? The classic example, “Ship features when you have error budget; focus on reliability when you don’t,” is antiquated and too simple, and it ignores all of the amazing discussions and decisions you can have with your SLO data. Let’s talk about how you can use SLOs to actually make people happier, from your customers to your engineers to your business.
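For readers new to the term, the classic rule quoted above rests on simple error budget arithmetic, sketched below. This is only an illustration of that baseline, not material from the talk; the function name and the example numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is a fraction such as 0.999 (99.9%). The result is 1.0 when
    nothing has failed, 0.0 when the budget is exactly spent, and negative
    when the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # The budget is the number of bad events the SLO tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Hypothetical numbers: a 99.9% SLO over 1,000,000 requests tolerates
# about 1,000 failures, so 250 failures leave roughly 75% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

The classic rule turns this single number into a ship-or-freeze decision; the abstract's point is that the same data can inform much richer conversations.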
On-Call Me Maybe