Site Reliability Engineering - a discipline that applies software engineering practices to IT operations, focusing on system reliability, scalability, and incident response.
SRE teams naturally generate a lot of video content — recorded incident retrospectives, on-call handoff walkthroughs, postmortem reviews, and reliability training sessions. These recordings capture hard-won operational knowledge: why a particular alert threshold was chosen, how a cascading failure was diagnosed, or what runbook steps actually worked under pressure.
The problem is that video is a poor format for the moment SRE knowledge matters most — during an active incident at 2am. When your on-call engineer needs to recall how a similar database degradation was handled six months ago, scrubbing through a 45-minute retrospective recording is not a realistic option. Critical context stays locked in files that no one has time to watch.
Converting those recordings into structured, searchable documentation changes how your team retains and reuses SRE knowledge. A postmortem walkthrough becomes a queryable incident record. A reliability training session becomes onboarding material a new engineer can actually reference mid-task. Your team stops re-learning the same lessons because the knowledge becomes findable, not just stored.
If your SRE practice relies on recorded meetings and training videos that rarely get revisited, see how video-to-documentation workflows can help surface that knowledge when it counts.
A fintech team launching a new payment API has no historical data to set realistic SLOs. Engineering wants 99.999% uptime, but the ops team has no idea what the system can actually sustain, leading to either over-promising to customers or burning error budgets within days of launch.
SRE practices introduce a structured SLO definition process using SLIs derived from real user journeys — latency at the 99th percentile, error rate per transaction type, and checkout success rate — giving teams a data-driven baseline before committing to SLAs.
1. Instrument the payment API with RED metrics (Rate, Errors, Duration) using Prometheus and export dashboards to Grafana for the first 30 days in shadow mode.
2. Identify the top 3 critical user journeys (checkout initiation, payment authorization, refund processing) and define SLIs for each based on observed p99 latency and error rates.
3. Set initial SLOs at 10% below the observed best-case performance (e.g., if p99 latency is 180ms, set SLO at 200ms) and calculate a 30-day error budget.
4. Present SLO thresholds to product and engineering leadership with a burn rate alert policy in PagerDuty, ensuring all stakeholders agree before external SLA commitments are made.
The team launches with SLOs grounded in real system behavior, avoiding a 99.999% SLA that the infrastructure cannot support, and establishes a 30-day error budget of 26 minutes that drives prioritization decisions between reliability work and new feature development.
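The derivation in the steps above can be sketched in a few lines of Python. The numbers mirror the case study (180ms observed p99, a ~26-minute budget over 30 days); the 99.94% availability target is an assumption chosen so the budget works out to roughly 26 minutes, and the function names are illustrative, not from any particular library.

```python
def latency_slo_from_observed(observed_p99_ms: float, headroom: float = 0.10) -> float:
    """Set the latency SLO threshold ~10% looser than the observed best-case p99."""
    return observed_p99_ms * (1 + headroom)

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window under the SLO."""
    return (1 - slo_target) * window_days * 24 * 60

# 180ms observed p99 -> ~198ms, which the team rounds up to a 200ms SLO
slo_latency_ms = latency_slo_from_observed(180)

# A 99.94% availability SLO over 30 days yields roughly a 26-minute budget
budget = error_budget_minutes(0.9994)

print(round(slo_latency_ms), round(budget))  # 198 26
```

The point of encoding this as code rather than a spreadsheet is that the same functions can be reused each quarter as observed baselines shift.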
An on-call SRE team at a SaaS company receives over 200 PagerDuty alerts per week, but fewer than 15% require human action. Engineers are burning out, ignoring alerts, and missing real incidents buried in noise — resulting in a 45-minute average time-to-detect for genuine outages.
SRE alert philosophy mandates that every alert must be actionable, urgent, and require human judgment. By auditing alert signal-to-noise ratio and tying alerts directly to SLO burn rates rather than raw metrics, teams eliminate vanity alerts and focus on what actually threatens user experience.
["Export 90 days of PagerDuty alert history and categorize each alert as 'actionable' (required a human fix), 'auto-resolved' (system healed itself), or 'noise' (no action taken). Identify that CPU > 80% alerts account for 60% of noise.", 'Replace threshold-based CPU and memory alerts with multi-window burn rate alerts in Prometheus Alertmanager — firing only when the error budget is burning at 2x the sustainable rate over both a 1-hour and 6-hour window.', 'Create a runbook in Confluence for every remaining alert, linking the alert name to a documented response procedure. Retire any alert that cannot be given a specific runbook within 2 weeks.', 'Implement a weekly on-call review meeting where the team votes to silence, tune, or escalate any alert that fired more than 5 times without requiring action in the past 7 days.']
Alert volume drops from 200+ to under 40 per week, with 90%+ of remaining alerts requiring genuine human intervention. Mean time-to-detect for SEV-1 incidents decreases from 45 minutes to 8 minutes, and on-call engineer satisfaction scores improve significantly in the next quarterly survey.
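The multi-window burn rate logic above can be sketched as plain Python to make the paging condition concrete. This is a simplified model of what a Prometheus Alertmanager rule would evaluate, with illustrative request counts; the function names and window tuples are assumptions for this sketch.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the rate the SLO allows (1 - SLO).

    A burn rate of 1.0 means the budget is being consumed exactly at the
    sustainable pace; 2.0 means it will be exhausted in half the SLO window.
    """
    if total == 0:
        return 0.0
    return (errors / total) / (1 - slo_target)

def should_page(win_1h, win_6h, slo_target: float = 0.999, factor: float = 2.0) -> bool:
    """Fire only when the budget burns faster than `factor` x sustainable in BOTH windows.

    Requiring both windows filters out short spikes (long window stays calm)
    while still paging quickly on sustained burns (short window reacts fast).
    """
    return (burn_rate(*win_1h, slo_target) > factor
            and burn_rate(*win_6h, slo_target) > factor)

# (errors, total) request counts per window — illustrative numbers
print(should_page((30, 10_000), (150, 60_000)))  # True: 3x and 2.5x burn
print(should_page((30, 10_000), (60, 60_000)))   # False: 6h window is at 1x
```

The two-window check is why burn-rate alerts page so much less often than raw CPU thresholds: a transient error spike that self-heals never satisfies the 6-hour condition.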
After a 3-hour database outage caused by a misconfigured connection pool setting applied manually in production, the post-incident review devolves into blame directed at the engineer who made the change. No systemic fixes are identified, the same class of error recurs 6 weeks later, and engineers start hiding mistakes rather than reporting them.
SRE's blameless post-mortem culture reframes incidents as system failures rather than individual failures, using structured templates to extract timeline facts, contributing factors, and systemic action items — making it psychologically safe to surface errors and preventing recurrence through process improvement rather than punishment.
1. Adopt a standardized post-mortem template in Google Docs or Notion that includes: incident timeline (with UTC timestamps), contributing factors (NOT root cause — SRE avoids single root cause thinking), what went well, what went poorly, and action items with owners and due dates.
2. Establish a post-mortem facilitator role (rotated among senior SREs) who is explicitly not the incident commander, ensuring the person running the meeting has no personal stake in the outcome and can redirect blame language to system observations.
3. Within 48 hours of service restoration, hold a 60-minute post-mortem meeting with all involved engineers, their managers excluded by default, and record the session for async review by the broader SRE team.
4. Track all action items in a shared SRE backlog in Jira with a 'Post-Mortem' label, review completion status in weekly SRE syncs, and publish a monthly digest of post-mortem learnings to the entire engineering organization.
The team identifies that the real systemic issue was the absence of a change management process for production database configuration, not the individual engineer's mistake. A Terraform-managed configuration pipeline is implemented, eliminating the entire class of manual config errors. Incident reporting rates increase by 40% as engineers feel safe surfacing near-misses.
An SRE team supporting a microservices platform spends 60% of their on-call rotation manually deploying hotfixes, restarting crashed pods, and rotating TLS certificates — repetitive, automatable work that leaves no time for proactive reliability improvements and violates SRE's 50% toil budget rule.
SRE defines toil as manual, repetitive, automatable work that scales linearly with service growth and has no enduring value. By measuring toil explicitly and applying the 50% engineering time rule, SRE teams create a mandate to automate recurring operational tasks, freeing capacity for reliability engineering work.
["For 4 weeks, have every on-call SRE log each task in a shared spreadsheet with: task name, time spent, frequency per month, and a binary 'is this automatable?' flag. Calculate the total toil percentage against total on-call hours.", 'Rank toil items by (time_per_occurrence × monthly_frequency) to identify the highest-impact automation targets. In this case, manual pod restarts (2 hrs/day) and TLS certificate rotation (4 hrs/month) top the list.', "Build a Kubernetes operator using the Operator SDK to automatically detect and restart crash-looping pods based on defined health criteria, and implement cert-manager for automated TLS certificate lifecycle management integrated with Let's Encrypt.", 'Set a team OKR: reduce toil below 30% of on-call hours within 2 quarters, measured by the same weekly logging process, and make toil reduction a standing agenda item in sprint planning to ensure automation work is prioritized alongside feature work.']
Manual pod restarts are eliminated entirely through automation, and TLS rotation toil drops from 4 hours to 15 minutes of review per month. On-call toil falls from 60% to 28% of total hours, freeing SREs to implement chaos engineering experiments and capacity planning models that proactively prevent two major outages in the following quarter.
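The ranking step from the toil audit can be sketched as a few lines of Python. The task names and hours mirror the case study; the extra rows and the scoring function are illustrative assumptions, not prescribed by SRE practice.

```python
toil_log = [
    # (task, hours per occurrence, occurrences per month, automatable?)
    ("manual pod restarts", 2.0, 30, True),      # ~2 hrs/day
    ("TLS certificate rotation", 4.0, 1, True),  # 4 hrs/month
    ("hotfix deploys", 1.5, 8, True),            # illustrative extra row
    ("capacity review", 3.0, 1, False),          # engineering work, not toil
]

def toil_score(hours: float, freq: int) -> float:
    """Monthly hours sunk into a task: time_per_occurrence x monthly_frequency."""
    return hours * freq

# Rank only the automatable items — these are the automation targets
targets = sorted(
    (t for t in toil_log if t[3]),
    key=lambda t: toil_score(t[1], t[2]),
    reverse=True,
)
for task, hours, freq, _ in targets:
    print(f"{task}: {toil_score(hours, freq):.0f} hrs/month")
```

Sorting by total monthly hours rather than per-incident pain is the key design choice: a 5-minute task done 30 times a day often beats a painful quarterly task as an automation target.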
SLOs (Service Level Objectives) are internal reliability targets derived from measured SLIs (Service Level Indicators) that reflect real user experience. Committing to customer-facing SLAs without internal SLOs leaves you with no early warning before you breach contractual obligations. Set SLOs stricter than your SLAs but within what the system can demonstrably sustain: the gap between SLO and SLA acts as a buffer that absorbs normal variance before an SLA violation, while the unreliability the SLO itself permits becomes your error budget.
An error budget is the acceptable amount of unreliability calculated from your SLO — a 99.9% SLO gives you 43.8 minutes of downtime per month as your budget. When the budget is healthy, engineering teams can deploy frequently and take calculated risks. When the budget is exhausted or burning fast, reliability work must take priority over new features. This transforms reliability from an abstract goal into a concrete, shared metric that both SRE and product teams own.
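The arithmetic behind the 43.8-minute figure, and the feature-freeze decision it drives, fit in a few lines. This is a sketch of a hypothetical policy gate, not a standard API; the function names and the simple less-than check are assumptions.

```python
def monthly_budget_minutes(slo: float, month_minutes: float = 43_830) -> float:
    """Allowed downtime per month under the SLO.

    43,830 = average minutes per month (365.25 days / 12 months), which is
    why a 99.9% SLO yields ~43.8 minutes rather than a flat 30-day figure.
    """
    return (1 - slo) * month_minutes

def can_ship_features(downtime_so_far_min: float, slo: float = 0.999) -> bool:
    """Hypothetical gate: keep deploying while error budget remains."""
    return downtime_so_far_min < monthly_budget_minutes(slo)

print(round(monthly_budget_minutes(0.999), 1))  # 43.8
print(can_ship_features(20.0))  # True — budget still healthy, deploy freely
print(can_ship_features(50.0))  # False — budget exhausted, reliability work first
```

In practice teams gate on burn rate and remaining budget rather than a hard boolean, but the shared-metric idea is exactly this: one number that both SRE and product read the same way.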
An alert without a runbook forces on-call engineers to diagnose under pressure with no institutional knowledge to guide them, increasing MTTR and the risk of making the incident worse. Every PagerDuty or Opsgenie alert should include a direct URL to a runbook that describes what the alert means, what the blast radius is, and the exact commands or steps to mitigate it. Runbooks should be living documents updated after every incident where the documented procedure was insufficient.
SRE's foundational principle is that toil — manual, repetitive, automatable operational work — should never exceed 50% of an SRE's working time. Without explicit measurement, toil silently expands to fill all available capacity, leaving no time for the engineering work that reduces future toil. Teams must track toil hours explicitly, not estimate them, and treat toil reduction as a first-class engineering deliverable with sprint tickets and OKRs.
Post-mortems conducted more than 48 hours after an incident suffer from memory decay, reconstructed timelines, and reduced emotional urgency that leads to shallow analysis and weak action items. The 48-hour window ensures that contributing factors are fresh, timeline data from logs and monitoring is still easily accessible, and the team's motivation to prevent recurrence is highest. Blameless framing is not about excusing poor decisions — it is about understanding the systemic conditions that made those decisions likely.