SRE

Master this essential reliability engineering concept

Quick Definition

Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to IT operations, focusing on system reliability, scalability, and incident response.

How SRE Works

graph TD
    A[Production Incident Detected] --> B{Severity Level?}
    B -->|SEV-1 Critical| C[Page On-Call SRE]
    B -->|SEV-2 High| D[Slack Alert to SRE Team]
    C --> E[Incident Commander Assigned]
    D --> E
    E --> F[Mitigation & Rollback]
    F --> G[Service Restored]
    G --> H[Blameless Post-Mortem]
    H --> I[Action Items Created]
    I --> J[Error Budget Reviewed]
    J --> K{Budget Remaining?}
    K -->|Yes| L[Resume Feature Releases]
    K -->|No| M[Freeze Releases - Focus on Reliability]
    M --> N[SLO Compliance Restored]
    L --> N

Understanding SRE

Beyond the one-line definition, SRE is best understood through its core mechanisms: measurable reliability targets (SLOs) backed by user-facing indicators (SLIs), error budgets that decide when to ship features and when to harden systems, a hard cap on manual toil, and blameless post-mortems that turn incidents into systemic fixes rather than individual blame.

Key Features

  • Service Level Objectives (SLOs) derived from measured, user-facing SLIs
  • Error budgets that balance feature velocity against reliability work
  • Blameless post-mortems that produce systemic fixes, not individual blame
  • Explicit toil tracking, with a 50% cap enforced through automation

Benefits for Documentation Teams

  • Runbooks linked to every alert shorten diagnosis under pressure
  • Standardized post-mortem templates keep incident records consistent and searchable
  • Documented SLO and error budget policies give SRE and product teams a shared decision framework
  • Captured incident learnings stop teams from re-learning the same lessons

Making SRE Knowledge Searchable: From Incident Recordings to Living Documentation

SRE teams naturally generate a lot of video content — recorded incident retrospectives, on-call handoff walkthroughs, postmortem reviews, and reliability training sessions. These recordings capture hard-won operational knowledge: why a particular alert threshold was chosen, how a cascading failure was diagnosed, or what runbook steps actually worked under pressure.

The problem is that video is a poor format for the moment SRE knowledge matters most — during an active incident at 2am. When your on-call engineer needs to recall how a similar database degradation was handled six months ago, scrubbing through a 45-minute retrospective recording is not a realistic option. Critical context stays locked in files that no one has time to watch.

Converting those recordings into structured, searchable documentation changes how your team retains and reuses SRE knowledge. A postmortem walkthrough becomes a queryable incident record. A reliability training session becomes onboarding material a new engineer can actually reference mid-task. Your team stops re-learning the same lessons because the knowledge becomes findable, not just stored.

If your SRE practice relies on recorded meetings and training videos that rarely get revisited, see how video-to-documentation workflows can help surface that knowledge when it counts.

Real-World Documentation Use Cases

Defining SLOs for a Payment Processing API with No Baseline Metrics

Problem

A fintech team launching a new payment API has no historical data to set realistic SLOs. Engineering wants 99.999% uptime, but the ops team has no idea what the system can actually sustain, leading to either over-promising to customers or burning error budgets within days of launch.

Solution

SRE practices introduce a structured SLO definition process using SLIs derived from real user journeys — latency at the 99th percentile, error rate per transaction type, and checkout success rate — giving teams a data-driven baseline before committing to SLAs.

Implementation

  1. Instrument the payment API with RED metrics (Rate, Errors, Duration) using Prometheus and export dashboards to Grafana for the first 30 days in shadow mode.
  2. Identify the top 3 critical user journeys (checkout initiation, payment authorization, refund processing) and define SLIs for each based on observed p99 latency and error rates.
  3. Set initial SLOs with roughly 10% headroom beyond the observed best-case performance (e.g., if p99 latency is 180ms, set the SLO at 200ms) and calculate a 30-day error budget.
  4. Present SLO thresholds to product and engineering leadership with a burn rate alert policy in PagerDuty, ensuring all stakeholders agree before external SLA commitments are made.
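The arithmetic behind steps 3 and 4 can be sketched in Python. The round-up-to-10ms rule and the fixed 30-day window here are illustrative assumptions, not part of any standard:

```python
import math

def latency_slo_ms(observed_p99_ms: float, headroom: float = 0.10) -> int:
    """Set the latency SLO ~10% looser than the observed p99, rounded up to 10 ms."""
    return math.ceil(observed_p99_ms * (1 + headroom) / 10) * 10

def error_budget_minutes(availability_slo: float, window_days: int = 30) -> float:
    """Downtime allowed inside the SLO window, in minutes."""
    return window_days * 24 * 60 * (1 - availability_slo)

print(latency_slo_ms(180))                    # 180 ms observed p99 -> 200 ms SLO
print(round(error_budget_minutes(0.999), 1))  # 99.9% over 30 days -> 43.2 minutes
```

Keeping the headroom explicit as a parameter makes it easy to revisit the buffer once a real month of shadow-mode data is in.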

Expected Outcome

The team launches with SLOs grounded in real system behavior, avoiding a 99.999% SLA that the infrastructure cannot support, and establishes a 30-day error budget of 26 minutes that drives prioritization decisions between reliability work and new feature development.

Reducing Alert Fatigue for an On-Call SRE Rotation Receiving 200+ Alerts Per Week

Problem

An on-call SRE team at a SaaS company receives over 200 PagerDuty alerts per week, but fewer than 15% require human action. Engineers are burning out, ignoring alerts, and missing real incidents buried in noise — resulting in a 45-minute average time-to-detect for genuine outages.

Solution

SRE alert philosophy mandates that every alert must be actionable, urgent, and require human judgment. By auditing alert signal-to-noise ratio and tying alerts directly to SLO burn rates rather than raw metrics, teams eliminate vanity alerts and focus on what actually threatens user experience.

Implementation

  1. Export 90 days of PagerDuty alert history and categorize each alert as 'actionable' (required a human fix), 'auto-resolved' (system healed itself), or 'noise' (no action taken). Identify that CPU > 80% alerts account for 60% of noise.
  2. Replace threshold-based CPU and memory alerts with multi-window burn rate alerts in Prometheus Alertmanager — firing only when the error budget is burning at 2x the sustainable rate over both a 1-hour and 6-hour window.
  3. Create a runbook in Confluence for every remaining alert, linking the alert name to a documented response procedure. Retire any alert that cannot be given a specific runbook within 2 weeks.
  4. Implement a weekly on-call review meeting where the team votes to silence, tune, or escalate any alert that fired more than 5 times without requiring action in the past 7 days.

Expected Outcome

Alert volume drops from 200+ to under 40 per week, with 90%+ of remaining alerts requiring genuine human intervention. Mean time-to-detect for SEV-1 incidents decreases from 45 minutes to 8 minutes, and on-call engineer satisfaction scores improve significantly in the next quarterly survey.

Standardizing Blameless Post-Mortems After a Database Outage Caused by a Manual Config Change

Problem

After a 3-hour database outage caused by a misconfigured connection pool setting applied manually in production, the post-incident review devolves into blame directed at the engineer who made the change. No systemic fixes are identified, the same class of error recurs 6 weeks later, and engineers start hiding mistakes rather than reporting them.

Solution

SRE's blameless post-mortem culture reframes incidents as system failures rather than individual failures, using structured templates to extract timeline facts, contributing factors, and systemic action items — making it psychologically safe to surface errors and preventing recurrence through process improvement rather than punishment.

Implementation

  1. Adopt a standardized post-mortem template in Google Docs or Notion that includes: incident timeline (with UTC timestamps), contributing factors (NOT root cause — SRE avoids single root cause thinking), what went well, what went poorly, and action items with owners and due dates.
  2. Establish a post-mortem facilitator role (rotated among senior SREs) who is explicitly not the incident commander, ensuring the person running the meeting has no personal stake in the outcome and can redirect blame language to system observations.
  3. Within 48 hours of service restoration, hold a 60-minute post-mortem meeting with all involved engineers, their managers excluded by default, and record the session for async review by the broader SRE team.
  4. Track all action items in a shared SRE backlog in Jira with a 'Post-Mortem' label, review completion status in weekly SRE syncs, and publish a monthly digest of post-mortem learnings to the entire engineering organization.

Expected Outcome

The team identifies that the real systemic issue was the absence of a change management process for production database configuration, not the individual engineer's mistake. A Terraform-managed configuration pipeline is implemented, eliminating the entire class of manual config errors. Incident reporting rates increase by 40% as engineers feel safe surfacing near-misses.

Automating Toil Elimination for a Team Spending 60% of On-Call Hours on Manual Deployments

Problem

An SRE team supporting a microservices platform spends 60% of their on-call rotation manually deploying hotfixes, restarting crashed pods, and rotating TLS certificates — repetitive, automatable work that leaves no time for proactive reliability improvements and violates SRE's 50% toil budget rule.

Solution

SRE defines toil as manual, repetitive, automatable work that scales linearly with service growth and has no enduring value. By measuring toil explicitly and applying the 50% engineering time rule, SRE teams create a mandate to automate recurring operational tasks, freeing capacity for reliability engineering work.

Implementation

  1. For 4 weeks, have every on-call SRE log each task in a shared spreadsheet with: task name, time spent, frequency per month, and a binary 'is this automatable?' flag. Calculate the total toil percentage against total on-call hours.
  2. Rank toil items by (time_per_occurrence × monthly_frequency) to identify the highest-impact automation targets. In this case, manual pod restarts (2 hrs/day) and TLS certificate rotation (4 hrs/month) top the list.
  3. Build a Kubernetes operator using the Operator SDK to automatically detect and restart crash-looping pods based on defined health criteria, and implement cert-manager for automated TLS certificate lifecycle management integrated with Let's Encrypt.
  4. Set a team OKR: reduce toil below 30% of on-call hours within 2 quarters, measured by the same weekly logging process, and make toil reduction a standing agenda item in sprint planning to ensure automation work is prioritized alongside feature work.
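Step 2's ranking is simple enough to script. The field names in `toil_log` are hypothetical, chosen to match the spreadsheet columns described in step 1:

```python
def rank_toil(items: list[dict]) -> list[dict]:
    """Rank toil entries by monthly cost: hours per occurrence * occurrences per month."""
    return sorted(items,
                  key=lambda i: i["time_hrs"] * i["per_month"],
                  reverse=True)

toil_log = [
    {"task": "manual pod restarts", "time_hrs": 2.0, "per_month": 30},  # ~daily
    {"task": "TLS cert rotation",   "time_hrs": 4.0, "per_month": 1},
    {"task": "hotfix deploys",      "time_hrs": 0.5, "per_month": 20},
]

for item in rank_toil(toil_log):
    print(item["task"], item["time_hrs"] * item["per_month"], "hrs/month")
```

Note that a low-drama daily task (pod restarts, 60 hrs/month) outranks the more visible monthly one (TLS rotation, 4 hrs/month), which is exactly why measuring beats guessing.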

Expected Outcome

Manual pod restarts are eliminated entirely through automation, and TLS rotation toil drops from 4 hours to 15 minutes of review per month. On-call toil falls from 60% to 28% of total hours, freeing SREs to implement chaos engineering experiments and capacity planning models that proactively prevent two major outages in the following quarter.

Best Practices

Define SLOs Before Committing to External SLAs

SLOs (Service Level Objectives) are internal reliability targets derived from measured SLIs (Service Level Indicators) that represent real user experience. Committing to customer-facing SLAs without internal SLOs means you have no early warning system before you breach contractual obligations. SLOs should be set slightly more conservative than what the system can sustain, creating a buffer — the error budget — that absorbs normal variance without triggering SLA violations.

✓ Do: Instrument user-facing endpoints with RED metrics first, observe real p50/p95/p99 latency and error rates for 30 days, then set SLOs at a threshold achievable 99%+ of the time based on observed data.
✗ Don't: Let product or sales teams dictate SLO targets based on competitive positioning (e.g., '99.99% because our competitor promises it') without validating that your infrastructure architecture can actually sustain that reliability level.
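A minimal sketch of the percentile math behind the 'Do' above, using the nearest-rank method (one of several common percentile definitions):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest observed value >= p% of observations."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# e.g. 100 observed request latencies in ms: 50, 51, ..., 149
latencies = [float(ms) for ms in range(50, 150)]
print(percentile(latencies, 99))  # -> 148.0
```

A latency threshold set at or above this p99 value is, by construction, one the system already meets 99% of the time in the observation window.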

Use Error Budgets to Arbitrate Feature Velocity vs. Reliability Investment

An error budget is the acceptable amount of unreliability calculated from your SLO — a 99.9% SLO gives you 43.8 minutes of downtime per month as your budget. When the budget is healthy, engineering teams can deploy frequently and take calculated risks. When the budget is exhausted or burning fast, reliability work must take priority over new features. This transforms reliability from an abstract goal into a concrete, shared metric that both SRE and product teams own.

✓ Do: Publish error budget burn rate dashboards visible to both SRE and product engineering teams, and establish a written policy that feature releases are frozen when the monthly error budget is more than 50% consumed in the first two weeks.
✗ Don't: Treat error budgets as a metric only SREs care about. If product managers cannot explain what the current error budget status means for their release schedule, the error budget policy will be ignored under delivery pressure.
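The budget arithmetic and the freeze policy above reduce to a small release gate. The half-window rule here is a simplified stand-in for a full burn-rate policy, shown for illustration:

```python
def release_allowed(budget_minutes: float, downtime_minutes: float,
                    day_of_window: int, window_days: int = 30) -> bool:
    """Gate feature releases on error budget health.
    Freeze when more than half the budget is consumed in the first half
    of the window, or whenever the budget is fully exhausted."""
    consumed = downtime_minutes / budget_minutes
    if day_of_window <= window_days // 2 and consumed > 0.5:
        return False  # burning too fast this early: freeze releases
    return consumed < 1.0  # an exhausted budget freezes releases regardless

# A 99.9% SLO allows ~43.8 minutes of downtime per calendar month
print(release_allowed(43.8, 25.0, day_of_window=10))  # -> False
print(release_allowed(43.8, 10.0, day_of_window=10))  # -> True
```

Publishing the inputs to this function on a shared dashboard is what turns the policy from an SRE opinion into a decision both teams can see coming.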

Keep Every On-Call Alert Linked to a Runbook with a Specific Remediation Action

An alert without a runbook forces on-call engineers to diagnose under pressure with no institutional knowledge to guide them, increasing MTTR and the risk of making the incident worse. Every PagerDuty or Opsgenie alert should include a direct URL to a runbook that describes what the alert means, what the blast radius is, and the exact commands or steps to mitigate it. Runbooks should be living documents updated after every incident where the documented procedure was insufficient.

✓ Do: Embed the runbook URL directly in the alert annotation field so it appears in the PagerDuty notification itself. After each incident, add a 'post-incident update' section to the runbook documenting what the on-call engineer actually did versus what the runbook prescribed.
✗ Don't: Create runbooks that say 'contact the database team' or 'escalate to senior SRE' as their primary remediation step — these are escalation paths, not runbooks, and they do not reduce MTTR or empower on-call engineers to act independently.
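A sketch of how a team might audit its alert rules for missing runbook links. The `runbook_url` annotation key is a common Prometheus convention rather than a requirement, and the rule structure here is simplified:

```python
def alerts_missing_runbooks(rules: list[dict]) -> list[str]:
    """Return the names of alert rules that have no runbook_url annotation."""
    return [r["alert"] for r in rules
            if not r.get("annotations", {}).get("runbook_url")]

rules = [
    {"alert": "HighErrorBudgetBurn",
     "annotations": {"runbook_url": "https://wiki.example.com/runbooks/burn"}},
    {"alert": "DiskPressure",
     "annotations": {"summary": "disk filling up"}},  # no runbook linked
]

print(alerts_missing_runbooks(rules))  # -> ['DiskPressure']
```

Run as a CI check against the alerting config, this makes "every alert has a runbook" an enforced invariant instead of a team aspiration.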

Enforce a 50% Toil Budget and Track Toil Weekly During On-Call Rotations

SRE's foundational principle is that toil — manual, repetitive, automatable operational work — should never exceed 50% of an SRE's working time. Without explicit measurement, toil silently expands to fill all available capacity, leaving no time for the engineering work that reduces future toil. Teams must track toil hours explicitly, not estimate them, and treat toil reduction as a first-class engineering deliverable with sprint tickets and OKRs.

✓ Do: Implement a lightweight toil tracking system (even a shared spreadsheet works) where on-call engineers log each manual task, time spent, and whether it is automatable. Review toil metrics in weekly team syncs and escalate to engineering leadership if toil exceeds 40% for two consecutive weeks.
✗ Don't: Accept 'we'll automate it eventually' as a strategy for recurring toil. If a task has been manually performed more than 5 times without an automation ticket being created and prioritized, it is a process failure, not a backlog item.
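The weekly tracking and escalation rule above reduces to a few lines. The 40-hour on-call week in the example is an assumption for illustration:

```python
def toil_pct(toil_hours: float, oncall_hours: float) -> float:
    """Toil as a percentage of total on-call hours for one week."""
    return 100 * toil_hours / oncall_hours

def should_escalate(weekly_toil_pct: list[float], ceiling: float = 40.0) -> bool:
    """Escalate to leadership when toil exceeds the ceiling two weeks in a row."""
    return any(a > ceiling and b > ceiling
               for a, b in zip(weekly_toil_pct, weekly_toil_pct[1:]))

weeks = [toil_pct(14, 40), toil_pct(18, 40), toil_pct(17, 40)]  # 35%, 45%, 42.5%
print(should_escalate(weeks))  # -> True: weeks 2 and 3 both exceed 40%
```

The two-consecutive-weeks condition filters out one-off bad rotations while still catching a sustained drift past the budget.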

Conduct Blameless Post-Mortems Within 48 Hours of Every SEV-1 and SEV-2 Incident

Post-mortems conducted more than 48 hours after an incident suffer from memory decay, reconstructed timelines, and reduced emotional urgency that leads to shallow analysis and weak action items. The 48-hour window ensures that contributing factors are fresh, timeline data from logs and monitoring is still easily accessible, and the team's motivation to prevent recurrence is highest. Blameless framing is not about excusing poor decisions — it is about understanding the systemic conditions that made those decisions likely.

✓ Do: Use a structured post-mortem template that explicitly asks 'what conditions made this failure mode possible?' rather than 'who made this mistake?' Assign every action item a specific owner and a due date no longer than 30 days, and track completion in a shared SRE backlog reviewed weekly.
✗ Don't: Use post-mortems as performance management tools or include direct management chains in the meeting by default. When engineers fear that post-mortem participation will affect their performance review, they stop surfacing near-misses and the organization loses its most valuable reliability signal.

Build Better Documentation with Docsie

Join thousands of teams creating outstanding documentation

Start Free Trial