API Rate Limiting

Master this essential API concept

Quick Definition

A technique used to control how many API requests a user or system can make within a defined time period, preventing overuse and ensuring service stability.

How API Rate Limiting Works

```mermaid
sequenceDiagram
    participant Client as API Client
    participant Gateway as API Gateway
    participant Counter as Rate Limit Counter
    participant Backend as Backend Service
    Client->>Gateway: POST /api/data (Request #46)
    Gateway->>Counter: Check request count for Client ID
    Counter-->>Gateway: 45/60 requests used
    Gateway->>Backend: Forward request
    Backend-->>Gateway: 200 OK Response
    Gateway-->>Client: 200 OK (X-RateLimit-Remaining: 14)
    Client->>Gateway: POST /api/data (Request #61)
    Gateway->>Counter: Check request count for Client ID
    Counter-->>Gateway: 60/60 requests used (LIMIT REACHED)
    Gateway-->>Client: 429 Too Many Requests (Retry-After: 30s)
    Note over Client,Gateway: Client must wait for window reset
    Note over Counter: 60-second window resets
    Client->>Gateway: POST /api/data (Request #1, new window)
    Gateway->>Counter: Check request count for Client ID
    Counter-->>Gateway: 1/60 requests used
    Gateway->>Backend: Forward request
    Backend-->>Gateway: 200 OK Response
    Gateway-->>Client: 200 OK (X-RateLimit-Remaining: 59)
```

Understanding API Rate Limiting

API rate limiting controls how many requests a user or system can make within a defined time period. It is typically enforced at an API gateway, which counts requests per API key, tenant, or IP address using an algorithm such as a sliding window or token bucket. Requests within the quota are forwarded to the backend; requests beyond it receive an HTTP 429 Too Many Requests response, usually with headers that tell the client how much quota remains and when the window resets. This protects backends from overload and abuse while keeping throughput predictable for legitimate consumers.

Key Features

  • Per-client quotas enforced per API key, tenant, or IP address
  • Tiered limits that differentiate free, paid, and enterprise consumers
  • Standardized 429 responses with Retry-After and X-RateLimit-* headers
  • Counting algorithms such as sliding windows and token buckets

Benefits for Documentation Teams

  • Reduces repetitive documentation tasks
  • Improves content consistency
  • Enables better content reuse
  • Streamlines review processes

Turn Videos into Engineering Documents

Use Docsie to convert training videos, screen recordings, and Zoom calls into ready-to-publish engineering templates. Download free templates below, or generate documentation from video.

Documenting API Rate Limiting Policies from Team Recordings

When your team establishes API rate limiting thresholds, the decisions behind those limits — why 1,000 requests per minute was chosen over 500, how burst allowances were calculated, what happens when a client exceeds the quota — often get explained once in a technical walkthrough or architecture review and never written down. That institutional knowledge lives in a recording that nobody can search.

The practical problem surfaces when a developer needs to integrate with your API at 11pm and hits a 429 error they don't understand, or when a new team member needs to explain your rate limiting strategy to a client. Scrubbing through a 45-minute onboarding video to find the two minutes where someone explains the retry-after header behavior is not a sustainable workflow.

Converting those recordings into structured documentation changes how your team works with this concept. Instead of rewatching a sprint demo, a developer can search for "rate limit headers" and land directly on the relevant section — complete with the context your architect explained verbally, now captured as readable, linkable reference material. Your API rate limiting policies become something you can version, update, and share alongside your API docs rather than something buried in a video archive.

If your team regularly explains technical policies like this in meetings or training sessions, see how converting those recordings into searchable documentation can close that knowledge gap →

Real-World Documentation Use Cases

Protecting a Public Weather API from Scraper Abuse During Viral Events

Problem

A public weather API experiences 50x normal traffic spikes when a major storm goes viral on social media. Uncontrolled scrapers and bots hammer the endpoint, causing latency to spike from 80ms to 8 seconds for legitimate paying customers and eventually crashing the backend database.

Solution

API rate limiting enforces per-API-key quotas of 100 requests per minute for free-tier users and 1,000 requests per minute for paid subscribers, ensuring scrapers hitting the limit receive 429 responses while paying customers maintain consistent throughput.

Implementation

  • Define tiered rate limit policies in the API gateway (e.g., Kong or AWS API Gateway): free=100 req/min, pro=1000 req/min, enterprise=unlimited with burst allowance.
  • Assign unique API keys to each consumer and tag keys with their tier; configure the gateway to track request counts per key using a sliding window algorithm in Redis.
  • Return standardized 429 responses with headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After so clients can self-throttle.
  • Set up alerting dashboards in Datadog to monitor rate-limit hit rates by tier and identify abusive API keys for potential suspension or upselling.
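The per-key sliding-window tracking in the steps above can be sketched in Python. This in-memory version stands in for the Redis sorted-set approach the gateway would use in production; the class name and tier table are illustrative, not part of any gateway's API.

```python
import time
from collections import defaultdict, deque

# Illustrative tier limits from the scenario above (requests per minute).
TIER_LIMITS = {"free": 100, "pro": 1000}
WINDOW_SECONDS = 60

class SlidingWindowLimiter:
    """In-memory stand-in for the Redis sorted-set approach described above."""

    def __init__(self):
        self.requests = defaultdict(deque)  # api_key -> request timestamps

    def allow(self, api_key, tier, now=None):
        now = time.time() if now is None else now
        window = self.requests[api_key]
        # Evict timestamps that have fallen out of the rolling 60s window.
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()
        if len(window) >= TIER_LIMITS[tier]:
            return False  # the gateway would return 429 with Retry-After here
        window.append(now)
        return True
```

Because the window slides rather than resetting on a fixed boundary, a free-tier key that burns its 100 requests regains capacity gradually as old timestamps age out, not all at once.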

Expected Outcome

API p99 latency stays below 200ms even during viral traffic events, paying customers experience zero degradation, and abuse incidents are automatically contained without manual intervention.

Preventing a CI/CD Pipeline from Exhausting GitHub API Quota During Mass Deployments

Problem

An engineering team runs 200+ microservices with CI/CD pipelines that all call the GitHub REST API to fetch commit metadata, check PR status, and post deployment comments. During a large release, all pipelines trigger simultaneously and exhaust the 5,000 requests/hour OAuth token limit, causing pipelines to fail with cryptic authentication errors.

Solution

Implementing client-side rate limiting with exponential backoff in the CI/CD tooling, combined with a shared token pool and server-side quota monitoring, ensures pipelines respect GitHub's limits and gracefully queue requests instead of failing.

Implementation

  • Audit all GitHub API calls across pipelines using GitHub's rate limit endpoint (/rate_limit) and identify the top 5 highest-frequency call patterns (e.g., GET /repos/{owner}/{repo}/commits).
  • Implement a centralized GitHub API proxy service with a request queue that enforces a maximum of 4,500 requests/hour (leaving 500 as buffer) using a token bucket algorithm.
  • Add exponential backoff with jitter to all API clients: first retry after 2s, then 4s, 8s, up to a 60s maximum, and surface remaining quota in pipeline logs.
  • Rotate between 3 GitHub OAuth tokens (service accounts) using round-robin distribution to effectively triple the available hourly quota to 15,000 requests.
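The proxy's token bucket from the steps above can be sketched as follows: 4,500 requests/hour works out to 1.25 tokens per second, and the bucket's capacity bounds how large a burst can pass. The class and parameter names are illustrative, not part of GitHub's API.

```python
import time

# Sketch of the proxy's token bucket: 4,500 requests/hour is 1.25 tokens/sec.
class TokenBucket:
    def __init__(self, rate_per_sec, capacity, now=None):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.last = time.monotonic() if now is None else now

    def try_acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the proxy queues the request instead of calling GitHub
```

A returned `False` is the proxy's cue to hold the request in its queue rather than spend quota, which is what turns hard pipeline failures into graceful waiting.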

Expected Outcome

Pipeline failure rate due to GitHub API exhaustion drops from 15% during peak deployments to 0%, and release cycle time decreases because pipelines no longer need manual restarts.

Enforcing Fair Usage Across Tenants in a Multi-Tenant SaaS Analytics Platform

Problem

In a multi-tenant SaaS platform, a single enterprise customer running automated ETL jobs consumes 80% of the shared API capacity, causing degraded response times for 200 other tenants. The operations team receives support tickets blaming the platform for being slow, unaware that one tenant is the root cause.

Solution

Tenant-level rate limiting with isolated quotas per organization ID ensures no single tenant can monopolize API resources, and real-time quota dashboards give both the platform team and tenants visibility into their usage patterns.

Implementation

  • Implement rate limiting at the API gateway layer keyed on the tenant Organization ID extracted from the JWT token, with configurable per-tenant limits stored in a central configuration service.
  • Set baseline limits (e.g., 500 req/min per tenant) with the ability to grant temporary burst allowances via an admin API for tenants with legitimate high-volume needs.
  • Build a self-service usage dashboard in the customer portal showing real-time request counts, quota percentage used, historical trends, and projected time to quota reset.
  • Configure automated email alerts to tenant admins when they reach 80% of their quota, and provide documentation on batching API calls and using webhooks as alternatives to polling.
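A minimal sketch of the tenant-keyed check described above, including the 80% alert threshold. In the real setup the per-tenant limits would live in the central configuration service; here a plain dict stands in, and the function name is illustrative.

```python
# Baseline from the steps above: 500 requests per minute per tenant.
DEFAULT_LIMIT = 500

def check_tenant(counts, org_id, limits):
    """Return (allowed, near_quota) for one incoming request in this window."""
    limit = limits.get(org_id, DEFAULT_LIMIT)  # per-tenant override or baseline
    used = counts.get(org_id, 0)
    if used >= limit:
        return (False, True)  # 429 for this tenant only; others are unaffected
    counts[org_id] = used + 1
    # near_quota drives the automated 80% email alert to tenant admins.
    return (True, (used + 1) >= 0.8 * limit)
```

Because the counter is keyed on the organization ID, one tenant's runaway ETL jobs exhaust only that tenant's quota instead of the shared capacity.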

Expected Outcome

P95 API response time improves from 2.1 seconds to 340ms for all tenants, support tickets related to API slowness decrease by 70%, and the problematic tenant proactively optimizes their ETL jobs after seeing their usage dashboard.

Securing a Financial Services API Against Credential Stuffing Attacks

Problem

A banking API's authentication endpoint receives thousands of login attempts per second from credential stuffing bots testing leaked username/password combinations. The attacks bypass traditional IP blocking because they originate from distributed residential proxy networks with thousands of unique IP addresses.

Solution

Layered rate limiting combining per-IP limits, per-username limits, and global endpoint limits detects and throttles brute-force patterns even when distributed across many source IPs, protecting accounts without blocking legitimate users.

Implementation

  • Apply a strict rate limit of 5 login attempts per username per 15-minute window, regardless of source IP, to prevent distributed credential stuffing targeting specific accounts.
  • Implement a per-IP limit of 20 authentication requests per minute as a secondary layer, with automatic temporary IP blocking (1 hour) after 3 consecutive rate limit violations from the same IP.
  • Add a global circuit breaker that reduces the authentication endpoint's rate limit by 50% automatically when the error rate exceeds 30% of requests, protecting backend infrastructure during large-scale attacks.
  • Integrate rate limit events with the SIEM system (e.g., Splunk) to trigger real-time fraud alerts when more than 1,000 unique usernames are rate-limited within a 5-minute window, indicating a coordinated attack.
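The layered check in the first two steps above can be sketched as two independent counters that a login attempt must pass in sequence. Plain dicts stand in for whatever store the gateway uses, window eviction is omitted for brevity, and the function name is illustrative.

```python
# Limits from the steps above: 5 attempts per username per 15-minute window,
# 20 authentication requests per IP per minute.
USERNAME_LIMIT = 5
IP_LIMIT = 20

def allow_login(user_counts, ip_counts, username, ip):
    # Per-username layer: catches stuffing even when spread across many IPs.
    if user_counts.get(username, 0) >= USERNAME_LIMIT:
        return False
    # Per-IP layer: throttles any single source hammering the endpoint.
    if ip_counts.get(ip, 0) >= IP_LIMIT:
        return False
    user_counts[username] = user_counts.get(username, 0) + 1
    ip_counts[ip] = ip_counts.get(ip, 0) + 1
    return True
```

The per-username layer is what defeats residential proxy networks: rotating through thousands of IPs does not help an attacker once a targeted account's five attempts are spent.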

Expected Outcome

Credential stuffing attack success rate drops by 99.7%, account takeover incidents decrease from an average of 12 per month to less than 1, and the security team receives automated attack detection alerts within 2 minutes of an attack starting.

Best Practices

Return Standardized Rate Limit Headers on Every API Response

Clients need real-time visibility into their quota status to implement intelligent throttling on their side. Including X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After headers on every response—not just 429 errors—allows well-behaved clients to proactively slow down before hitting limits rather than reactively handling errors.

✓ Do: Include rate limit headers on all responses (200, 400, 500) so clients can monitor their consumption continuously and implement predictive throttling. Follow the IETF RateLimit Headers draft standard for interoperability.
✗ Don't: Return rate limit information only in 429 error responses. Clients that first see quota data once they have already exceeded the limit cannot implement proactive backoff strategies.
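A hedged sketch of attaching quota headers to every response, per the practice above. The `limit`, `remaining`, and `reset_at` values would come from the rate limiter; the header names follow the common X-RateLimit-* convention used elsewhere in this article.

```python
import time

def rate_limit_headers(limit, remaining, reset_at):
    """Build quota headers for any response, not just 429s."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_at),  # epoch seconds of window reset
    }
    if remaining == 0:
        # Retry-After only makes sense once the client is out of quota.
        headers["Retry-After"] = str(max(0, reset_at - int(time.time())))
    return headers
```

A well-behaved client can watch `X-RateLimit-Remaining` shrink on every 200 response and slow itself down before it ever sees a 429.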

Use Sliding Window Counters Instead of Fixed Window Counters

Fixed window rate limiting resets counters at rigid intervals (e.g., every 60 seconds on the minute), creating a 'double burst' vulnerability where a client can make 100 requests at 11:59 and another 100 at 12:00, effectively sending 200 requests in 2 seconds. Sliding window algorithms track requests within a rolling time period, preventing burst exploitation at window boundaries.

✓ Do: Implement sliding window log or sliding window counter algorithms using Redis sorted sets or similar structures to accurately measure request rates across any arbitrary time window.
✗ Don't: Use fixed window counters for security-sensitive endpoints such as authentication or payment APIs, where burst exploitation at window resets could enable attacks or overload the backend.
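The 'double burst' described above is easy to demonstrate: with a fixed 60-second window, 100 requests just before the boundary and 100 just after all pass, even though they land 0.2 seconds apart, while a sliding window caps the same burst at the limit. The numbers and function names are illustrative.

```python
from collections import deque

LIMIT, WINDOW = 100, 60

def fixed_window_allows(timestamps):
    # Counter keyed by which fixed window each request falls in.
    counts, allowed = {}, 0
    for t in timestamps:
        bucket = int(t // WINDOW)
        if counts.get(bucket, 0) < LIMIT:
            counts[bucket] = counts.get(bucket, 0) + 1
            allowed += 1
    return allowed

def sliding_window_allows(timestamps):
    # Rolling log of timestamps within the last WINDOW seconds.
    window, allowed = deque(), 0
    for t in timestamps:
        while window and window[0] <= t - WINDOW:
            window.popleft()
        if len(window) < LIMIT:
            window.append(t)
            allowed += 1
    return allowed

burst = [59.9] * 100 + [60.1] * 100  # 200 requests within 0.2 seconds
```

The fixed-window check admits all 200 requests in that burst; the sliding-window check admits only 100.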

Differentiate Rate Limits by API Endpoint Sensitivity and Cost

Not all API endpoints have equal computational cost or security sensitivity. A GET /health endpoint costs microseconds while POST /reports/generate triggers a 30-second database query. Applying a single uniform rate limit across all endpoints either over-restricts cheap endpoints or under-protects expensive ones, wasting capacity or leaving the system vulnerable.

✓ Do: Define per-endpoint rate limits based on measured backend processing cost, data sensitivity, and abuse potential. Apply stricter limits (e.g., 10 req/min) to expensive compute or data-heavy endpoints and more permissive limits (e.g., 1000 req/min) to lightweight read endpoints.
✗ Don't: Apply a single global rate limit across all API endpoints. That lets a client hammering an expensive report-generation endpoint consume quota that then blocks its own lightweight status-check calls.

Implement Exponential Backoff with Jitter in API Client Libraries

When multiple clients hit a rate limit simultaneously and all retry at the same fixed interval, they create a 'thundering herd' that immediately re-saturates the API the moment the window resets. Exponential backoff with random jitter spreads retry attempts across time, preventing synchronized retry storms and allowing the server to recover gracefully.

✓ Do: Implement retry logic that doubles the wait time after each failed attempt (e.g., 1s, 2s, 4s, 8s) and adds a random jitter of ±20% to desynchronize retries across clients. Set a maximum retry count (e.g., 5) and maximum wait time (e.g., 60s) to prevent infinite loops.
✗ Don't: Implement fixed-interval retries (e.g., retry every 5 seconds) or immediate retries on 429 responses. Fixed retries from thousands of clients hitting the limit simultaneously create retry storms that overwhelm the API the moment each window resets.
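The retry schedule above can be sketched as a small helper: delays double from a base of 1 second, a ±20% jitter desynchronizes clients, and both the retry count and the wait time are capped. The function name and parameters are illustrative.

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=None):
    """Compute the wait before each retry attempt, in seconds."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        nominal = min(cap, base * (2 ** attempt))  # 1, 2, 4, 8, 16, ...
        jitter = nominal * rng.uniform(-0.2, 0.2)  # +/-20% spreads clients out
        delays.append(nominal + jitter)
    return delays
```

Without the jitter term, every client that hit the limit at the same moment would also retry at the same moment, recreating the thundering herd on each attempt.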

Monitor Rate Limit Hit Rates as a Primary API Health Metric

The percentage of requests resulting in 429 responses is a leading indicator of both abuse patterns and misconfigured client applications. A sudden spike in rate limit hits can signal a DDoS attempt, a bug in a client application causing runaway API calls, or that a tier's quota has become too restrictive as legitimate usage grows. Treating this metric with the same urgency as error rates enables proactive capacity management.

✓ Do: Track rate limit hit rate (429s / total requests) per API key, per endpoint, and per tenant in real-time dashboards. Set alerting thresholds (e.g., alert when any single API key generates more than 20% 429 responses over 5 minutes) and review weekly trends to adjust quotas before they become a customer pain point.
✗ Don't: Treat 429 responses as non-events or exclude them from API error rate calculations. Ignoring rate limit metrics means discovering quota problems only when customers file support tickets, by which point they may have already churned or built workarounds that bypass your intended usage policies.
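The alerting rule above (flag any API key whose share of 429 responses exceeds 20% over the window) reduces to a small aggregation. In practice this would run over metrics from the gateway; here `events` is a list of (api_key, status) pairs and the function name is illustrative.

```python
from collections import Counter

def keys_over_threshold(events, threshold=0.20):
    """Return the API keys whose 429 rate exceeds the alert threshold."""
    total, limited = Counter(), Counter()
    for api_key, status in events:
        total[api_key] += 1
        if status == 429:
            limited[api_key] += 1
    return {k for k in total if limited[k] / total[k] > threshold}
```

The same aggregation keyed by endpoint or tenant instead of API key yields the other two dashboard views mentioned above.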


Build Better Documentation with Docsie

Join thousands of teams creating outstanding documentation

Start Free Trial