Designing Systems That Survive Production Traffic

Production traffic is nothing like your load tests. It comes in bursts. It has pathological patterns. Users do things you never imagined. Here’s how to build systems that survive.

Load Shedding

When your system is overloaded, it’s better to reject some requests quickly than to serve all requests slowly. Slow responses are worse than fast errors — they tie up client connections and cascade downstream.

func LoadSheddingMiddleware(maxConcurrent int) func(http.Handler) http.Handler {
    sem := make(chan struct{}, maxConcurrent)

    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            select {
            case sem <- struct{}{}:
                defer func() { <-sem }()
                next.ServeHTTP(w, r)
            default:
                shedCounter.Inc()
                http.Error(w, "service overloaded", http.StatusServiceUnavailable)
            }
        })
    }
}

Set maxConcurrent based on your capacity testing. When all slots are taken, new requests receive a 503 immediately instead of queuing.

Graceful Degradation

Not all features are equally important. When under pressure, shed non-essential features:

func (s *ProductService) GetProduct(ctx context.Context, id string) (*Product, error) {
    product, err := s.repo.GetProduct(ctx, id)
    if err != nil {
        return nil, err
    }

    // Non-essential: recommendations
    if !s.degraded.Load() {
        recs, err := s.recService.Get(ctx, id)
        if err != nil {
            slog.Warn("recommendations unavailable", "error", err)
            // Continue without recommendations
        } else {
            product.Recommendations = recs
        }
    }

    // Non-essential: reviews
    if !s.degraded.Load() {
        reviews, err := s.reviewService.Get(ctx, id)
        if err != nil {
            slog.Warn("reviews unavailable", "error", err)
        } else {
            product.Reviews = reviews
        }
    }

    return product, nil
}

In degraded mode, skip optional features entirely. The core product page loads in 20ms instead of 200ms.
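The code above assumes s.degraded is an atomic.Bool that something flips. One way to drive it, sketched here with illustrative names and thresholds, is to watch in-flight request counts with hysteresis so the flag doesn't flap:

```go
package main

import (
    "fmt"
    "sync/atomic"
)

// DegradeController flips a degraded flag when in-flight requests cross
// a high-water mark, and clears it below a lower mark so it doesn't flap.
// The type and thresholds are illustrative, not from the service above.
type DegradeController struct {
    degraded atomic.Bool
    inflight atomic.Int64
    high     int64 // enter degraded mode above this
    low      int64 // leave degraded mode below this
}

// Enter is called at the start of each request.
func (c *DegradeController) Enter() {
    if c.inflight.Add(1) > c.high {
        c.degraded.Store(true)
    }
}

// Exit is called (e.g. deferred) at the end of each request.
func (c *DegradeController) Exit() {
    if c.inflight.Add(-1) < c.low {
        c.degraded.Store(false)
    }
}

func main() {
    c := &DegradeController{high: 100, low: 50}
    for i := 0; i < 150; i++ {
        c.Enter()
    }
    fmt.Println(c.degraded.Load()) // true: over the high-water mark
    for i := 0; i < 120; i++ {
        c.Exit()
    }
    fmt.Println(c.degraded.Load()) // false: back under the low mark
}
```

The gap between high and low is the hysteresis band; without it, a service hovering near the threshold would toggle degraded mode on every request.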

Timeout Budgets

A request has a total time budget. Divide it across operations:

func (s *OrderService) PlaceOrder(ctx context.Context, order Order) error {
    // Total budget: 10 seconds
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()

    // Inventory: 3s budget
    invCtx, invCancel := context.WithTimeout(ctx, 3*time.Second)
    defer invCancel()
    if err := s.inventory.Reserve(invCtx, order.Items); err != nil {
        return fmt.Errorf("inventory: %w", err)
    }

    // Payment: 5s budget (external, slower)
    payCtx, payCancel := context.WithTimeout(ctx, 5*time.Second)
    defer payCancel()
    if err := s.payment.Charge(payCtx, order.Total); err != nil {
        // Best-effort compensation; note ctx may be near its deadline here.
        s.inventory.Release(ctx, order.Items)
        return fmt.Errorf("payment: %w", err)
    }

    // DB write: 2s budget
    dbCtx, dbCancel := context.WithTimeout(ctx, 2*time.Second)
    defer dbCancel()
    return s.repo.Create(dbCtx, order)
}

If any step exceeds its budget, it fails fast. The parent 10s context is a hard ceiling: context.WithTimeout never extends a parent's deadline, only shortens it, so the total can never exceed 10 seconds even when the per-step budgets would allow it.

Bulkheads

Isolate failures so one bad dependency doesn’t take down everything:

type BulkheadedService struct {
    criticalPool    chan struct{} // 50 concurrent
    nonCriticalPool chan struct{} // 20 concurrent
}

func (s *BulkheadedService) CriticalOperation(ctx context.Context) error {
    select {
    case s.criticalPool <- struct{}{}:
        defer func() { <-s.criticalPool }()
        return s.doCritical(ctx)
    case <-ctx.Done():
        return ctx.Err()
    }
}

func (s *BulkheadedService) NonCriticalOperation(ctx context.Context) error {
    select {
    case s.nonCriticalPool <- struct{}{}:
        defer func() { <-s.nonCriticalPool }()
        return s.doNonCritical(ctx)
    default:
        return ErrNonCriticalShed // Immediately reject
    }
}

If the non-critical dependency is slow, it only exhausts its own pool. Critical operations are unaffected.

Preventing Retry Storms

When a service recovers from an outage, all clients retry simultaneously — creating an even bigger spike than the original traffic.

Exponential backoff with jitter:

func retryWithJitter(attempt int) time.Duration {
    base := time.Duration(math.Pow(2, float64(attempt))) * time.Second
    jitter := time.Duration(rand.Int63n(int64(base) / 2))
    return base + jitter
}

Client-side retry budget:

type RetryBudget struct {
    mu       sync.Mutex
    attempts int
    window   time.Duration
    max      int
    reset    time.Time
}

func (b *RetryBudget) CanRetry() bool {
    b.mu.Lock()
    defer b.mu.Unlock()

    if time.Now().After(b.reset) {
        b.attempts = 0
        b.reset = time.Now().Add(b.window)
    }

    if b.attempts >= b.max {
        return false
    }
    b.attempts++
    return true
}

Capacity Planning

Know your limits before production teaches you:

func loadTest() {
    // Find the breaking point
    for rps := 100; rps <= 10000; rps += 100 {
        result := runBenchmark(rps, 60*time.Second)

        fmt.Printf("RPS: %d, P50: %v, P99: %v, Errors: %.1f%%\n",
            rps, result.P50, result.P99, result.ErrorRate*100)

        if result.ErrorRate > 0.01 || result.P99 > 2*time.Second {
            fmt.Printf("Breaking point: %d RPS\n", rps)
            break
        }
    }
}

Set alerts at 70% of your breaking point. Scale up before you hit the wall.

The Survival Checklist

  • Load shedding: reject fast when overloaded
  • Circuit breakers: stop calling failing dependencies
  • Graceful degradation: shed non-essential features under pressure
  • Timeout budgets: no open-ended waits
  • Bulkheads: isolate failure domains
  • Retry budgets: prevent retry storms
  • Health checks: let load balancers route around failures
  • Capacity planning: know your limits

Production will test every assumption you make about how systems behave under stress. These patterns don’t prevent failures — they make failures survivable.