Designing Systems That Survive Production Traffic
Production traffic is nothing like your load tests. It comes in bursts. It has pathological patterns. Users do things you never imagined. Here’s how to build systems that survive.
Load Shedding
When your system is overloaded, it’s better to reject some requests quickly than to serve all requests slowly. Slow responses are worse than fast errors — they tie up client connections and cascade downstream.
func LoadSheddingMiddleware(maxConcurrent int) func(http.Handler) http.Handler {
	sem := make(chan struct{}, maxConcurrent)
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			select {
			case sem <- struct{}{}:
				defer func() { <-sem }()
				next.ServeHTTP(w, r)
			default:
				shedCounter.Inc()
				http.Error(w, "service overloaded", http.StatusServiceUnavailable)
			}
		})
	}
}
Set maxConcurrent based on your capacity testing. When all slots are taken, new requests get 503 immediately.
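To see the shedding behavior in isolation, you can drive the pattern past its limit with an in-process request. This is a self-contained sketch: `shed` repeats the select-on-semaphore pattern above without the metrics counter, and `demoShed` is a hypothetical demonstration helper.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// shed is the same select-on-semaphore pattern as the middleware above,
// minus the metrics counter, so this example compiles on its own.
func shed(maxConcurrent int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxConcurrent)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "service overloaded", http.StatusServiceUnavailable)
		}
	})
}

// demoShed parks one request in the only slot, then shows a second
// request being rejected immediately with 503.
func demoShed() int {
	entered := make(chan struct{})
	release := make(chan struct{})
	defer close(release)
	h := shed(1, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(entered)
		<-release // hold the slot until the demo is done
	}))
	go h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	<-entered // the first request now holds the only slot
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	return rec.Code // 503: shed without queuing
}
```

The second request never waits: the `default` arm of the select fires the moment the semaphore is full.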
Graceful Degradation
Not all features are equally important. When under pressure, shed non-essential features:
func (s *ProductService) GetProduct(ctx context.Context, id string) (*Product, error) {
	product, err := s.repo.GetProduct(ctx, id)
	if err != nil {
		return nil, err
	}

	// Non-essential: recommendations
	if !s.degraded.Load() {
		recs, err := s.recService.Get(ctx, id)
		if err != nil {
			slog.Warn("recommendations unavailable", "error", err)
			// Continue without recommendations
		} else {
			product.Recommendations = recs
		}
	}

	// Non-essential: reviews
	if !s.degraded.Load() {
		reviews, err := s.reviewService.Get(ctx, id)
		if err != nil {
			slog.Warn("reviews unavailable", "error", err)
		} else {
			product.Reviews = reviews
		}
	}

	return product, nil
}
In degraded mode, skip optional features entirely. The core product page loads in 20ms instead of 200ms.
Timeout Budgets
A request has a total time budget. Divide it across operations:
func (s *OrderService) PlaceOrder(ctx context.Context, order Order) error {
	// Total budget: 10 seconds
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	// Inventory: 3s budget
	invCtx, invCancel := context.WithTimeout(ctx, 3*time.Second)
	defer invCancel()
	if err := s.inventory.Reserve(invCtx, order.Items); err != nil {
		return fmt.Errorf("inventory: %w", err)
	}

	// Payment: 5s budget (external, slower)
	payCtx, payCancel := context.WithTimeout(ctx, 5*time.Second)
	defer payCancel()
	if err := s.payment.Charge(payCtx, order.Total); err != nil {
		s.inventory.Release(ctx, order.Items)
		return fmt.Errorf("payment: %w", err)
	}

	// DB write: 2s budget
	dbCtx, dbCancel := context.WithTimeout(ctx, 2*time.Second)
	defer dbCancel()
	return s.repo.Create(dbCtx, order)
}
If any step exceeds its budget, it fails fast. Because each child context derives from the parent, the parent's 10-second deadline caps the whole operation even when the per-step budgets would otherwise add up past it.
Bulkheads
Isolate failures so one bad dependency doesn’t take down everything:
type BulkheadedService struct {
	criticalPool    chan struct{} // 50 concurrent
	nonCriticalPool chan struct{} // 20 concurrent
}

func (s *BulkheadedService) CriticalOperation(ctx context.Context) error {
	select {
	case s.criticalPool <- struct{}{}:
		defer func() { <-s.criticalPool }()
		return s.doCritical(ctx)
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (s *BulkheadedService) NonCriticalOperation(ctx context.Context) error {
	select {
	case s.nonCriticalPool <- struct{}{}:
		defer func() { <-s.nonCriticalPool }()
		return s.doNonCritical(ctx)
	default:
		return ErrNonCriticalShed // Immediately reject
	}
}
If the non-critical dependency is slow, it only exhausts its own pool. Critical operations are unaffected.
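The pool sizes live in the buffered channels' capacities, fixed once at construction. A sketch of how they might be wired up; the constructor and its parameters are assumptions, not part of the code above:

```go
package main

// BulkheadedService mirrors the struct above; each pool is a buffered
// channel whose capacity is that bulkhead's concurrency limit.
type BulkheadedService struct {
	criticalPool    chan struct{}
	nonCriticalPool chan struct{}
}

// NewBulkheadedService sizes each bulkhead independently, so a deploy can
// tune them without touching the operation code.
func NewBulkheadedService(criticalSlots, nonCriticalSlots int) *BulkheadedService {
	return &BulkheadedService{
		criticalPool:    make(chan struct{}, criticalSlots),
		nonCriticalPool: make(chan struct{}, nonCriticalSlots),
	}
}
```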
Preventing Retry Storms
When a service recovers from an outage, all clients retry simultaneously — creating an even bigger spike than the original traffic.
Exponential backoff with jitter:
func retryWithJitter(attempt int) time.Duration {
	base := time.Duration(math.Pow(2, float64(attempt))) * time.Second
	if base > 30*time.Second {
		base = 30 * time.Second // cap so late attempts don't wait minutes (or overflow)
	}
	jitter := time.Duration(rand.Int63n(int64(base) / 2))
	return base + jitter
}
Client-side retry budget:
type RetryBudget struct {
	mu       sync.Mutex
	attempts int
	window   time.Duration
	max      int
	reset    time.Time
}

func (b *RetryBudget) CanRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if time.Now().After(b.reset) {
		b.attempts = 0
		b.reset = time.Now().Add(b.window)
	}
	if b.attempts >= b.max {
		return false
	}
	b.attempts++
	return true
}
Capacity Planning
Know your limits before production teaches you:
func loadTest() {
	// Find the breaking point
	for rps := 100; rps <= 10000; rps += 100 {
		result := runBenchmark(rps, 60*time.Second)
		fmt.Printf("RPS: %d, P50: %v, P99: %v, Errors: %.1f%%\n",
			rps, result.P50, result.P99, result.ErrorRate*100)
		if result.ErrorRate > 0.01 || result.P99 > 2*time.Second {
			fmt.Printf("Breaking point: %d RPS\n", rps)
			break
		}
	}
}
Set alerts at 70% of your breaking point. Scale up before you hit the wall.
The Survival Checklist
- Load shedding: reject fast when overloaded
- Circuit breakers: stop calling failing dependencies
- Graceful degradation: shed non-essential features under pressure
- Timeout budgets: no open-ended waits
- Bulkheads: isolate failure domains
- Retry budgets: prevent retry storms
- Health checks: let load balancers route around failures
- Capacity planning: know your limits
Production will test every assumption you make about how systems behave under stress. These patterns don’t prevent failures — they make failures survivable.