Handling Partial Failures in Distributed Go Systems
In distributed systems, failure is not an exception — it’s a feature of the environment. The database is slow. The downstream service returns 500. The network drops packets. Your system needs to handle these gracefully, not catastrophically.
The Reality of Partial Failure
In a monolith, a request largely succeeds or fails as a unit. In distributed systems, service A might be fine while service B is on fire. This creates states that don’t exist in monoliths:
- Order created successfully, but notification service failed
- Payment charged, but inventory service timed out
- User registered, but welcome email never sent
You can’t prevent partial failures. You can only design for them.
Circuit Breakers
When a downstream service is failing, stop hitting it. A circuit breaker tracks failure rates and “opens” when failures exceed a threshold:
type CircuitBreaker struct {
	mu          sync.Mutex
	state       State
	failures    int
	successes   int
	threshold   int
	timeout     time.Duration
	lastFailure time.Time
}
type State int

const (
	Closed   State = iota // Normal operation
	Open                  // Failing, reject requests
	HalfOpen              // Testing if service recovered
)
func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	switch cb.state {
	case Open:
		if time.Since(cb.lastFailure) > cb.timeout {
			cb.state = HalfOpen
			cb.mu.Unlock()
			return cb.tryHalfOpen(fn)
		}
		cb.mu.Unlock()
		return ErrCircuitOpen
	case HalfOpen:
		cb.mu.Unlock()
		return cb.tryHalfOpen(fn)
	default:
		cb.mu.Unlock()
		return cb.tryClosed(fn)
	}
}
func (cb *CircuitBreaker) tryClosed(fn func() error) error {
	err := fn()
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.threshold {
			cb.state = Open
			cb.lastFailure = time.Now()
			slog.Warn("circuit breaker opened", "failures", cb.failures)
		}
		return err
	}
	cb.failures = 0
	return nil
}
When the circuit opens, requests fail immediately with ErrCircuitOpen instead of waiting for a timeout. This protects your service from cascading failures and gives the downstream service time to recover.
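Execute above calls tryHalfOpen, which the snippet leaves out. A minimal sketch might look like the following; the type declarations are repeated so it compiles on its own, and reusing threshold as the number of successes needed to close the circuit is an assumption, not part of the original:

```go
package main

import (
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Repeated from the earlier snippet so this compiles standalone.
type CircuitBreaker struct {
	mu          sync.Mutex
	state       State
	failures    int
	successes   int
	threshold   int
	timeout     time.Duration
	lastFailure time.Time
}

// tryHalfOpen lets a probe call through. A failure re-opens the circuit
// immediately; enough consecutive successes close it again.
func (cb *CircuitBreaker) tryHalfOpen(fn func() error) error {
	err := fn()
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.state = Open
		cb.successes = 0
		cb.lastFailure = time.Now()
		return err
	}
	cb.successes++
	if cb.successes >= cb.threshold {
		cb.state = Closed
		cb.failures = 0
		cb.successes = 0
	}
	return nil
}
```

Requiring several successes before closing avoids flapping when the downstream service is only intermittently healthy.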
Retry Budgets
Not all errors deserve retries. And unlimited retries can amplify failures.
type RetryBudget struct {
	maxRetries      int
	retryableErrors map[int]bool // HTTP status codes worth retrying
}
func (rb *RetryBudget) ShouldRetry(attempt int, err error) bool {
	if attempt >= rb.maxRetries {
		return false
	}
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return rb.retryableErrors[httpErr.StatusCode]
	}
	// Network timeouts are retryable. (net.Error's Temporary method is
	// deprecated since Go 1.18; Timeout is the reliable signal.)
	var netErr net.Error
	if errors.As(err, &netErr) {
		return netErr.Timeout()
	}
	return false
}
func WithRetry(ctx context.Context, budget *RetryBudget, fn func() error) error {
	var lastErr error
	for attempt := 0; attempt <= budget.maxRetries; attempt++ {
		lastErr = fn()
		if lastErr == nil {
			return nil
		}
		if !budget.ShouldRetry(attempt, lastErr) {
			return lastErr
		}
		delay := backoffDelay(attempt)
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return lastErr
}
Retry 429 (rate limited) and 503 (overloaded). Don’t retry 400 (bad request) or 404 (not found) — those won’t magically succeed.
Fallbacks and Degraded Mode
When a dependency fails, return something useful instead of an error:
func (s *ProductService) GetRecommendations(ctx context.Context, userID string) ([]Product, error) {
	recs, err := s.recommendationClient.Get(ctx, userID)
	if err != nil {
		slog.Warn("recommendation service unavailable, using fallback",
			"user_id", userID,
			"error", err,
		)
		// Return popular products instead of an error
		return s.cache.GetPopularProducts(ctx)
	}
	return recs, nil
}
The user gets a slightly worse experience instead of an error page. Define fallback strategies for every non-critical dependency.
The Saga Pattern
For distributed transactions, use Sagas — a sequence of local transactions with compensating actions for rollback:
type Step struct {
	Name       string
	Execute    func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

type Saga struct {
	steps     []Step
	completed []Step
}
func (s *Saga) Run(ctx context.Context) error {
	for _, step := range s.steps {
		if err := step.Execute(ctx); err != nil {
			slog.Error("saga step failed, compensating",
				"step", step.Name,
				"error", err,
			)
			return s.compensate(ctx, err)
		}
		s.completed = append(s.completed, step)
	}
	return nil
}
func (s *Saga) compensate(ctx context.Context, originalErr error) error {
	// Run compensations in reverse order
	for i := len(s.completed) - 1; i >= 0; i-- {
		step := s.completed[i]
		if step.Compensate == nil {
			continue
		}
		if err := step.Compensate(ctx); err != nil {
			slog.Error("compensation failed",
				"step", step.Name,
				"error", err,
			)
			// Log and continue — best effort compensation
		}
	}
	return originalErr
}
Usage for an order placement flow:
saga := &Saga{
	steps: []Step{
		{
			Name:       "reserve_inventory",
			Execute:    func(ctx context.Context) error { return inventory.Reserve(ctx, items) },
			Compensate: func(ctx context.Context) error { return inventory.Release(ctx, items) },
		},
		{
			Name:       "charge_payment",
			Execute:    func(ctx context.Context) error { return payment.Charge(ctx, amount) },
			Compensate: func(ctx context.Context) error { return payment.Refund(ctx, amount) },
		},
		{
			Name:    "create_order",
			Execute: func(ctx context.Context) error { return orders.Create(ctx, order) },
			// No compensation — order creation is the final step
		},
	},
}
if err := saga.Run(ctx); err != nil {
	return fmt.Errorf("order placement failed: %w", err)
}
If payment fails after inventory is reserved, the saga automatically releases the inventory.
Timeouts Everywhere
Every external call needs a timeout. No exceptions.
func (c *Client) GetUser(ctx context.Context, id string) (*User, error) {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("%s/users/%s", c.baseURL, id), nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("get user: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, &HTTPError{StatusCode: resp.StatusCode}
	}
	var user User
	if err := json.NewDecoder(resp.Body).Decode(&user); err != nil {
		return nil, fmt.Errorf("decode user: %w", err)
	}
	return &user, nil
}
Default HTTP client timeout + per-request context timeout + connection pool idle timeout. Layer timeouts so there’s always a safety net.