
Reducing Latency in Go APIs: Lessons from Production

Our API’s P99 latency was 800ms. Users were complaining. After two weeks of profiling and optimization, we got it under 100ms. Here’s exactly what we did.

Step 1: Measure Everything

Before touching code, we instrumented every layer:

func (s *Service) GetOrder(ctx context.Context, id string) (*Order, error) {
    defer trackLatency("GetOrder", time.Now())

    order, err := s.cache.Get(ctx, id)
    if err == nil {
        cacheHits.Inc()
        return order, nil
    }
    cacheMisses.Inc()

    order, err = s.repo.GetOrder(ctx, id)
    if err != nil {
        return nil, err
    }

    s.cache.Set(ctx, id, order, 5*time.Minute)
    return order, nil
}

The breakdown revealed:

  • 60% of time: database queries
  • 25% of time: downstream HTTP calls
  • 10% of time: JSON serialization
  • 5% of time: application logic

Step 2: Fix the Database

The biggest offender was a missing composite index. One query was doing a sequential scan on a 50M-row table:

-- Before: 400ms (sequential scan)
SELECT * FROM orders WHERE customer_id = $1 AND status = 'active' ORDER BY created_at DESC LIMIT 10;

-- After adding index: 2ms
CREATE INDEX CONCURRENTLY idx_orders_customer_active
ON orders (customer_id, created_at DESC) WHERE status = 'active';

Partial indexes are criminally underused. If you always filter on status = 'active', index only those rows.

We also found N+1 queries hiding in a loop:

// Before: 50 queries for 50 orders
for _, order := range orders {
    items, _ := repo.GetItemsByOrderID(ctx, order.ID)
    order.Items = items
}

// After: 1 query
itemsByOrder, _ := repo.GetItemsByOrderIDs(ctx, orderIDs)
for _, order := range orders {
    order.Items = itemsByOrder[order.ID]
}
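The batched version can be sketched like this: assume GetItemsByOrderIDs runs a single query (e.g. `WHERE order_id = ANY($1)` in Postgres) and groups the flat rows in memory. The Item type and field names here are illustrative:

```go
package main

import "fmt"

// Item is illustrative; the real type lives in the repo layer.
type Item struct {
	OrderID string
	SKU     string
}

// groupItemsByOrder turns the flat result of one batched query
// (e.g. SELECT * FROM order_items WHERE order_id = ANY($1))
// into the per-order map used in the loop above.
func groupItemsByOrder(items []Item) map[string][]Item {
	byOrder := make(map[string][]Item)
	for _, it := range items {
		byOrder[it.OrderID] = append(byOrder[it.OrderID], it)
	}
	return byOrder
}

func main() {
	rows := []Item{{"o1", "sku-a"}, {"o1", "sku-b"}, {"o2", "sku-c"}}
	byOrder := groupItemsByOrder(rows)
	fmt.Println(len(byOrder["o1"]), len(byOrder["o2"])) // 2 1
}
```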

This alone cut 300ms off the P99.

Step 3: Fix Downstream Calls

Our service called three downstream services sequentially, even though none of them depended on another's result:

// Before: sequential, ~200ms total
user, _ := userService.Get(ctx, userID)        // 80ms
preferences, _ := prefService.Get(ctx, userID) // 70ms
history, _ := historyService.Get(ctx, userID)  // 50ms

// After: parallel independent calls, ~80ms total
g, ctx := errgroup.WithContext(ctx)

var user *User
var preferences *Preferences
var history *History

g.Go(func() error {
    var err error
    user, err = userService.Get(ctx, userID)
    return err
})
g.Go(func() error {
    var err error
    preferences, err = prefService.Get(ctx, userID)
    return err
})
g.Go(func() error {
    var err error
    history, err = historyService.Get(ctx, userID)
    return err
})

if err := g.Wait(); err != nil {
    return nil, err
}

Running the calls concurrently cut the downstream time from ~200ms to ~80ms, the duration of the slowest single call. As a bonus, errgroup.WithContext cancels the shared context on the first error, so the remaining calls can bail out early instead of running to completion.

Step 4: Add Caching

For data that changes infrequently, cache aggressively:

type TieredCache struct {
    local  *lru.Cache    // L1: in-process, ~1ms
    redis  *redis.Client // L2: network, ~5ms
}

func (c *TieredCache) Get(ctx context.Context, key string) ([]byte, error) {
    // L1
    if val, ok := c.local.Get(key); ok {
        return val.([]byte), nil
    }

    // L2
    val, err := c.redis.Get(ctx, key).Bytes()
    if err == redis.Nil {
        return nil, ErrCacheMiss
    }
    if err != nil {
        return nil, err // a real Redis error, not a miss
    }

    c.local.Add(key, val) // backfill L1 so the next read stays in-process
    return val, nil
}
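The read-through-and-backfill pattern generalizes to any number of tiers. A stdlib-only sketch of the same idea, where Layer and mapLayer are illustrative stand-ins for the lru and redis clients above:

```go
package main

import (
	"errors"
	"fmt"
)

var ErrCacheMiss = errors.New("cache miss")

// Layer is any cache tier: an in-process LRU, Redis, etc.
type Layer interface {
	Get(key string) ([]byte, error)
	Set(key string, val []byte)
}

// Tiered reads from the fastest layer first and backfills
// faster layers when a slower one hits.
type Tiered struct {
	layers []Layer // ordered fastest to slowest
}

func (t *Tiered) Get(key string) ([]byte, error) {
	for i, l := range t.layers {
		val, err := l.Get(key)
		if err != nil {
			continue // miss in this tier; try the next one
		}
		for j := 0; j < i; j++ {
			t.layers[j].Set(key, val) // backfill so the next read hits L1
		}
		return val, nil
	}
	return nil, ErrCacheMiss
}

// mapLayer is a toy in-memory tier for demonstration.
type mapLayer map[string][]byte

func (m mapLayer) Get(key string) ([]byte, error) {
	if v, ok := m[key]; ok {
		return v, nil
	}
	return nil, ErrCacheMiss
}

func (m mapLayer) Set(key string, val []byte) { m[key] = val }

func main() {
	l1, l2 := mapLayer{}, mapLayer{"k": []byte("v")}
	c := &Tiered{layers: []Layer{l1, l2}}
	val, _ := c.Get("k")
	fmt.Println(string(val), len(l1)) // v 1 (backfilled into l1)
}
```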

In-process caching eliminated 40% of Redis calls. For our most-hit endpoints, cache hit rate was 85%.

Step 5: Async What You Can

Some work doesn’t need to happen in the request path:

// Before: send email synchronously (adds 100-500ms)
func (s *Service) CreateOrder(ctx context.Context, order Order) error {
    if err := s.repo.Create(ctx, order); err != nil {
        return err
    }
    return s.emailService.SendConfirmation(ctx, order) // Slow!
}

// After: publish event, handle email asynchronously
func (s *Service) CreateOrder(ctx context.Context, order Order) error {
    if err := s.repo.Create(ctx, order); err != nil {
        return err
    }
    s.events.Publish(ctx, OrderCreatedEvent{OrderID: order.ID})
    return nil // Return immediately
}

If the user doesn’t need to see the result in this response, don’t make them wait.
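What handles the published event? A sketch of the consumer side, using a buffered channel as a stand-in for the event bus (production code would want a durable queue and retries so a crash can't drop a confirmation email):

```go
package main

import (
	"fmt"
	"sync"
)

type OrderCreatedEvent struct{ OrderID string }

// startEmailWorker consumes events off the request path. send is the
// slow work (e.g. the email call) that no longer blocks CreateOrder.
func startEmailWorker(events <-chan OrderCreatedEvent, send func(OrderCreatedEvent)) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for ev := range events {
			send(ev) // retries and dead-lettering would go here
		}
	}()
	return &wg
}

func main() {
	events := make(chan OrderCreatedEvent, 16)
	var sent []string
	wg := startEmailWorker(events, func(ev OrderCreatedEvent) {
		sent = append(sent, ev.OrderID)
	})

	// Request path: publish and return immediately.
	events <- OrderCreatedEvent{OrderID: "order-123"}
	close(events)
	wg.Wait()
	fmt.Println(sent) // [order-123]
}
```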

Results

Metric   Before   After
P50      200ms    25ms
P99      800ms    95ms
P99.9    2.5s     200ms

The fixes weren’t exotic. Missing index, N+1 queries, sequential calls that should be parallel, missing cache, synchronous work that should be async. Boring fundamentals — dramatic results.