
Practical Go Performance Tuning in Real Production Systems


Performance tuning in Go isn’t about micro-benchmarks. It’s about understanding where your production service spends time and money. Here’s what actually moves the needle.

Start with Production Data

Before optimizing anything, know your baseline:

// Instrument your HTTP handlers
func instrumentedHandler(name string, h http.HandlerFunc) http.HandlerFunc {
    histogram := prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:        "http_request_duration_seconds",
        Help:        "Request duration in seconds",
        Buckets:     []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
        ConstLabels: prometheus.Labels{"handler": name},
    })
    prometheus.MustRegister(histogram) // NewHistogram does not register automatically

    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        h(w, r)
        histogram.Observe(time.Since(start).Seconds())
    }
}

Look at P99 latency, not averages. The average can look fine while 1% of users wait 10 seconds.
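To make that concrete, here is a small self-contained sketch with synthetic latencies (the numbers are made up): 99 requests at 50ms and one at 10 seconds. The mean looks tolerable while the p99 is catastrophic.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-quantile of sorted latencies using the
// upper nearest-rank convention.
func percentile(sorted []float64, p float64) float64 {
	idx := int(p * float64(len(sorted)))
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	// 99 requests at 50ms, one at 10s.
	lat := make([]float64, 100)
	for i := range lat {
		lat[i] = 0.05
	}
	lat[99] = 10.0
	sort.Float64s(lat)

	var sum float64
	for _, v := range lat {
		sum += v
	}
	fmt.Printf("avg=%.2fs p99=%.2fs\n", sum/float64(len(lat)), percentile(lat, 0.99))
}
```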

GC Tuning

Go’s garbage collector is concurrent and non-generational. It keeps pauses short, but it still causes latency spikes under allocation pressure. The key knob is GOGC:

# Default: GC runs when the heap grows 100% over the live set (GOGC=100)
# More memory, less frequent GC: GOGC=200
# Less memory, more frequent GC: GOGC=50
GOGC=200 ./myservice

Go 1.19+ has GOMEMLIMIT — a soft memory limit that’s usually better than GOGC:

# Use up to 4GB, GC more aggressively near the limit
GOMEMLIMIT=4GiB ./myservice

This prevents OOM kills while letting Go use available memory efficiently.
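Both knobs can also be set at runtime via runtime/debug (SetMemoryLimit requires Go 1.19+), which is handy when the limit comes from config rather than the environment. A sketch with illustrative values:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyGCSettings sets GOGC and the memory limit programmatically and
// returns the previous values.
func applyGCSettings(gogc int, limitBytes int64) (prevGOGC int, prevLimit int64) {
	prevGOGC = debug.SetGCPercent(gogc)          // same effect as GOGC
	prevLimit = debug.SetMemoryLimit(limitBytes) // same effect as GOMEMLIMIT
	return prevGOGC, prevLimit
}

func main() {
	prev, _ := applyGCSettings(200, 4<<30) // GOGC=200, 4GiB limit
	fmt.Println("previous GOGC:", prev)
}
```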

Monitor GC impact:

func trackGC() {
    var stats debug.GCStats
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        debug.ReadGCStats(&stats)
        if len(stats.Pause) > 0 {
            // Pause[0] is the most recent stop-the-world pause
            gcPauseMs.Set(float64(stats.Pause[0].Milliseconds()))
        }
        gcCount.Set(float64(stats.NumGC))
    }
}

Reduce Allocations

Every allocation is work for the GC. The biggest wins come from avoiding allocations in hot paths.

Use value types instead of pointers where possible:

// This allocates on the heap
func newUser(name string) *User {
    return &User{Name: name}
}

// This stays on the stack (if it doesn't escape)
func newUser(name string) User {
    return User{Name: name}
}

Use escape analysis to check:

go build -gcflags="-m" ./...
# Look for "escapes to heap"
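testing.AllocsPerRun is a quick way to verify that a call really is allocation-free; the User type here is a stand-in for whatever your hot path constructs:

```go
package main

import (
	"fmt"
	"testing"
)

type User struct{ Name string }

// Returning a value lets the compiler keep it on the caller's stack.
func newUser(name string) User { return User{Name: name} }

func main() {
	allocs := testing.AllocsPerRun(1000, func() {
		u := newUser("alice")
		_ = u
	})
	fmt.Println("allocs per run:", allocs) // 0 when nothing escapes
}
```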

Avoid interface{} in hot paths:

// Every value stored as interface{} requires allocation
cache := map[string]interface{}{}

// Type-specific map avoids interface boxing
cache := map[string]User{}
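When a hot path genuinely needs scratch buffers, sync.Pool lets you recycle them instead of allocating one per call. A minimal sketch with bytes.Buffer:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses bytes.Buffers across calls instead of allocating a
// fresh one each time.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // must reset before returning to the pool
		bufPool.Put(buf)
	}()
	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String()
}

func main() {
	fmt.Println(render("world"))
}
```

Note that the pool may drop idle buffers at any GC, so it helps steady-state hot paths, not long-lived caches.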

Efficient I/O

Use io.Reader/Writer interfaces to avoid buffering entire responses in memory:

// BAD: reads entire body into memory
body, _ := io.ReadAll(resp.Body)
var result Result
json.Unmarshal(body, &result)

// GOOD: streams directly
var result Result
json.NewDecoder(resp.Body).Decode(&result)

For file I/O, use buffered readers/writers:

file, err := os.Open("large-file.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
scanner := bufio.NewScanner(file)                // Scanner buffers internally
scanner.Buffer(make([]byte, 64*1024), 1024*1024) // raise the default 64KB line limit
for scanner.Scan() {
    processLine(scanner.Bytes()) // Bytes() avoids a string allocation
}

Connection Pool Sizing

Undersized pools cause goroutines to block waiting for connections. Oversized pools waste memory and can overwhelm backends.

Formula I use:

pool_size = (requests_per_second * avg_query_duration_seconds) * 1.5

For 1000 RPS with 5ms average query time:

pool_size = (1000 * 0.005) * 1.5 = 7.5 → 10 connections
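As a sketch, the formula is Little's law (concurrency = arrival rate × service time) plus 50% headroom; rounding up to the next multiple of 5 is my own convention, not anything standard:

```go
package main

import (
	"fmt"
	"math"
)

// poolSize estimates connections needed: Little's law plus 50% headroom,
// rounded up to the next multiple of 5.
func poolSize(rps, avgQuerySeconds float64) int {
	raw := rps * avgQuerySeconds * 1.5
	return int(math.Ceil(raw/5)) * 5
}

func main() {
	fmt.Println(poolSize(1000, 0.005)) // 1000 RPS × 5ms × 1.5 = 7.5 → 10
}
```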

Monitor how long callers wait for a connection. Note that pgx's AfterConnect hook fires when a new connection is established, not when one is acquired, so it can't measure wait time; pgxpool's Stat method exposes acquire metrics directly:

// Sample pool stats periodically (pgx v5)
stat := pool.Stat()
poolWaitTime.Set(stat.AcquireDuration().Seconds())       // cumulative time spent in Acquire
poolEmptyAcquires.Set(float64(stat.EmptyAcquireCount())) // acquires that had to wait

Benchmarks That Matter

Don’t benchmark sort.Slice for fun. Benchmark your actual hot paths:

func BenchmarkOrderCreation(b *testing.B) {
    db := setupTestDB(b)
    svc := NewOrderService(db)
    ctx := context.Background()

    b.ResetTimer()
    b.ReportAllocs()

    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            _, err := svc.CreateOrder(ctx, testOrder())
            if err != nil {
                b.Error(err) // b.Fatal must not be called from RunParallel goroutines
                return
            }
        }
    })
}

b.RunParallel simulates concurrent load — more realistic than sequential benchmarks.

Compare before and after with benchstat:

go test -run='^$' -bench=BenchmarkOrderCreation -benchmem -count=10 > old.txt
# ... make changes ...
go test -run='^$' -bench=BenchmarkOrderCreation -benchmem -count=10 > new.txt
benchstat old.txt new.txt

The 80/20 Rule

In my experience, 80% of performance gains come from:

  1. Database query optimization (indexes, batch reads, connection pooling)
  2. HTTP client connection reuse
  3. Reducing allocations in hot paths (sync.Pool, pre-allocation)
  4. Caching (in-memory for read-heavy data)

The other 20% — GC tuning, custom serializers, assembly optimizations — rarely matters unless you’re at extreme scale.

Profile first. Optimize what the data tells you. Resist the urge to optimize everything.