Practical Go Performance Tuning in Real Production Systems
Performance tuning in Go isn’t about micro-benchmarks. It’s about understanding where your production service spends time and money. Here’s what actually moves the needle.
Start with Production Data
Before optimizing anything, know your baseline:
// Instrument your HTTP handlers
func instrumentedHandler(name string, h http.HandlerFunc) http.HandlerFunc {
	histogram := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:        "http_request_duration_seconds",
		Help:        "Request duration",
		Buckets:     []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
		ConstLabels: prometheus.Labels{"handler": name},
	})
	prometheus.MustRegister(histogram) // register once, at handler construction
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		h(w, r)
		histogram.Observe(time.Since(start).Seconds())
	}
}
Look at P99 latency, not averages. The average can look fine while 1% of users wait 10 seconds.
GC Tuning
Go’s garbage collector is a concurrent, non-generational mark-and-sweep. Pauses are short, but GC work still burns CPU and can cause latency spikes. The key knob is GOGC:
# Default: GC runs when heap doubles (GOGC=100)
# More memory, less GC: (GOGC=200)
# Less memory, more GC: (GOGC=50)
GOGC=200 ./myservice
Go 1.19+ has GOMEMLIMIT — a soft memory limit that’s usually better than GOGC:
# Use up to 4GB, GC more aggressively near the limit
GOMEMLIMIT=4GiB ./myservice
This prevents OOM kills while letting Go use available memory efficiently.
Monitor GC impact:
func trackGC() {
	var stats debug.GCStats
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		debug.ReadGCStats(&stats)
		if len(stats.Pause) > 0 {
			// Pause[0] is the most recent GC pause
			gcPauseMs.Set(float64(stats.Pause[0].Milliseconds()))
		}
		gcCount.Set(float64(stats.NumGC))
	}
}
Reduce Allocations
Every allocation is work for the GC. The biggest wins come from avoiding allocations in hot paths.
Use value types instead of pointers where possible:
// Returning a pointer forces the User onto the heap
func newUserPtr(name string) *User {
	return &User{Name: name}
}

// Returning a value can stay on the stack (if it doesn't escape)
func newUser(name string) User {
	return User{Name: name}
}
Use escape analysis to check:
go build -gcflags="-m" ./...
# Look for "escapes to heap"
Avoid interface{} in hot paths:
// Storing a non-pointer value in an interface{} usually forces a heap allocation (boxing)
cache := map[string]interface{}{}
// Type-specific map avoids interface boxing
cache := map[string]User{}
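When a hot path genuinely needs a temporary buffer, sync.Pool lets requests reuse one instead of allocating fresh each time. A minimal sketch (`bufPool` and `render` are illustrative names):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable byte buffers so the hot path
// doesn't allocate a fresh buffer per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // always reset: the buffer may hold a previous caller's data
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher")) // prints "hello, gopher"
}
```

Note the Reset before use: pooled objects arrive in whatever state the last user left them.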
Efficient I/O
Use io.Reader/Writer interfaces to avoid buffering entire responses in memory:
// BAD: reads entire body into memory
body, _ := io.ReadAll(resp.Body)
var result Result
json.Unmarshal(body, &result)
// GOOD: streams directly
var result Result
json.NewDecoder(resp.Body).Decode(&result)
For file I/O, use buffered readers/writers:
file, err := os.Open("large-file.csv")
if err != nil {
	log.Fatal(err)
}
defer file.Close()

// Scanner buffers internally; size that buffer directly instead of
// stacking a Scanner on top of a separate bufio.Reader
scanner := bufio.NewScanner(file)
scanner.Buffer(make([]byte, 64*1024), 1024*1024) // 64KB initial, 1MB max line
for scanner.Scan() {
	processLine(scanner.Bytes()) // Bytes() avoids a string allocation per line
}
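The write side is symmetric: wrap the destination in a bufio.Writer and always Flush before closing. A sketch using a strings.Builder as a stand-in for a file (`writeRows` is an illustrative helper):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// writeRows funnels many small writes through a 64KB buffer, so a
// real file underneath would see a few large writes instead of one
// syscall per line.
func writeRows(n int) string {
	var sb strings.Builder
	w := bufio.NewWriterSize(&sb, 64*1024)
	for i := 0; i < n; i++ {
		fmt.Fprintf(w, "row,%d\n", i)
	}
	w.Flush() // without Flush, buffered bytes never reach the destination
	return sb.String()
}

func main() {
	fmt.Print(writeRows(2)) // prints "row,0" and "row,1" on separate lines
}
```

Forgetting Flush is the classic bug here: the program exits cleanly and the tail of the output is silently gone.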
Connection Pool Sizing
Undersized pools cause goroutines to block waiting for connections. Oversized pools waste memory and can overwhelm backends.
Formula I use:
pool_size = (requests_per_second * avg_query_duration_seconds) * 1.5
For 1000 RPS with 5ms average query time:
pool_size = (1000 * 0.005) * 1.5 = 7.5 → 10 connections
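That arithmetic is Little’s law (in-flight queries = arrival rate × duration) plus 50% headroom. As a tiny helper (`poolSize` is an illustrative name; math.Ceil gives 8, which the example above pads to a round 10):

```go
package main

import (
	"fmt"
	"math"
)

// poolSize applies the rule of thumb: expected in-flight queries
// (RPS x average query duration) plus 50% headroom, rounded up.
func poolSize(rps, avgQuerySeconds float64) int {
	return int(math.Ceil(rps * avgQuerySeconds * 1.5))
}

func main() {
	fmt.Println(poolSize(1000, 0.005)) // prints 8
}
```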
Monitor pool wait time:
// pgxpool tracks acquire statistics for you; export them periodically
stat := pool.Stat()
poolAcquireSeconds.Set(stat.AcquireDuration().Seconds()) // cumulative time callers spent waiting
poolEmptyAcquires.Set(float64(stat.EmptyAcquireCount())) // acquires that found the pool empty
Benchmarks That Matter
Don’t benchmark sort.Slice for fun. Benchmark your actual hot paths:
func BenchmarkOrderCreation(b *testing.B) {
	db := setupTestDB(b)
	svc := NewOrderService(db)
	ctx := context.Background()
	b.ResetTimer()
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_, err := svc.CreateOrder(ctx, testOrder())
			if err != nil {
				b.Error(err) // Fatal must not be called from RunParallel's worker goroutines
				return
			}
		}
	})
}
b.RunParallel simulates concurrent load — more realistic than sequential benchmarks.
Compare before and after with benchstat:
go test -bench=BenchmarkOrderCreation -count=10 > old.txt
# ... make changes ...
go test -bench=BenchmarkOrderCreation -count=10 > new.txt
benchstat old.txt new.txt
The 80/20 Rule
In my experience, 80% of performance gains come from:
- Database query optimization (indexes, batch reads, connection pooling)
- HTTP client connection reuse
- Reducing allocations in hot paths (sync.Pool, pre-allocation)
- Caching (in-memory for read-heavy data)
The other 20% — GC tuning, custom serializers, assembly optimizations — rarely matters unless you’re at extreme scale.
Profile first. Optimize what the data tells you. Resist the urge to optimize everything.