Production-Ready Logging and Observability in Go
You can’t debug what you can’t see. Observability isn’t a nice-to-have — it’s the difference between a 5-minute fix and a 5-hour investigation. Here’s the stack I use for every Go service.
Structured Logging with slog
Go 1.21’s slog package is now my default. No more third-party logging libraries.
func setupLogger(env string) *slog.Logger {
	var handler slog.Handler
	if env == "production" {
		handler = slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
			Level: slog.LevelInfo,
		})
	} else {
		handler = slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
			Level: slog.LevelDebug,
		})
	}
	return slog.New(handler)
}
JSON in production (machine-parseable), text in development (human-readable).
Request Context in Every Log
The most important pattern: every log line includes the request context.
func RequestLogger(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		correlationID := r.Header.Get("X-Correlation-ID")
		if correlationID == "" {
			correlationID = uuid.New().String()
		}
		// slog.With derives from the default logger, so call
		// slog.SetDefault(setupLogger(env)) once at startup.
		logger := slog.With(
			"correlation_id", correlationID,
			"method", r.Method,
			"path", r.URL.Path,
		)
		ctx := withLogger(r.Context(), logger)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Anywhere in the codebase:
func processOrder(ctx context.Context, order Order) error {
	log := loggerFromContext(ctx)
	log.Info("processing order", "order_id", order.ID)
	// ...
}
Every log line automatically includes correlation_id, method, and path. When debugging, filter by correlation_id to see the complete request flow.
Metrics with Prometheus
Four golden signals: latency, traffic, errors, saturation.
var (
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
		},
		[]string{"method", "path", "status"},
	)
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "http_requests_total"},
		[]string{"method", "path", "status"},
	)
	dbQueryDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "db_query_duration_seconds",
			Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1},
		},
		[]string{"query"},
	)
	dbConnectionsInUse = prometheus.NewGauge(
		prometheus.GaugeOpts{Name: "db_connections_in_use"},
	)
)
func MetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		wrapped := &statusRecorder{ResponseWriter: w, statusCode: http.StatusOK}
		next.ServeHTTP(wrapped, r)

		status := strconv.Itoa(wrapped.statusCode)
		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path, status).Observe(duration)
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, status).Inc()
	})
}
Distributed Tracing
For microservices, traces show the complete journey of a request:
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer(ctx context.Context, serviceName string) (func(), error) {
	exporter, err := otlptracehttp.New(ctx)
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName(serviceName),
		)),
	)
	otel.SetTracerProvider(tp)
	return func() { _ = tp.Shutdown(context.Background()) }, nil
}
Add spans to your service calls:
var tracer = otel.Tracer("order-service")

func (s *OrderService) Create(ctx context.Context, order Order) (*Order, error) {
	ctx, span := tracer.Start(ctx, "OrderService.Create")
	defer span.End()

	span.SetAttributes(
		attribute.String("customer_id", order.CustomerID),
		attribute.Int("item_count", len(order.Items)),
	)
	// Downstream calls automatically create child spans.
	if err := s.inventory.Reserve(ctx, order.Items); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, err
	}
	return s.repo.Create(ctx, order)
}
Alerting Rules
Metrics are useless without alerts. The essentials:
# High error rate
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
# High latency
- alert: HighLatency
expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
for: 5m
# Database connection pool exhaustion
- alert: DBPoolExhausted
expr: db_connections_in_use / db_connections_max > 0.9
for: 2m
The Three Pillars Together
A single request should be traceable through all three:
- Logs: detailed event-by-event record, filterable by correlation_id
- Metrics: aggregate trends — is error rate increasing? Is latency growing?
- Traces: visual timeline of a request across services
User request → API Gateway → Order Service → Payment Service → DB
      │             │              │                │           │
      ├── trace_id: abc-123 links all spans together
      ├── correlation_id: abc-123 in every log line
      └── http_request_duration_seconds metric updated at each service
When an alert fires (metrics), you look at traces to identify which service is slow, then look at logs for that service to find the specific error. The three pillars work together.
Observability is the most important infrastructure you’ll build. Everything else — debugging, performance optimization, capacity planning — depends on it.