Understanding Go Scheduler Behavior in High-Concurrency Systems
Go makes concurrency easy. go func() and you’re done. But when your service runs 100K+ concurrent goroutines, understanding the scheduler becomes critical for diagnosing latency issues and GC pauses.
The GMP Model
Go’s scheduler has three concepts:
- G (Goroutine): a lightweight thread of execution
- M (Machine): an OS thread
- P (Processor): a scheduling context; the number of Ps is capped by GOMAXPROCS
The relationship: each P has a local run queue of Gs and is attached to at most one M at a time. When a G blocks in a syscall, its M blocks with it; the P detaches and attaches to another (possibly new) M so the remaining Gs keep running. Network I/O is cheaper: the G parks in the netpoller and the M stays free for other work.
P0 ─── M0 ─── [G1, G2, G3] (local run queue)
P1 ─── M1 ─── [G4, G5]
P2 ─── M2 ─── [G6, G7, G8, G9]
[G10, G11, ...] (global run queue)
GOMAXPROCS
GOMAXPROCS sets the number of Ps — the maximum number of goroutines executing Go code simultaneously. It defaults to the number of CPU cores (runtime.NumCPU()).
In containers, this can be wrong:
// In a container with a 2-CPU quota on a 64-core host,
// runtime.NumCPU() still returns 64.
import "go.uber.org/automaxprocs/maxprocs"

func main() {
	// Set GOMAXPROCS from the container's CPU quota.
	// (Set returns an undo func and an error, ignored here.)
	maxprocs.Set()
}
Without this, your Go service thinks it has 64 cores, creates 64 Ps, and gets throttled by the container’s CPU limit — causing erratic latency.
Work Stealing
When a P's local queue is empty, it doesn't sit idle; it looks for runnable Gs in roughly this order:
- Check local run queue
- Check global run queue
- Check network poller (for goroutines unblocked by I/O)
- Steal from another P’s local queue
This is why Go scales well across cores — idle Ps don’t stay idle.
But it also means goroutine scheduling isn’t perfectly fair. A goroutine on a busy P’s queue might wait longer than one on an idle P’s queue.
Preemption
Before Go 1.14, goroutines were only preempted at function calls. A tight loop without function calls could starve other goroutines:
// Pre-1.14: this loop never yields — no function calls, no preemption points
go func() {
	x := 1
	for {
		// tight computation loop
		x = x*2 + 1
	}
}()
Go 1.14+ uses asynchronous preemption via signals. The runtime can interrupt any goroutine, even in tight loops. But this introduces overhead — each preemption point involves signal handling.
What Happens with 1M Goroutines
Each goroutine starts with a 2KB stack (grows as needed). A million goroutines = ~2GB of stack memory alone.
import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1_000_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(time.Minute)
		}()
	}
	wg.Wait()
}
This works — Go handles it. But the scheduler overhead becomes significant:
- Scheduling latency: with 1M goroutines and 8 Ps, each P manages ~125K goroutines. Context switching overhead adds up.
- GC pauses: the GC must scan all goroutine stacks. More goroutines = longer GC pauses.
- Memory: stacks + scheduler metadata. Even sleeping goroutines consume memory.
Practical Limits
In my experience:
- < 10K goroutines: no issues, don’t think about it
- 10K-100K goroutines: monitor GC pauses and scheduling latency
- 100K+ goroutines: consider whether you actually need that many, or if a worker pool pattern is better
// Instead of 1M goroutines
for _, task := range tasks {
	go process(task)
}

// Use a bounded semaphore: at most 1000 tasks in flight
pool := make(chan struct{}, 1000)
for _, task := range tasks {
	pool <- struct{}{} // acquire a slot; blocks when 1000 are busy
	go func(t Task) {
		defer func() { <-pool }() // release the slot
		process(t)
	}(task)
}
// Note: pair this with a sync.WaitGroup if you need to wait for the last workers.
Diagnosing Scheduler Issues
1. Schedule latency histogram:
// scheduleLatency is assumed to be a metrics histogram (e.g. Prometheus)
// defined elsewhere.
func measureScheduleLatency() {
	go func() {
		for {
			start := time.Now()
			runtime.Gosched() // yield; measure how long until we're rescheduled
			latency := time.Since(start)
			scheduleLatency.Observe(latency.Seconds())
			time.Sleep(100 * time.Millisecond)
		}
	}()
}
If Gosched() consistently takes more than a few microseconds to return, your Ps are overloaded.
2. Goroutine count over time:
goroutineGauge.Set(float64(runtime.NumGoroutine()))
If this trends upward, you have a goroutine leak.
3. Execution tracer:
# requires net/http/pprof to be imported in your binary
curl "http://localhost:6060/debug/pprof/trace?seconds=5" > trace.out
go tool trace trace.out
The trace viewer shows exactly what each P is doing — scheduling, GC, blocking, running. It’s the most powerful tool for understanding scheduler behavior.
Key Takeaways
- Set GOMAXPROCS correctly in containers
- Use worker pools instead of unbounded goroutine spawning
- Monitor goroutine count — leaks are silent killers
- GC pauses scale with goroutine count — keep stacks shallow
- Use go tool trace for deep scheduler analysis
- The scheduler is good, but it’s not magic — give it reasonable workloads
Understanding the scheduler doesn’t change how you write most code. But when something is slow and you can’t figure out why, this knowledge is the difference between a 5-minute fix and a 5-day investigation.