Understanding Go Scheduler Behavior in High-Concurrency Systems
Go makes concurrency easy. go func() and you’re done. But when your service runs 100K+ concurrent goroutines, understanding the scheduler becomes critical for diagnosing latency issues and GC pauses.
The GMP Model
Go’s scheduler has three concepts:
- G (Goroutine): a lightweight thread of execution
- M (Machine): an OS thread
- P (Processor): a scheduling context; the number of Ps is capped by GOMAXPROCS
The relationship: each P has a local run queue of Gs and is attached to at most one M at a time. When a G blocks in a syscall, its M blocks with it; the P detaches and attaches to another (possibly new) M so the remaining Gs keep running. Network I/O is cheaper: the G parks in the netpoller and the M stays free for other work.
P0 ─── M0 ─── [G1, G2, G3] (local run queue)
P1 ─── M1 ─── [G4, G5]
P2 ─── M2 ─── [G6, G7, G8, G9]
[G10, G11, ...] (global run queue)
GOMAXPROCS
GOMAXPROCS sets the number of Ps — the maximum number of goroutines executing Go code simultaneously. It defaults to the number of CPU cores (runtime.NumCPU()).
In containers, this can be wrong:
// In a container with a 2-CPU quota on a 64-core host,
// runtime.NumCPU() still returns 64.
import "go.uber.org/automaxprocs/maxprocs"

func main() {
	// Set GOMAXPROCS from the container's CPU quota.
	// (Set returns an undo func and an error, ignored here.)
	maxprocs.Set()
}
Without this, your Go service thinks it has 64 cores, creates 64 Ps, and gets throttled by the container’s CPU limit — causing erratic latency.
Work Stealing
When a P's local queue is empty, it doesn't sit idle; it looks for runnable Gs in roughly this order:
- Check local run queue
- Check global run queue
- Check network poller (for goroutines unblocked by I/O)
- Steal from another P’s local queue
This is why Go scales well across cores — idle Ps don’t stay idle.
But it also means goroutine scheduling isn’t perfectly fair. A goroutine on a busy P’s queue might wait longer than one on an idle P’s queue.
Preemption
Before Go 1.14, goroutines were only preempted at function calls. A tight loop without function calls could starve other goroutines:
// Pre-1.14: this loop never yields — no function calls, no preemption points
go func() {
	x := 1
	for {
		// tight computation loop
		x = x*2 + 1
	}
}()
Go 1.14+ uses asynchronous preemption via signals. The runtime can interrupt any goroutine, even in tight loops. But this introduces overhead — each preemption point involves signal handling.
What Happens with 1M Goroutines
Each goroutine starts with a 2KB stack (grows as needed). A million goroutines = ~2GB of stack memory alone.
import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1_000_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(time.Minute)
		}()
	}
	wg.Wait()
}
This works — Go handles it. But the scheduler overhead becomes significant:
- Scheduling latency: with 1M goroutines and 8 Ps, each P manages ~125K goroutines. Context switching overhead adds up.
- GC pauses: the GC must scan all goroutine stacks. More goroutines = longer GC pauses.
- Memory: stacks + scheduler metadata. Even sleeping goroutines consume memory.
Practical Limits
In my experience:
- < 10K goroutines: no issues, don’t think about it
- 10K-100K goroutines: monitor GC pauses and scheduling latency
- 100K+ goroutines: consider whether you actually need that many, or if a worker pool pattern is better
// Instead of 1M goroutines
for _, task := range tasks {
	go process(task)
}

// Use a bounded semaphore: at most 1000 tasks in flight
pool := make(chan struct{}, 1000)
for _, task := range tasks {
	pool <- struct{}{} // acquire a slot; blocks when 1000 are busy
	go func(t Task) {
		defer func() { <-pool }() // release the slot
		process(t)
	}(task)
}
// Note: pair this with a sync.WaitGroup if you need to wait for the last workers.
Diagnosing Scheduler Issues
1. Schedule latency histogram:
// scheduleLatency is assumed to be a metrics histogram (e.g. Prometheus)
// defined elsewhere.
func measureScheduleLatency() {
	go func() {
		for {
			start := time.Now()
			runtime.Gosched() // yield; measure how long until we're rescheduled
			latency := time.Since(start)
			scheduleLatency.Observe(latency.Seconds())
			time.Sleep(100 * time.Millisecond)
		}
	}()
}
If Gosched() consistently takes more than a few microseconds to return, your Ps are overloaded.
2. Goroutine count over time:
goroutineGauge.Set(float64(runtime.NumGoroutine()))
If this trends upward, you have a goroutine leak.
3. Execution tracer:
# requires net/http/pprof to be imported in your binary
curl "http://localhost:6060/debug/pprof/trace?seconds=5" > trace.out
go tool trace trace.out
The trace viewer shows exactly what each P is doing — scheduling, GC, blocking, running. It’s the most powerful tool for understanding scheduler behavior.
Key Takeaways
- Set GOMAXPROCS correctly in containers
- Use worker pools instead of unbounded goroutine spawning
- Monitor goroutine count — leaks are silent killers
- GC pauses scale with goroutine count — keep stacks shallow
- Use go tool trace for deep scheduler analysis
- The scheduler is good, but it’s not magic — give it reasonable workloads
Understanding the scheduler doesn’t change how you write most code. But when something is slow and you can’t figure out why, this knowledge is the difference between a 5-minute fix and a 5-day investigation.