What 4+ Years of Backend Engineering Taught Me About System Design
System design isn’t about knowing which database to pick. It’s about understanding trade-offs deeply enough to make decisions you won’t regret in six months.
After 4+ years of building and maintaining production systems, here are the principles I keep coming back to.
Start with the Data Model
The data model is the most important decision in any system. Get it wrong and you’ll fight it forever. Get it right and everything else becomes easier.
Before writing any code, I ask:
- What are the entities?
- What are the relationships?
- What are the access patterns?
- What are the consistency requirements?
If your access patterns don’t match your data model, no amount of caching or clever queries will save you.
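As a concrete illustration of answering those four questions before writing code, here is a minimal sketch. The order-system entities, field names, and indexes are hypothetical, not from any real system:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical entities for an order system, written down *before* any code:
# the access patterns drive the schema, not the other way around.

@dataclass
class Customer:
    id: int
    email: str  # access pattern: look up customer by email -> needs an index

@dataclass
class Order:
    id: int
    customer_id: int      # relationship: Order belongs to Customer
    status: str           # access pattern: "all pending orders" -> index on status
    created_at: datetime  # access pattern: "orders in the last 24h" -> index on created_at

# Consistency requirement: an Order must never reference a missing Customer,
# so customer_id would be a foreign key, not a denormalized copy.
```

The point of the exercise is that every field annotation above answers one of the four questions; a field with no access pattern attached is a field worth questioning.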
Design for the 99th Percentile, Not the Average
Average latency is misleading. If your P50 is 20ms but your P99 is 2 seconds, 1 in every 100 requests takes two full seconds — and because a single session fires off many requests, most users will hit that tail eventually. At scale, that’s thousands of people.
Design decisions that matter for P99:
- Connection pooling (avoid cold connection overhead)
- Timeout budgets (don’t let slow calls cascade)
- Async processing (move slow work out of the request path)
- Read replicas (distribute read load)
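The timeout-budget idea in particular is easy to sketch: each downstream call gets at most what remains of an end-to-end budget, so one slow dependency can't drag the whole request into the tail. The 500 ms budget and the per-call caps below are illustrative values, and the commented-out `db`/`cache` calls are hypothetical:

```python
import time

class Deadline:
    """Tracks how much of a request's end-to-end time budget remains."""

    def __init__(self, budget_s: float):
        self.expires = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires - time.monotonic())

def handle_request():
    deadline = Deadline(0.5)  # 500 ms end-to-end budget (illustrative)

    # Each call is capped by both its own limit and the remaining budget,
    # so a slow earlier call shrinks the allowance for later ones.
    db_timeout = min(0.2, deadline.remaining())
    # rows = db.query(..., timeout=db_timeout)      # hypothetical call

    cache_timeout = min(0.05, deadline.remaining())
    # value = cache.get(..., timeout=cache_timeout)  # hypothetical call

    return db_timeout, cache_timeout
```

Propagating the deadline this way is what stops slow calls from cascading: a request that has already burned its budget fails fast instead of queueing behind another timeout.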
The CAP Theorem Is About Trade-Offs, Not Choices
CAP isn’t “pick 2 out of 3.” It’s “when a network partition happens, do you favor consistency or availability?”
In practice:
- User-facing reads: favor availability (serve stale data over an error)
- Financial transactions: favor consistency (better to reject than double-charge)
- Internal service communication: it depends on the use case
Most systems need different consistency levels for different operations. Don’t pick one globally.
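One way to picture per-operation consistency choices is two functions in the same service making opposite calls during a partition. The `replica`, `stale_cache`, and `primary` objects below are hypothetical stand-ins for real clients:

```python
def read_profile(user_id, replica, stale_cache):
    """User-facing read: favor availability — serve stale data over an error."""
    try:
        return replica.get(user_id)
    except ConnectionError:
        return stale_cache.get(user_id)  # possibly stale, but not a 500

def charge_card(payment, primary):
    """Financial write: favor consistency — reject rather than risk a double charge."""
    try:
        return primary.execute(payment)  # must hit the consistent primary
    except ConnectionError:
        # No fallback path: an explicit rejection is the safe outcome here.
        raise RuntimeError("payment rejected: primary unavailable")
```

Same service, same partition, two deliberate and opposite trade-offs.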
Queues Are the Most Underused Architecture Pattern
Every time I see a synchronous call chain that should be async, I add a queue. Queues solve:
- Decoupling: producer and consumer don’t need to be available simultaneously
- Buffering: absorb traffic spikes without backpressure to the user
- Retry: failed operations get retried without user involvement
- Rate limiting: consumers process at their own pace
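A toy in-process sketch of the decoupling-plus-retry behavior, using the standard library's `queue.Queue`. In production this would be a real broker (SQS, RabbitMQ, Kafka), and the `max_attempts` value is illustrative:

```python
import queue

jobs = queue.Queue()

def produce(payload):
    # Producer returns immediately; it never waits on the consumer.
    jobs.put({"payload": payload, "attempts": 0})

def consume(handler, max_attempts=3):
    # Consumer drains at its own pace; failed jobs are re-queued and retried
    # without any user involvement.
    while not jobs.empty():
        job = jobs.get()
        try:
            handler(job["payload"])
        except Exception:
            job["attempts"] += 1
            if job["attempts"] < max_attempts:
                jobs.put(job)  # retry later; drop (or dead-letter) after the limit
```

Even this toy version shows all four properties: the producer and consumer never block each other, the queue buffers whatever the producer pushes, transient failures retry automatically, and the consumer controls its own throughput.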
The first time a deployment goes smoothly because the queue absorbed the traffic spike while the new version warmed up — you’ll be converted.
Caching Strategy Matters More Than Cache Technology
Redis vs Memcached vs in-process cache is a secondary decision. The caching strategy is what matters:
- Cache-aside: application manages the cache. Simple, most common.
- Write-through: writes go to cache and database. Consistent but slower writes.
- Write-behind: writes go to cache, async flush to database. Fast writes, risk of data loss.
And the invalidation strategy:
- TTL-based: simple, eventual consistency. Good enough for most cases.
- Event-based: publish cache invalidation events. More complex, faster consistency.
- Version-based: include a version number. Cache checks if version is current.
I default to cache-aside with TTL. It’s simple and handles 90% of caching needs.
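That default fits in a few lines. A minimal cache-aside-with-TTL sketch, where `load_from_db` is a hypothetical loader and the 30-second TTL is illustrative:

```python
import time

_cache: dict = {}  # key -> (value, expires_at); stands in for Redis etc.

def get_with_cache(key, load_from_db, ttl_s=30.0):
    hit = _cache.get(key)
    if hit is not None and hit[1] > time.monotonic():
        return hit[0]  # fresh hit: skip the database entirely

    # Cache-aside: on a miss (or expiry), the *application* loads the value
    # and fills the cache. Invalidation is just the TTL running out.
    value = load_from_db(key)
    _cache[key] = (value, time.monotonic() + ttl_s)
    return value
```

The eventual-consistency window is exactly the TTL: a write becomes visible to cached readers at most `ttl_s` seconds later, which is the "good enough for most cases" trade-off named above.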
Idempotency Is Not Optional
Every external-facing endpoint should be idempotent. Every message consumer should be idempotent. Every batch job should be idempotent.
This isn’t paranoia. It’s engineering for reality. Networks fail, messages duplicate, users double-click, deployments restart mid-operation.
The cost of making something idempotent is small. The cost of a non-idempotent operation failing is potentially catastrophic.
Observability Before Features
I’ve never regretted investing in observability early. I’ve often regretted not having it when something broke.
Before building the second feature, set up:
- Structured logging with request correlation
- Latency histograms for every endpoint
- Error rate monitoring with alerts
- Database query performance tracking
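The first item on that list, structured logging with request correlation, can be sketched with nothing but the standard library. The field names and the `uuid` correlation scheme here are illustrative choices, not a prescribed format:

```python
import json
import logging
import uuid

logger = logging.getLogger("app")

def log_event(request_id: str, event: str, **fields) -> str:
    # One JSON object per line: machine-parseable, and every line carries the
    # request_id so a whole request can be reconstructed across services.
    line = json.dumps({"request_id": request_id, "event": event, **fields})
    logger.info(line)
    return line  # returned only so the example is easy to inspect

# The id is generated once at the edge and passed to every downstream call.
request_id = str(uuid.uuid4())
log_event(request_id, "db.query", table="orders", duration_ms=12)
```

Grepping one `request_id` across services is the "minutes instead of hours" part: the correlation id turns scattered log lines into a single timeline.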
You’ll ship features slower for one sprint. You’ll ship features faster for every sprint after that because debugging takes minutes instead of hours.
Complexity Is the Real Enemy
Every line of code is a liability. Every service is an operational burden. Every dependency is a failure point.
Questions I ask before adding complexity:
- Can we solve this with the tools we already have?
- What’s the operational cost of this new component?
- Who will debug this at 3 AM?
- Can we delete this in 6 months if it doesn’t work out?
The best systems I’ve worked on are simple. Not simplistic — they handle real complexity. But the accidental complexity is minimal.
Migrations Are Harder Than Greenfield
Building a new system is fun. Migrating an existing system to a new architecture while serving live traffic is engineering.
The expand-contract pattern is the safest approach:
- Add new alongside old
- Gradually migrate traffic
- Remove old when confident
Every step is reversible. No big-bang cutover. Boring, systematic, and safe.
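The "gradually migrate traffic" step is often a deterministic percentage split. A sketch, assuming hash-based bucketing; the 10% rollout value is illustrative:

```python
import hashlib

ROLLOUT_PERCENT = 10  # dial up gradually; set to 0 to roll back instantly

def use_new_path(user_id: str) -> bool:
    # Hash the user id so each user consistently lands on the same side
    # for the whole migration, instead of flip-flopping between systems.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

def handle(user_id: str) -> str:
    # Expand phase: both paths stay live; the flag only chooses between them.
    return "new" if use_new_path(user_id) else "old"
```

Because the split is a single number, every step stays reversible: ramp `ROLLOUT_PERCENT` up as confidence grows, drop it back to zero the moment something looks wrong, and only delete the old path (the contract phase) once it has served no traffic for a while.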
Document Decisions, Not Systems
Systems change. Documentation rots. But the reasoning behind decisions stays relevant.
I write ADRs (Architecture Decision Records) for every significant decision:
- What was the decision?
- What alternatives did we consider?
- Why did we choose this option?
- What are the trade-offs we accepted?
When a new engineer asks “why is it built this way?” — the ADR has the answer, with full context from the time the decision was made.
These principles aren’t original. They’ve been written about extensively. But there’s a gap between reading about trade-offs and feeling them in production. The gap closes with experience, incidents, and late-night debugging sessions.
The best system designers I know aren’t the ones who memorize architecture patterns. They’re the ones who’ve been burned enough times to develop intuition for what will hurt later.