Scalability in System Design: Why Your App Survives 10 Users but Dies at 10 Million

Series: System Design Mastery, Day 1 of 15

Reading time: 10 min

Covers: Vertical vs Horizontal Scaling, Auto-Scaling, Stateless Design

The Day Instagram Almost Died

In April 2012, Instagram had 13 employees and about 30 million users. Then they launched on Android.

In 24 hours, they got 1 million new signups.

Their servers didn't die. Not because they were lucky — but because they had already solved the hardest question in system design:

"What happens when 100x more people show up tomorrow?"

That question is called scalability. And today, we're going to understand it the way senior engineers do — not just the definition, but the why, the when, and what actually breaks.

What Is Scalability? (In Plain English)

Scalability is your system's ability to handle growing load without breaking or slowing down.

Load can mean:

More users (1K → 1M)

More data (GB → PB)

More requests per second (100 QPS → 100,000 QPS)

A scalable system doesn't just survive growth — it handles it gracefully, ideally without you waking up at 3am.

The Two Paths: Vertical vs Horizontal Scaling

Imagine you run a restaurant. On a busy Friday night, you have two options:

Option A: Hire one superhuman chef who can cook 10x faster.

Option B: Hire 10 regular chefs and divide the work.

That's the exact difference between vertical and horizontal scaling.

Vertical Scaling (Scale Up)

What it is: Make your existing server bigger — more CPU, more RAM, faster disk.

Before: [Server: 4 CPU, 16GB RAM]

After: [Server: 32 CPU, 256GB RAM]

When it works: Early stage. Simple systems. Databases that are hard to distribute (like PostgreSQL). The early Instagram ran on a few beefy EC2 instances — vertical scaling bought them time.

Why it hits a wall:

There's a physical limit to how big one machine can get.

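It's expensive: at the high end, price grows faster than capacity, so doubling a server's RAM can cost far more than double.

Single point of failure: if that one big server dies, everything dies.

Upgrades usually mean downtime, because the one machine has to be stopped to be resized.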

The ceiling is real. You can scale vertically to a point, but you will hit it.

Horizontal Scaling (Scale Out)

What it is: Add more servers and spread the load across them.

Before: [Server 1]

After: [Server 1] [Server 2] [Server 3] ... [Server N]

A Load Balancer sits in front and distributes incoming requests across all servers.
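To make "distributes incoming requests" concrete, here is a minimal sketch of round-robin, one of the simplest distribution strategies. The class name and server names are hypothetical, and real load balancers also do health checks, weighting, and connection tracking, none of which is modelled here.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend servers in a fixed rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._rotation = cycle(self._servers)

    def pick(self):
        # Each call returns the next server in the rotation.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [balancer.pick() for _ in range(6)]
print(assignments)
```

Six requests spread over three servers means each server gets exactly two, which is the whole point: no single machine sees all the traffic.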

Why Netflix uses horizontal scaling: Netflix runs on tens of thousands of servers. When they add capacity, they don't upgrade existing machines — they spin up new ones. When load drops, they terminate them. This is elastic, cost-efficient, and there's no theoretical ceiling.

The catch: Horizontal scaling introduces complexity. If a user logs in on Server 1, their session is on Server 1. If the next request goes to Server 3, Server 3 knows nothing about them. This is the problem that stateless design solves, and we'll get to it in a moment.

Side-by-Side Comparison

	Vertical Scaling	Horizontal Scaling
How	Bigger server	More servers
Cost	Superlinear at the high end	Roughly linear
Ceiling	Hard limit	Virtually none
Failure risk	High (single server)	Low (distributed)
Complexity	Low	Higher
Best for	Databases, early stage	APIs, stateless services

The Real Unlock: Stateless Design

Here's the architectural insight that makes horizontal scaling actually work.

Stateful server (problem): The server remembers things about you between requests. Your session, your cart, your login state — it's all stored in the server's memory.

User → Request 1 → Server A (stores session)

User → Request 2 → Server B (no session = logged out!)

Stateless server (solution): The server remembers nothing. All state lives outside the server: in a shared database, a Redis cache, or a JWT that the client carries.

User → Request 1 → Server A (reads session from Redis ✓)

User → Request 2 → Server B (reads session from Redis ✓)

User → Request 3 → Server C (reads session from Redis ✓)

Now any server can handle any request. You can add 100 more servers tomorrow and they all work immediately — because none of them hold state.
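The flow above can be sketched in a few lines. This is an illustration under stated assumptions, not a production pattern: `SessionStore` is a hypothetical in-memory stand-in for a shared store such as Redis, and `AppServer` is a made-up class representing one app instance.

```python
import uuid


class SessionStore:
    """Stand-in for a shared store such as Redis; lives outside every app server."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = session

    def load(self, session_id):
        return self._data.get(session_id)


class AppServer:
    """Keeps no per-user state of its own; everything goes through the store."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def login(self, username):
        session_id = uuid.uuid4().hex
        self._store.save(session_id, {"user": username})
        return session_id

    def whoami(self, session_id):
        session = self._store.load(session_id)
        if session is None:
            return f"{self.name}: not logged in"
        return f"{self.name}: hello {session['user']}"


store = SessionStore()
server_a = AppServer("A", store)
server_b = AppServer("B", store)

sid = server_a.login("alice")
print(server_b.whoami(sid))        # request lands on a different server
print(server_b.whoami("unknown"))  # no such session in the store
```

The user logs in on server A, yet server B recognises them, because neither server holds the session itself. Adding a server C would need no changes beyond handing it the same store.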

This is the architectural principle behind every horizontally scalable system.

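AWS Lambda functions, Kubernetes pods, and Docker containers are all built to be disposable: they can be started, stopped, and replaced at any time, which works best when the services inside them are stateless.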

Auto-Scaling: Letting the System Manage Itself

Manual scaling (someone clicking "add server" at 2am) doesn't work at scale. The answer is auto-scaling — the system detects load and adjusts itself automatically.

How it works (AWS Auto Scaling Group example):

CPU > 70% for 5 minutes → Add 2 servers

CPU < 30% for 10 minutes → Remove 1 server
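The two rules above can be read as a small decision function. The sketch below is an assumption-laden illustration, not how AWS implements it: it takes one CPU reading per minute, and the `min_servers` and `max_servers` bounds are hypothetical values added so the fleet never shrinks to zero or grows without limit.

```python
def scaling_decision(cpu_samples, current_servers, min_servers=2, max_servers=20):
    """Return the new server count given per-minute CPU readings (oldest first).

    Mirrors the two rules above: above 70% for 5 minutes adds 2 servers,
    below 30% for 10 minutes removes 1.
    """
    last_5 = cpu_samples[-5:]
    last_10 = cpu_samples[-10:]

    if len(last_5) == 5 and all(cpu > 70 for cpu in last_5):
        return min(current_servers + 2, max_servers)
    if len(last_10) == 10 and all(cpu < 30 for cpu in last_10):
        return max(current_servers - 1, min_servers)
    return current_servers


print(scaling_decision([50, 72, 75, 80, 90, 85], current_servers=4))  # sustained high load
print(scaling_decision([20] * 10, current_servers=4))                # sustained low load
print(scaling_decision([20, 80, 20, 80, 20], current_servers=4))     # noisy, no change
```

Note the asymmetry: scaling out is fast and generous, scaling in is slow and cautious. Requiring a sustained window on both sides keeps a single noisy reading from adding or removing servers.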

Kubernetes Horizontal Pod Autoscaler (HPA):

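# When average CPU utilization crosses 60%, spin up more pods
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  # minReplicas and maxReplicas are illustrative values, not from a real deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60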
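How does the HPA decide how many pods to run? The Kubernetes documentation describes the core calculation as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with scaling skipped when the ratio is close enough to 1 (a default tolerance of 0.1). The sketch below reproduces only that arithmetic, using the 60% target from the manifest above; the function name is hypothetical, and the real controller also applies min/max bounds, stabilization windows, and pod readiness checks.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=60,
                     tolerance=0.1):
    """ceil(currentReplicas * currentMetric / targetMetric), skipped when the
    ratio is within the tolerance band around 1.0."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 90))  # 4 * 90/60 = 6.0
print(desired_replicas(4, 62))  # ratio ~1.03, inside tolerance
print(desired_replicas(4, 30))  # 4 * 30/60 = 2.0
```

The formula is proportional: if pods are running 50% hotter than the target, the controller asks for 50% more pods.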

Real example: Netflix uses auto-scaling aggressively. On Sunday evenings (peak streaming time), their infrastructure automatically scales up. By 3am, it scales back down. They're not paying for idle servers — and nobody is manually managing this.

The 3 Failure Modes Nobody Talks About

Knowing what breaks makes you a better designer than knowing what works.

Stateful servers under horizontal scale: You add servers but sessions break. Users get randomly logged out. Fix: externalize all state.

Database becomes the bottleneck: You scaled your app servers 10x. Now 10x the queries are hitting one database. The DB becomes the ceiling. Fix: read replicas, caching, sharding (Day 3 topics).

Premature horizontal scaling: A startup with 500 users adds a load balancer, 3 app servers, and Redis for sessions. Now they have 5x the infrastructure to maintain and debug. Fix: start vertical, switch to horizontal when you actually feel the pain.

Instagram's early lesson: They scaled vertically first (bigger servers) then horizontally (more servers). They didn't start distributed — they evolved to it.

Interview Scenario: "Estimate Servers for 1 Million DAU"

This is a real interview question. Here's how to answer it like a senior engineer:

Given: 1 million Daily Active Users (DAU)

Step 1: Calculate QPS

Assume each user makes 10 requests/day

Total requests/day = 1M × 10 = 10M requests/day

Average QPS = 10M ÷ 86,400 seconds ≈ 116 QPS

Peak QPS = 116 × 3 (peak multiplier) ≈ 350 QPS

Step 2: Estimate server capacity

A typical server handles ~500-1000 simple requests/sec

For 350 peak QPS → 1 server is enough

But: add redundancy (minimum 2), so → 2 servers
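The arithmetic in Steps 1 and 2 is easy to turn into a reusable estimate. This is a back-of-the-envelope sketch: the function and parameter names are hypothetical, and the inputs are the same assumptions used above (10 requests per user per day, a 3x peak multiplier, and the conservative end of the 500-1000 requests/sec range).

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_servers(dau, requests_per_user, peak_multiplier, qps_per_server,
                     min_servers=2):
    """Return (average QPS, peak QPS, server count including redundancy)."""
    average_qps = dau * requests_per_user / SECONDS_PER_DAY
    peak_qps = average_qps * peak_multiplier
    needed_for_load = math.ceil(peak_qps / qps_per_server)
    # Never go below the redundancy floor, even if one server could carry the load.
    return average_qps, peak_qps, max(needed_for_load, min_servers)


avg, peak, servers = estimate_servers(
    dau=1_000_000,
    requests_per_user=10,
    peak_multiplier=3,
    qps_per_server=500,  # conservative end of the 500-1000 range
)
print(f"average ~{avg:.0f} QPS, peak ~{peak:.0f} QPS, servers: {servers}")
```

The unrounded peak comes out near 347 QPS; the 350 figure above comes from rounding the average to 116 first. Either way the conclusion holds: load alone needs one server, and redundancy makes it two.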

Step 3: Know when to go vertical vs horizontal

If your service is stateless (REST API) → horizontal

If your service needs to maintain state or run single-threaded (DB, matching engine) → vertical first

The interviewer is testing: Do you estimate before designing? Do you know the difference between average and peak load? Do you know when each scaling strategy applies?

The Trade-off Triangle

Every scalability decision trades off three things:

     Performance

          ▲

         /|\

        / | \

       /  |  \

      /   |   \

Cost ◄----+----► Simplicity

Vertical scaling: High simplicity, high cost at scale, performance ceiling.

Horizontal scaling: High performance, higher complexity, better cost curve.

Auto-scaling: Best performance/cost ratio, most complex to set up right.

There is no free lunch. Your job as an engineer is to pick the right trade-off for your current scale — not the theoretically perfect architecture.

Real Systems, Real Decisions

Company	Approach	Why
Netflix	Horizontal + Auto-scale	200M+ users, massive traffic spikes
Instagram (2010)	Vertical first	Small team, needed simplicity
Stack Overflow	Mostly vertical	Surprisingly few servers, heavily optimized
Google	Horizontal at extreme scale	Distributed across global data centers

Stack Overflow serving 1.5B requests/month with just 9 web servers is one of the most impressive vertical scaling stories in the industry — proof that "scale horizontally" isn't always the answer.

Key Takeaways

Scalability = handling more load without breaking. It's not optional — it's survival.

Vertical scaling is simple but has a ceiling. Use it early, when your team is small.

Horizontal scaling has no ceiling but requires stateless design.

Stateless design is the prerequisite for horizontal scaling — externalize all state.

Auto-scaling automates capacity management — essential at production scale.

The database is almost always the first bottleneck when you scale app servers.

Always estimate (QPS, servers, storage) before designing. Numbers validate your design.
