System Design Fundamentals Every Developer Should Know

Most developers can build a feature. Far fewer can design a system that handles 10 million users, survives a database failure, and stays fast under load. System design is the skill that separates senior engineers from everyone else — and it is almost never taught in tutorials.

This guide covers the core concepts with real architecture decisions, trade-offs, and code. Not theory for its own sake — the things you actually need when designing production systems.

"Any fool can write code that a computer can understand. Good programmers write code that humans can understand. Great engineers design systems that survive reality." — paraphrased from Martin Fowler

1. The Building Blocks: What Every System Is Made Of

Before designing anything, you need to know the components available to you.

Component	What it does	When to use it
Load Balancer	Distributes traffic across servers	Any system with multiple app instances
CDN	Serves static assets from edge nodes	Images, JS, CSS, video
Cache	Stores frequently accessed data in memory	Read-heavy workloads, expensive queries
Message Queue	Decouples producers from consumers	Async tasks, event-driven systems
Database	Persistent storage	Everything that needs to survive restarts
Search Engine	Full-text and faceted search	Product search, log analysis
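
To make the first row of the table concrete, here is a minimal sketch of round-robin selection, the simplest strategy a load balancer can use. The class name and the backend addresses are made up for illustration; real balancers such as nginx or an AWS ALB also handle health checks, weights, and connection draining.

```typescript
// Hypothetical sketch: round-robin selection over a fixed pool of app servers.
class RoundRobinBalancer {
  private readonly backends: string[];
  private nextIndex = 0;

  constructor(backends: string[]) {
    if (backends.length === 0) {
      throw new Error('At least one backend is required');
    }
    this.backends = backends;
  }

  // Returns the next backend in rotation, wrapping around at the end.
  pick(): string {
    const backend = this.backends[this.nextIndex] as string;
    this.nextIndex = (this.nextIndex + 1) % this.backends.length;
    return backend;
  }
}

const balancer = new RoundRobinBalancer(['app-1:3000', 'app-2:3000', 'app-3:3000']);
const picks = [balancer.pick(), balancer.pick(), balancer.pick(), balancer.pick()];
console.log(picks.join(', ')); // the fourth pick wraps back to app-1:3000
```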

A typical production system looks like this:

text
Client
  ↓
CDN (static assets)
  ↓
Load Balancer (nginx / AWS ALB)
  ↓
App Servers (horizontal scale)
  ↓         ↓
Cache     Message Queue
(Redis)   (Kafka/SQS)
  ↓         ↓
Primary DB  Workers
(Postgres)
  ↓
Read Replicas

2. Horizontal vs Vertical Scaling

The first scaling decision you will face.

Vertical scaling — give the server more CPU, RAM, and disk. Simple, but has a hard ceiling and creates a single point of failure.

Horizontal scaling — add more servers. Theoretically unlimited, but requires your application to be stateless.

text
// Stateful server — CANNOT scale horizontally
// Session stored in memory — only works on one instance
app.post('/login', (req, res) => {
  const user = authenticate(req.body);
  req.session.userId = user.id;  // In-memory session
  res.json({ success: true });
});

// Stateless server — CAN scale horizontally
// Session stored in Redis — shared across all instances
import { Redis } from 'ioredis';
import { v4 as uuid } from 'uuid';

const redis = new Redis(process.env.REDIS_URL);

app.post('/login', async (req, res) => {
  const user = await authenticate(req.body);
  const sessionId = uuid();

  // Store session in Redis — accessible from any app instance
  await redis.setex(`session:${sessionId}`, 86400, JSON.stringify({
    userId: user.id,
    email: user.email,
    role: user.role,
  }));

  res.cookie('session_id', sessionId, { httpOnly: true, secure: true });
  res.json({ success: true });
});

The golden rule of horizontal scaling: your application servers must be stateless. All state — sessions, uploads, locks — must live in a shared external store (Redis, S3, database). If restarting any server loses data, you have a stateful server.
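
One way to check the rule is to ask whether a request can land on any instance and get the same answer. The sketch below assumes a shared store and uses a plain Map as a stand-in for Redis; the interface and class names are hypothetical, and a real store would be reached over the network with async calls.

```typescript
// Hypothetical sketch: two app instances resolving the same session from a
// shared store. A Map stands in for Redis here.
interface SessionData {
  userId: string;
  role: string;
}

interface SharedSessionStore {
  get(sessionId: string): SessionData | undefined;
  set(sessionId: string, data: SessionData): void;
}

class AppInstance {
  private readonly instanceName: string;
  private readonly store: SharedSessionStore;

  constructor(instanceName: string, store: SharedSessionStore) {
    this.instanceName = instanceName;
    this.store = store;
  }

  login(sessionId: string, userId: string): void {
    this.store.set(sessionId, { userId, role: 'member' });
  }

  // Any instance can answer, because no session state lives in the instance.
  whoAmI(sessionId: string): string {
    const session = this.store.get(sessionId);
    return session
      ? `${this.instanceName}: user ${session.userId}`
      : `${this.instanceName}: anonymous`;
  }
}

const sharedStore: SharedSessionStore = new Map<string, SessionData>();
const instanceA = new AppInstance('app-1', sharedStore);
const instanceB = new AppInstance('app-2', sharedStore);

instanceA.login('abc123', '42');                 // login handled by app-1
const answerFromB = instanceB.whoAmI('abc123');  // next request lands on app-2
console.log(answerFromB);
```

If each instance kept its own Map instead, the second call would come back anonymous, which is exactly the failure the in-memory session example above runs into.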

3. Caching: The Biggest Performance Lever

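Caching is often the single most impactful optimization in a read-heavy distributed system. A cache hit can be 100-1000x faster than a database query, depending on how expensive the query is and how far away the cache sits on the network.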
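
How much you actually gain depends on the hit ratio. The back-of-the-envelope calculation below assumes 0.5 ms for a cache lookup and 20 ms for a database query; both figures are illustrative assumptions, not measurements.

```typescript
// Average read latency for a given cache hit ratio.
// The latency figures passed in below are illustrative assumptions.
function averageLatencyMs(hitRatio: number, cacheMs: number, dbMs: number): number {
  if (hitRatio < 0 || hitRatio > 1) {
    throw new RangeError('hitRatio must be between 0 and 1');
  }
  // On a miss we pay for the failed cache lookup and then the database query.
  return hitRatio * cacheMs + (1 - hitRatio) * (cacheMs + dbMs);
}

const assumedCacheMs = 0.5;
const assumedDbMs = 20;
for (const hitRatio of [0, 0.5, 0.9, 0.99]) {
  const avg = averageLatencyMs(hitRatio, assumedCacheMs, assumedDbMs);
  console.log(`${Math.round(hitRatio * 100)}% hits: ${avg.toFixed(2)} ms average`);
}
```

Under these assumptions the average drops from about 20.5 ms with no hits to about 2.5 ms at a 90% hit ratio, which is why the read/write ratio of a workload matters so much when deciding what to cache.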

Cache-Aside Pattern (Most Common)

text
import { Redis } from 'ioredis';
import { db } from './database';

const redis = new Redis(process.env.REDIS_URL);

async function getUserById(userId: string) {
  const cacheKey = `user:${userId}`;

  // 1. Check cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached); // Cache hit — ~0.1ms
  }

  // 2. Cache miss — query database (~10-50ms)
  const user = await db.users.findUnique({ where: { id: userId } });
  if (!user) return null;

  // 3. Store in cache with TTL
  await redis.setex(cacheKey, 3600, JSON.stringify(user)); // 1 hour TTL

  return user;
}

// Invalidate cache when data changes
async function updateUser(userId: string, data: Partial<User>) {
  const updated = await db.users.update({ where: { id: userId }, data });

  // Delete cache entry — next read will repopulate
  await redis.del(`user:${userId}`);

  return updated;
}

Cache Stampede Prevention

When a popular cache key expires, thousands of requests can hit the database simultaneously. Use a lock to prevent this.

text
async function getUserWithLock(userId: string, attempt = 0): Promise<User | null> {
  const cacheKey = `user:${userId}`;
  const lockKey = `lock:user:${userId}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Try to acquire lock (expires in 5 seconds, so a crashed holder cannot block others for long)
  const lockAcquired = await redis.set(lockKey, '1', 'EX', 5, 'NX');

  if (lockAcquired) {
    try {
      // We have the lock: fetch from DB and populate cache
      const user = await db.users.findUnique({ where: { id: userId } });
      await redis.setex(cacheKey, 3600, JSON.stringify(user));
      return user;
    } finally {
      // Release the lock even if the query throws
      await redis.del(lockKey);
    }
  }

  // Another instance is fetching: wait and retry, but give up after about 5 seconds
  if (attempt >= 50) throw new Error(`Timed out waiting for cache lock on ${cacheKey}`);
  await new Promise(resolve => setTimeout(resolve, 100));
  return getUserWithLock(userId, attempt + 1);
}

Cache invalidation is one of the two hard problems in computer science (the other being naming things). Always think about what happens when cached data becomes stale. TTL-based expiry is simple but can serve stale data. Event-based invalidation is precise but complex.
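
The difference between the two approaches is easier to see with a toy cache. This is a minimal in-memory sketch with an injectable clock so the timing can be shown without waiting; the class and variable names are hypothetical.

```typescript
// Minimal in-memory TTL cache with an injectable clock.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // TTL-based expiry
      return undefined;
    }
    return entry.value;
  }

  // Event-based invalidation: call this from the code path that writes.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

let fakeNowMs = 0;
const nameCache = new TtlCache<string>(1000, () => fakeNowMs);

nameCache.set('user:1', 'Ada');
// Suppose the database row changes to 'Grace' here, but nobody tells the cache.
fakeNowMs = 500;
const staleRead = nameCache.get('user:1'); // still 'Ada': stale until the TTL passes
fakeNowMs = 1000;
const afterTtl = nameCache.get('user:1');  // undefined: expired, next read repopulates

nameCache.set('user:1', 'Grace');
nameCache.invalidate('user:1');            // a write event removes the entry at once
const afterInvalidate = nameCache.get('user:1'); // undefined
console.log(staleRead, afterTtl, afterInvalidate);
```

With TTL alone, the staleness window is bounded by the TTL. With event-based invalidation the window is close to zero, but every write path has to remember to call it.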

4. Database Design: The Foundation

Your database schema is the hardest thing to change later. Get it right early.

Indexing Strategy

text
-- Without index: full table scan O(n)
-- With index: B-tree lookup O(log n)

-- Always index foreign keys
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Composite index for common query patterns
-- This index supports: WHERE status = ? AND created_at > ?
CREATE INDEX idx_orders_status_created ON orders(status, created_at DESC);

-- Partial index for filtered queries (much smaller, faster)
-- Only indexes active orders — not the millions of completed ones
CREATE INDEX idx_active_orders ON orders(user_id, created_at)
WHERE status = 'active';

-- Covering index — query satisfied entirely from index, no table lookup
CREATE INDEX idx_user_email_name ON users(email) INCLUDE (name, avatar_url);
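
To get a feel for the O(n) versus O(log n) comment at the top of that block, the sketch below counts how many keys are examined by a linear scan and by a binary search over one million sorted keys. Binary search over an array is only a rough stand-in for a B-tree lookup, and the function names are made up for this example.

```typescript
// Counts how many keys are examined to find a value: a linear scan (no index)
// versus a binary search over sorted keys (a rough stand-in for a B-tree).
function linearScanSteps(keys: number[], wanted: number): number {
  let steps = 0;
  for (const key of keys) {
    steps++;
    if (key === wanted) break;
  }
  return steps;
}

function binarySearchSteps(sortedKeys: number[], wanted: number): number {
  let low = 0;
  let high = sortedKeys.length - 1;
  let steps = 0;
  while (low <= high) {
    steps++;
    const mid = Math.floor((low + high) / 2);
    const key = sortedKeys[mid] as number;
    if (key === wanted) break;
    if (key < wanted) low = mid + 1;
    else high = mid - 1;
  }
  return steps;
}

const orderIds = Array.from({ length: 1_000_000 }, (_, i) => i);
const lookupId = 999_999; // worst case for the scan: the last row
console.log(`scan: ${linearScanSteps(orderIds, lookupId)} keys examined`);
console.log(`binary search: ${binarySearchSteps(orderIds, lookupId)} keys examined`);
```

The scan touches all one million keys, while the binary search needs at most 20 comparisons, because 2^20 is just over a million.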

Read Replicas for Scale

text
import { PrismaClient } from '@prisma/client';

// Primary: handles all writes
const primaryDb = new PrismaClient({
  datasources: { db: { url: process.env.DATABASE_PRIMARY_URL } },
});

// Replica: handles reads (can have multiple)
const replicaDb = new PrismaClient({
  datasources: { db: { url: process.env.DATABASE_REPLICA_URL } },
});

// Route reads to replica, writes to primary
async function getProducts(filters: ProductFilters) {
  return replicaDb.product.findMany({ where: filters }); // Read replica
}

async function createOrder(data: CreateOrderInput) {
  return primaryDb.order.create({ data }); // Primary
}

Read replicas lag behind the primary, often by 10-100ms and sometimes by much more under heavy write load. For most reads this is fine. For reads that immediately follow a write (e.g., "show me the order I just placed"), always read from the primary to avoid seeing stale data.
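
One simple way to apply that rule is to remember when each user last wrote and send their reads to the primary for a short window afterwards. The sketch below is a hypothetical helper, not part of Prisma; the one-second window is an assumed upper bound on replication lag, and with several app instances the last-write timestamps would need to live in a shared store or a cookie.

```typescript
// Hypothetical sketch: pick primary or replica per read, based on how
// recently the same user wrote.
type DbTarget = 'primary' | 'replica';

class ReadRouter {
  private readonly lastWriteAt = new Map<string, number>();
  private readonly stickyWindowMs: number;
  private readonly now: () => number;

  constructor(stickyWindowMs: number, now: () => number = Date.now) {
    this.stickyWindowMs = stickyWindowMs;
    this.now = now;
  }

  recordWrite(userId: string): void {
    this.lastWriteAt.set(userId, this.now());
  }

  targetForRead(userId: string): DbTarget {
    const lastWrite = this.lastWriteAt.get(userId);
    if (lastWrite !== undefined && this.now() - lastWrite < this.stickyWindowMs) {
      return 'primary'; // the replica may not have this user's write yet
    }
    return 'replica';
  }
}

let routerClockMs = 0;
const readRouter = new ReadRouter(1000, () => routerClockMs);

const beforeWrite = readRouter.targetForRead('user-1');     // 'replica'
readRouter.recordWrite('user-1');
routerClockMs = 200;
const rightAfterWrite = readRouter.targetForRead('user-1'); // 'primary'
const otherUser = readRouter.targetForRead('user-2');       // 'replica'
routerClockMs = 1500;
const muchLater = readRouter.targetForRead('user-1');       // 'replica'
console.log(beforeWrite, rightAfterWrite, otherUser, muchLater);
```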

5. Message Queues: Decoupling for Resilience

When a user places an order, you need to: charge their card, send a confirmation email, update inventory, notify the warehouse, and record analytics. Doing all of this synchronously in the request handler is fragile — one failure blocks everything.

Message queues decouple these concerns.

text
// Producer: order service publishes an event
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ brokers: [process.env.KAFKA_BROKER!] });
const producer = kafka.producer();

async function placeOrder(orderData: CreateOrderInput) {
  // 1. Save order to database (synchronous — user needs confirmation)
  const order = await db.orders.create({ data: orderData });

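  // 2. Publish event. We wait only for the broker to accept it, not for any consumer.
  //    Assumes `await producer.connect()` was called once at startup.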
  await producer.send({
    topic: 'order.created',
    messages: [{
      key: order.id,
      value: JSON.stringify({
        orderId: order.id,
        userId: order.userId,
        items: order.items,
        total: order.total,
        timestamp: new Date().toISOString(),
      }),
    }],
  });

  // 3. Return immediately — don't wait for email, inventory, etc.
  return order;
}

text
// Consumer: email service subscribes independently
const consumer = kafka.consumer({ groupId: 'email-service' });

async function startEmailConsumer() {
  await consumer.connect(); // KafkaJS requires connect() before subscribe()
  await consumer.subscribe({ topic: 'order.created', fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const order = JSON.parse(message.value!.toString());

      try {
        await sendOrderConfirmationEmail(order);
        console.log(`Email sent for order ${order.orderId}`);
      } catch (error) {
        // Failed messages can be retried — the queue persists them
        console.error(`Email failed for order ${order.orderId}:`, error);
        throw error; // Kafka will retry
      }
    },
  });
}

// Inventory service subscribes to the same event independently
// Analytics service subscribes independently
// Warehouse service subscribes independently
// All decoupled — one failure doesn't affect the others

The key benefit: if your email service goes down for an hour, no orders are lost. When it comes back up, it processes the backlog from the queue. Without a queue, those emails would be gone forever.
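
The backlog behavior can be shown with a toy in-memory log that tracks an offset per consumer, loosely modelled on how a Kafka consumer group tracks its position. This is not a real broker and all names are hypothetical; it only illustrates why a consumer that was offline loses nothing.

```typescript
// Toy in-memory event log with per-consumer offsets.
class InMemoryEventLog<T> {
  private readonly events: T[] = [];
  private readonly offsets = new Map<string, number>();

  publish(event: T): void {
    this.events.push(event);
  }

  // Delivers everything the given consumer has not yet processed and
  // returns how many events were handled.
  poll(consumerId: string, handler: (event: T) => void): number {
    let offset = this.offsets.get(consumerId) ?? 0;
    let processed = 0;
    while (offset < this.events.length) {
      handler(this.events[offset] as T);
      offset++;
      processed++;
      this.offsets.set(consumerId, offset); // commit only after the handler returns
    }
    return processed;
  }
}

const orderLog = new InMemoryEventLog<{ orderId: string }>();
const emailsSent: string[] = [];
const stockUpdated: string[] = [];

orderLog.publish({ orderId: 'o-1' });
orderLog.poll('inventory-service', e => stockUpdated.push(e.orderId));

// The email service is down during this period, so it never polls.
orderLog.publish({ orderId: 'o-2' });
orderLog.publish({ orderId: 'o-3' });
orderLog.poll('inventory-service', e => stockUpdated.push(e.orderId));

// The email service comes back and works through its backlog.
const caughtUp = orderLog.poll('email-service', e => emailsSent.push(e.orderId));
console.log(caughtUp, emailsSent);
```

The inventory consumer kept up the whole time, and the email consumer still receives all three orders once it returns, in the order they were published.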

6. API Design: REST vs GraphQL vs tRPC

	REST	GraphQL	tRPC
Best for	Public APIs, mobile clients	Complex data graphs, flexible queries	Full-stack TypeScript monorepos
Type safety	Manual (OpenAPI)	Schema-based	End-to-end automatic
Over-fetching	Common	Solved by design	Solved by design
Learning curve	Low	Medium	Low (if you know TypeScript)
Caching	Easy (HTTP cache)	Complex	Easy

tRPC: End-to-End Type Safety

text
// server/routers/user.ts
import { z } from 'zod';
import { router, protectedProcedure, publicProcedure } from '../trpc';

export const userRouter = router({
  // Public query — no auth required
  getProfile: publicProcedure
    .input(z.object({ username: z.string() }))
    .query(async ({ input }) => {
      return db.users.findUnique({
        where: { username: input.username },
        select: { id: true, name: true, bio: true, avatar: true },
      });
    }),

  // Protected mutation — requires authentication
  updateProfile: protectedProcedure
    .input(z.object({
      name: z.string().min(1).max(100),
      bio: z.string().max(500).optional(),
    }))
    .mutation(async ({ input, ctx }) => {
      return db.users.update({
        where: { id: ctx.user.id },
        data: input,
      });
    }),
});

// client/components/Profile.tsx
// Zero boilerplate — types flow automatically from server to client
import { trpc } from '@/lib/trpc';

function ProfileEditor() {
  const { data: profile } = trpc.user.getProfile.useQuery({ username: 'vighnesh' });
  const updateProfile = trpc.user.updateProfile.useMutation();

  // profile.name is typed — TypeScript knows the exact shape
  // updateProfile.mutate() is typed — wrong input = compile error
  return (
    <form onSubmit={e => {
      e.preventDefault();
      updateProfile.mutate({ name: 'New Name', bio: 'Updated bio' });
    }}>
      <input defaultValue={profile?.name} name="name" />
      <button type="submit">Save</button>
    </form>
  );
}

7. The CAP Theorem: The Fundamental Trade-off

The CAP theorem states that a distributed system can provide at most two of three guarantees at the same time:

text
              Consistency
            (every read gets
            the latest write)
                  /\
                 /  \
                /    \
           CA  /      \  CP
              /        \
             /__________\
Availability      AP      Partition
(system stays up          Tolerance
during failures)          (survives
                          network splits)

CP systems (MongoDB, HBase): Consistent and partition-tolerant. Will refuse requests rather than return stale data. Good for financial systems.
AP systems (Cassandra, DynamoDB): Available and partition-tolerant. Will return potentially stale data rather than go down. Good for social feeds, shopping carts.
CA systems (traditional RDBMS): Consistent and available. Cannot survive network partitions — only viable in single-node setups.

In practice, network partitions always happen eventually. So the real choice is between CP and AP: do you want consistency or availability when the network fails? Many web applications choose AP for data such as feeds and carts and handle eventual consistency in the application layer.
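
The CP versus AP choice comes down to what a node does when it is cut off. The toy model below simulates a single replica during a partition; it is a deliberately simplified sketch with hypothetical names, not how any of the databases named above are implemented.

```typescript
// Toy model of one replica that has been cut off from the primary.
// 'CP' refuses reads it cannot prove are current; 'AP' serves what it has.
type PartitionMode = 'CP' | 'AP';

class ReplicaNode {
  private value: string;
  private partitioned = false;
  private readonly mode: PartitionMode;

  constructor(mode: PartitionMode, initialValue: string) {
    this.mode = mode;
    this.value = initialValue;
  }

  setPartitioned(partitioned: boolean): void {
    this.partitioned = partitioned;
  }

  // Replication from the primary only arrives while the network is healthy.
  replicate(value: string): void {
    if (!this.partitioned) this.value = value;
  }

  read(): string {
    if (this.partitioned && this.mode === 'CP') {
      throw new Error('unavailable: cannot confirm latest write');
    }
    return this.value; // may be stale in AP mode
  }
}

const apReplica = new ReplicaNode('AP', 'v1');
const cpReplica = new ReplicaNode('CP', 'v1');

// The network splits, then the primary accepts a write of 'v2'.
for (const replica of [apReplica, cpReplica]) {
  replica.setPartitioned(true);
  replica.replicate('v2'); // never arrives
}

const apOutcome = apReplica.read(); // 'v1': available but stale
let cpOutcome: string;
try {
  cpOutcome = cpReplica.read();
} catch (error) {
  cpOutcome = (error as Error).message; // refused rather than stale
}
console.log(apOutcome, '|', cpOutcome);
```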

8. Rate Limiting at Scale

text
// Distributed rate limiting with Redis sliding window
import { Redis } from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  resetAt: number;
}

async function checkRateLimit(
  identifier: string,
  limit: number,
  windowSeconds: number
): Promise<RateLimitResult> {
  const now = Date.now();
  const windowStart = now - windowSeconds * 1000;
  const key = `ratelimit:${identifier}`;

  // Atomic Lua script — prevents race conditions
  const script = `
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window_start = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])
    local window_seconds = tonumber(ARGV[4])

    -- Remove expired entries
    redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

    -- Count current requests in window
    local count = redis.call('ZCARD', key)

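    if count < limit then
      -- Add current request. The member must be unique, otherwise two
      -- requests in the same millisecond would be counted as one.
      redis.call('ZADD', key, now, ARGV[5])
      redis.call('EXPIRE', key, window_seconds)
      return {1, limit - count - 1}
    else
      return {0, 0}
    end
  `;

  // Unique sorted-set member for this request
  const member = `${now}-${Math.random().toString(36).slice(2)}`;

  const result = await redis.eval(
    script, 1, key,
    now, windowStart, limit, windowSeconds, member
  ) as [number, number];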

  return {
    allowed: result[0] === 1,
    remaining: result[1],
    resetAt: now + windowSeconds * 1000,
  };
}
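
The sliding-window logic inside the Lua script can be hard to follow at first. Here is the same idea as a single-process, in-memory sketch that can be run without Redis. It is only an illustration with made-up names; it does not replace the Redis version, since in-process state is not shared across app instances.

```typescript
// In-memory version of the sliding-window check, for a single process only.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(identifier: string, nowMs: number): { allowed: boolean; remaining: number } {
    const windowStart = nowMs - this.windowMs;
    // Drop timestamps that have left the window (mirrors ZREMRANGEBYSCORE).
    const recent = (this.hits.get(identifier) ?? []).filter(t => t > windowStart);
    const allowed = recent.length < this.limit; // mirrors the ZCARD comparison
    if (allowed) recent.push(nowMs);            // mirrors ZADD
    this.hits.set(identifier, recent);
    return { allowed, remaining: allowed ? this.limit - recent.length : 0 };
  }
}

// 3 requests per 1000 ms for one client, at hypothetical timestamps.
const limiter = new SlidingWindowLimiter(3, 1000);
const outcomes = [0, 100, 200, 300, 1001].map(
  t => limiter.check('ip:203.0.113.7', t).allowed
);
console.log(outcomes); // [ true, true, true, false, true ]
```

The fourth request is rejected because three requests already sit inside the window. By 1001 ms the request made at 0 ms has slid out, so a slot is free again.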

9. Designing for Failure

The most important mindset shift in distributed systems: assume everything will fail. Servers crash. Networks partition. Databases go down. Disks fill up. Design for it.

Circuit Breaker — If a downstream service is failing, stop calling it. Return a cached response or a graceful error. Let it recover before retrying.

Retry with Exponential Backoff — Retry failed requests, but wait longer between each attempt. Add jitter (random delay) to prevent thundering herd.

Timeouts Everywhere — Every network call must have a timeout. A hanging request that never times out will eventually exhaust your connection pool.

Graceful Degradation — If the recommendation service is down, show popular items instead. If the search service is down, show a message. Never let one failure cascade into a total outage.

text
// Retry with exponential backoff and jitter
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts) throw error;

      // Exponential backoff with jitter
      const delay = baseDelayMs * Math.pow(2, attempt - 1);
      const jitter = Math.random() * delay * 0.1;
      await new Promise(resolve => setTimeout(resolve, delay + jitter));
    }
  }
  throw new Error('Unreachable');
}

// Usage
const user = await withRetry(() => externalUserService.getUser(userId));
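
Retries help with brief glitches, but they make things worse when a dependency is down for minutes. That is where the circuit breaker described above comes in. The following is a minimal sketch of the usual closed / open / half-open state machine; the class, the thresholds, and the `callWithBreaker` helper are illustrative assumptions rather than a production library.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns false while the breaker is open, so callers can fail fast.
  allowRequest(): boolean {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.cooldownMs) return false;
      this.state = 'half-open'; // cool-down over: allow trial requests
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    // A failed trial request reopens the breaker immediately.
    if (this.state === 'half-open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}

// Wraps an async call: skips the call and uses the fallback while open.
async function callWithBreaker<T>(
  breaker: CircuitBreaker,
  fn: () => Promise<T>,
  fallback: () => T
): Promise<T> {
  if (!breaker.allowRequest()) return fallback();
  try {
    const result = await fn();
    breaker.recordSuccess();
    return result;
  } catch {
    breaker.recordFailure();
    return fallback();
  }
}
```

A typical use would be one breaker per downstream service, for example wrapping a recommendations call with a fallback that returns popular items, which ties the breaker to the graceful degradation idea above.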

The System Design Checklist

Before any architecture review, ask:

Where are the single points of failure?
What happens when the database goes down?
How does this scale from 1,000 to 1,000,000 users?
What is the read/write ratio? (Informs caching strategy)
What consistency guarantees does the business actually need?
What is the acceptable latency at p99?
How do we monitor and alert when things go wrong?
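
The p99 question is worth making concrete, because averages hide the slow requests that users actually notice. The sketch below computes percentiles with the nearest-rank method over a made-up sample of 100 latencies; monitoring systems usually estimate percentiles from histograms instead, so treat this as an illustration of the definition only.

```typescript
// Nearest-rank percentile over a sample of request latencies (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1] as number;
}

// 100 hypothetical requests: 98 fast ones and 2 slow outliers.
const latenciesMs = [...Array.from({ length: 98 }, () => 20), 900, 1200];
const meanMs = latenciesMs.reduce((sum, ms) => sum + ms, 0) / latenciesMs.length;

console.log(`mean: ${meanMs.toFixed(1)} ms`);
console.log(`p50:  ${percentile(latenciesMs, 50)} ms`);
console.log(`p99:  ${percentile(latenciesMs, 99)} ms`);
```

In this sample the median is 20 ms and the mean is about 40 ms, yet the p99 is 900 ms. One request in a hundred takes nearly a second, and only the percentile shows it.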

System design is not about finding the perfect architecture. It is about making explicit trade-offs and being honest about what you are optimizing for.

Vighnesh Salunkhe
