Design LLM Serving at Scale — System Design Practice | AmanAI Lab


OpenAI, Google, Meta, MSFT


Hard


Your system should serve both interactive (real-time) and batch (async) workloads.


Hints (if stuck)

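💡 With PagedAttention, each A100 80GB can serve roughly 24 concurrent requests, versus about 2-3 when the KV cache is pre-allocated contiguously per request. Treat these as rough figures: the real number depends on model size, quantisation, and context length.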

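💡 vLLM's continuous batching removes the "wait for all to finish" bottleneck of static batching: finished sequences leave the batch and queued requests join between decoding iterations.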

💡 Semantic cache: embed the prompt, search past prompts, and reuse the cached response when the similarity is ≥ 0.97.

💡 Enforce per-tenant rate limits before requests enter the GPU queue, not after, so a single tenant cannot fill the queue.

💡 Streaming changes the latency metric from total time to time-to-first-token (TTFT).
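The semantic-cache hint can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes embeddings are already computed by some external model (none is called here), uses cosine similarity as the similarity measure (the hint does not name one), and does a linear scan where a real system would likely use a vector index. The class name `SemanticCache` and its methods are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by prompt embedding."""

    def __init__(self, threshold: float = 0.97) -> None:
        self.threshold = threshold  # threshold taken from the hint
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the most similar cached response, or None below the threshold."""
        best_score, best_response = -1.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")   # toy 2-d embedding for illustration
near_hit = cache.get([0.99, 0.05])       # nearly parallel vector: cache hit
miss = cache.get([0.9, 0.3])             # similarity about 0.95: cache miss
```

A hit skips the GPU entirely, so the threshold trades cost savings against the risk of returning an answer to a subtly different question.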


Problem

Design a production LLM inference system that can handle 100,000 requests per day with:

• P99 latency ≤ 2 seconds for a 70B-parameter model

• Support for streaming responses (SSE/WebSockets)

• Cost-efficient GPU utilisation

• Graceful handling of traffic spikes
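A quick back-of-envelope calculation helps turn these requirements into a load figure. The sketch below converts 100,000 requests per day into an average request rate and then uses Little's law (in-flight requests = arrival rate × time in system) with the 2-second latency bound. The peak-to-average factor of 5 is an assumption chosen for illustration; the problem statement does not give one.

```python
SECONDS_PER_DAY = 86_400


def average_rps(requests_per_day: float) -> float:
    """Average arrival rate in requests per second."""
    return requests_per_day / SECONDS_PER_DAY


def in_flight_requests(arrival_rps: float, latency_s: float) -> float:
    """Little's law: mean number of requests in the system at once."""
    return arrival_rps * latency_s


REQUESTS_PER_DAY = 100_000   # from the problem statement
LATENCY_BOUND_S = 2.0        # P99 target, used here as an upper bound
PEAK_FACTOR = 5.0            # assumed peak-to-average ratio (not from the problem)

avg_rps = average_rps(REQUESTS_PER_DAY)
peak_rps = avg_rps * PEAK_FACTOR
peak_in_flight = in_flight_requests(peak_rps, LATENCY_BOUND_S)

print(f"average: {avg_rps:.2f} req/s")
print(f"assumed peak: {peak_rps:.2f} req/s")
print(f"in flight at peak: {peak_in_flight:.1f} requests")
```

The average works out to a little over one request per second, so the hard parts of this design are the latency target and the spikes rather than raw throughput.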

What you'll be assessed on

• How you manage KV cache across concurrent requests

• Your batching strategy (static vs continuous batching)

• GPU cost optimisation (spot instances, quantisation, caching)

• Failure recovery and load shedding under pressure
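For the KV-cache point, it can help to size the cache explicitly. The sketch below compares reserving a full context window per request with allocating fixed-size blocks on demand, which is the idea behind PagedAttention. Every number in it is an assumption for illustration: the model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values), the 20 GiB KV budget, the 4096-token window, the 500-token typical request, and the 16-token block size. The function names are hypothetical.

```python
import math
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Assumed shape of a 70B-class decoder with grouped-query attention."""
    n_layers: int = 80
    n_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # FP16/BF16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value


def max_requests_contiguous(budget_bytes: int, max_seq_len: int, shape: ModelShape) -> int:
    """Each request reserves KV space for the full context window up front."""
    return budget_bytes // (max_seq_len * kv_bytes_per_token(shape))


def max_requests_paged(budget_bytes: int, actual_tokens: int,
                       block_tokens: int, shape: ModelShape) -> int:
    """Each request holds only the blocks its tokens actually occupy."""
    blocks = math.ceil(actual_tokens / block_tokens)
    return budget_bytes // (blocks * block_tokens * kv_bytes_per_token(shape))


shape = ModelShape()
budget = 20 * GIB  # hypothetical memory left for KV cache after weights

contiguous = max_requests_contiguous(budget, max_seq_len=4096, shape=shape)
paged = max_requests_paged(budget, actual_tokens=500, block_tokens=16, shape=shape)

print(f"KV cache per token: {kv_bytes_per_token(shape) / 1024:.0f} KiB")
print(f"contiguous reservation: {contiguous} concurrent requests")
print(f"paged allocation: {paged} concurrent requests")
```

Under these assumed numbers the contiguous scheme fits 16 requests and the paged scheme 128. The figures are illustrative only and are not meant to reproduce the rule of thumb in the hints; the point is that the gap comes from unused reserved context, so it widens as typical requests get shorter relative to the window.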