Design a production LLM inference system that can handle 100,000 requests per day with:
- P99 latency ≤ 2 seconds for a 70B parameter model
- Support for streaming responses (SSE/WebSockets)
- Cost-efficient GPU utilisation
- Graceful handling of traffic spikes
Your system should serve both interactive (real-time) and batch (async) workloads.
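Before choosing hardware, it can help to turn the daily request figure into a rate. The sketch below converts 100,000 requests per day into an average requests-per-second figure, then applies Little's law to estimate how many requests are in flight at once. The peak factor of 5 and the use of the 2-second latency target as the mean time in system are illustrative assumptions, not values given in the prompt.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: int) -> float:
    """Average request rate if traffic were spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: int, peak_factor: float) -> float:
    """Request rate during a spike, as a multiple of the daily average.

    peak_factor is an assumed ratio of peak to average traffic.
    """
    return average_rps(requests_per_day) * peak_factor


def in_flight_requests(rps: float, mean_latency_s: float) -> float:
    """Little's law: mean concurrency = arrival rate * mean time in system."""
    return rps * mean_latency_s


if __name__ == "__main__":
    daily = 100_000        # from the problem statement
    assumed_peak = 5.0     # assumption: spikes reach 5x the average rate
    assumed_latency = 2.0  # assumption: use the 2 s latency bound as mean latency

    avg = average_rps(daily)
    peak = peak_rps(daily, assumed_peak)
    print(f"average: {avg:.2f} req/s")
    print(f"assumed peak: {peak:.2f} req/s")
    print(f"in flight at peak: {in_flight_requests(peak, assumed_latency):.1f}")
```

The average works out to about 1.16 requests per second, so under these assumptions the hard part of the design is per-request latency and spike handling for a 70B model rather than raw throughput.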
What you'll be assessed on
In a real FAANG interview, the interviewer will probe:
- How you manage KV cache across concurrent requests
- Your batching strategy (static vs continuous batching)
- GPU cost optimisation (spot instances, quantisation, caching)
- Failure recovery and load shedding under pressure
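For the KV cache question, a quick sizing calculation is a useful starting point: it shows how much memory each cached token costs and therefore how many concurrent sequences fit in a given memory budget. The sketch below is a rough estimate under stated assumptions. The `ModelShape` values (80 layers, 8 KV heads with grouped-query attention, head dimension 128, 16-bit values) are an assumed Llama-2-70B-style configuration, and the 40 GiB KV budget and 4,096-token sequence length are hypothetical; none of these come from the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention dimensions that determine KV cache size (assumed values)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    """Bytes of KV cache for one token: a key and a value vector per layer."""
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def max_cached_tokens(shape: ModelShape, kv_budget_bytes: int) -> int:
    """Total tokens that fit in the memory left over after model weights."""
    return kv_budget_bytes // kv_bytes_per_token(shape)


def max_concurrent_sequences(shape: ModelShape, kv_budget_bytes: int,
                             tokens_per_sequence: int) -> int:
    """Sequences that fit if each one reserves its full context up front."""
    return max_cached_tokens(shape, kv_budget_bytes) // tokens_per_sequence


# Assumption: a Llama-2-70B-style shape with grouped-query attention.
ASSUMED_70B = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    budget = 40 * 2**30  # hypothetical 40 GiB reserved for KV cache
    print(kv_bytes_per_token(ASSUMED_70B))           # 327680 bytes (320 KiB)
    print(max_cached_tokens(ASSUMED_70B, budget))    # 131072 tokens
    print(max_concurrent_sequences(ASSUMED_70B, budget, 4096))  # 32
```

Reserving the full context per sequence is the pessimistic case. Allocating cache in small blocks as tokens are generated, which is what paged KV cache designs do, lets more sequences share the same budget and is one reason continuous batching usually outperforms static batching on utilisation.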