ML System Design combines software architecture with machine learning. Interviews test recommendation systems, search, real-time ML, and production ML infrastructure at scale.
Key Concepts to Know
Practice System Design with AI
Timed session with instant scoring, voice support, and model answers.
10 Interview Questions
Browse all topics →How do you handle prompt injection attacks in LLM applications?
Model Answer
Prompt injection: malicious user input that overrides system instructions (e.g., "Ignore previous instructions and..."). Defenses: input validation (detect injection patterns with a classifier), output validation (check responses against expected format/content), use separate system/user turn clearly delimited, minimal privilege (agent has only necessary tools), sandboxing (don't let agents execute arbitrary code), monitoring for anomalous outputs, human-in-the-loop for sensitive operations, use a fine-tuned safety classifier before/after generation.
How would you build a real-time AI news summarization system?
Model Answer
Pipeline: 1) Data ingestion: scrape RSS feeds + news APIs every 5 min using async workers (Celery + Redis). 2) Deduplication: SimHash or MinHash to detect near-duplicate articles. 3) Topic clustering: embed articles, cluster similar ones (DBSCAN or agglomerative). 4) Summarization: prompt LLM to summarize cluster of related articles in 2-3 sentences. 5) Storage: summaries in Postgres with metadata, full articles in object storage. 6) Delivery: REST API + WebSocket for real-time updates to clients. Scale: rate limit news sources, use streaming summarization, cache summaries for popular topics.
How do you handle prompt injection attacks in LLM applications?
Model Answer
Prompt injection: malicious user input that overrides system instructions (e.g., "Ignore previous instructions and..."). Defenses: input validation (detect injection patterns with a classifier), output validation (check responses against expected format/content), use separate system/user turn clearly delimited, minimal privilege (agent has only necessary tools), sandboxing (don't let agents execute arbitrary code), monitoring for anomalous outputs, human-in-the-loop for sensitive operations, use a fine-tuned safety classifier before/after generation.
Design a multi-tenant LLM inference service with per-tenant quotas and isolation.
Model Answer
Architecture: API gateway (Kong / AWS API GW) authenticates tenants and enforces per-tenant rate limits (token bucket in Redis). Behind it, a request router maps tenants to model pools — small/cheap models share a pool, premium tenants get dedicated GPU pods. Each pool runs vLLM with continuous batching. Quotas: tokens/min and concurrent requests, tracked in Redis with sliding-window counters. Fairness: priority queue per pool so a noisy tenant can't starve others (vLLM has a priority param). Isolation: separate Kubernetes namespaces per tier, NetworkPolicies prevent cross-tenant traffic, separate logs per tenant for audit. Cost: route by tenant tier — free tier shares a Llama 3.1 8B pool with low priority; enterprise gets a dedicated 70B pod.
How would you design an LLM inference system to serve 10K concurrent users?
Model Answer
Use vLLM or TGI for continuous batching and PagedAttention (maximizes GPU utilization). Architecture: load balancer → N inference pods (each with 1-4 GPUs). For high throughput: batch requests, use async generation. Autoscaling: scale on GPU utilization or queue depth. For cost: use spot instances with fallback, quantize to INT8/INT4. Routing: route short queries to smaller models, long/complex to larger. KV-cache sharing between requests with same system prompt (prefix caching). Cache frequent/deterministic responses at the API gateway layer.
Design a production RAG system for a 10M document corpus with <500ms P95 latency.
Model Answer
Architecture: 1) Indexing pipeline: chunk docs → embed with batch GPU inference → upsert to vector DB (Pinecone/Qdrant) with metadata. 2) Serving: query → embed (fast, small model) → ANN search (HNSW index, ~10ms) → re-rank top-50 to top-5 → LLM generation. Latency breakdown: embedding ~20ms, retrieval ~10ms, reranking ~50ms, generation ~300ms = ~380ms. Optimizations: cache embeddings of frequent queries, pre-warm LLM, use streaming for generation, async retrieval + generation overlap. Scale: horizontal pod scaling for embedding service, read replicas for vector DB.
How would you build a real-time AI news summarization system?
Model Answer
Pipeline: 1) Data ingestion: scrape RSS feeds + news APIs every 5 min using async workers (Celery + Redis). 2) Deduplication: SimHash or MinHash to detect near-duplicate articles. 3) Topic clustering: embed articles, cluster similar ones (DBSCAN or agglomerative). 4) Summarization: prompt LLM to summarize cluster of related articles in 2-3 sentences. 5) Storage: summaries in Postgres with metadata, full articles in object storage. 6) Delivery: REST API + WebSocket for real-time updates to clients. Scale: rate limit news sources, use streaming summarization, cache summaries for popular topics.
How do you design an evaluation pipeline for a RAG system?
Model Answer
Build it in 4 layers. (1) Unit eval: synthetic Q&A pairs covering core capabilities — run on every PR. (2) Component eval: retrieval (NDCG@10, recall@5), generation (faithfulness, answer relevancy via RAGAS or LLM-as-judge). (3) End-to-end eval: real user queries replayed against golden answers. (4) Production: thumbs up/down on responses, sampled human review. Tooling: store every prompt/response pair with metadata in a feedback table; nightly batch job runs LLM-as-judge on a sample. Trigger retraining/reindexing alerts when faithfulness drops below threshold. Critical: keep the eval set frozen — if it changes, regressions become invisible.
Design a production RAG system for a 10M document corpus with <500ms P95 latency.
Model Answer
Architecture: 1) Indexing pipeline: chunk docs → embed with batch GPU inference → upsert to vector DB (Pinecone/Qdrant) with metadata. 2) Serving: query → embed (fast, small model) → ANN search (HNSW index, ~10ms) → re-rank top-50 to top-5 → LLM generation. Latency breakdown: embedding ~20ms, retrieval ~10ms, reranking ~50ms, generation ~300ms = ~380ms. Optimizations: cache embeddings of frequent queries, pre-warm LLM, use streaming for generation, async retrieval + generation overlap. Scale: horizontal pod scaling for embedding service, read replicas for vector DB.
How would you design an LLM inference system to serve 10K concurrent users?
Model Answer
Use vLLM or TGI for continuous batching and PagedAttention (maximizes GPU utilization). Architecture: load balancer → N inference pods (each with 1-4 GPUs). For high throughput: batch requests, use async generation. Autoscaling: scale on GPU utilization or queue depth. For cost: use spot instances with fallback, quantize to INT8/INT4. Routing: route short queries to smaller models, long/complex to larger. KV-cache sharing between requests with same system prompt (prefix caching). Cache frequent/deterministic responses at the API gateway layer.
Related Topics