MLOps covers the full production ML lifecycle. Interviews test model serving, monitoring, CI/CD, feature stores, drift detection, and infrastructure design.
14 Interview Questions
What is vLLM and why is it important for LLM serving?
Model Answer
vLLM is an LLM serving framework built around PagedAttention, a memory management technique inspired by OS paging that substantially improves GPU memory utilization for the KV-cache. Traditional serving reserves contiguous KV-cache memory per request and wastes much of it to fragmentation and over-reservation, because sequence lengths vary. PagedAttention stores the KV-cache in fixed-size, non-contiguous blocks allocated on demand, so more requests fit in a batch; the vLLM authors report up to 24x higher throughput than HuggingFace Transformers. It also supports continuous batching, tensor parallelism, and quantization (GPTQ, AWQ), and is widely used in production for cost-effective inference.
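The block-based idea can be illustrated with a toy allocator. This is only a sketch of the bookkeeping, not vLLM's actual implementation; the class name `PagedKVCache` and its methods are hypothetical, and no real tensors are stored.

```python
class PagedKVCache:
    """Toy bookkeeping for a paged KV-cache (hypothetical, not vLLM's code)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens cached so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token of a sequence."""
        n = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are claimed one at a time, a sequence wastes at most one partially filled block, instead of a full maximum-length reservation.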
What is A/B testing for ML models and how do you implement it?
Model Answer
Split traffic between model A (control) and model B (experiment) to compare performance under real conditions. Implementation: traffic splitting at the API gateway level (10% to B, 90% to A), track metrics per variant with consistent user assignment (hash user_id to ensure same user always gets same variant), run for sufficient duration (power analysis to determine sample size), use hypothesis testing (t-test for continuous metrics, chi-square for categorical). Pitfalls: novelty effect (users behave differently with new things), leakage between variants, network effects.
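Consistent user assignment can be sketched as follows. The function name `assign_variant`, the experiment label, and the choice of SHA-256 are illustrative assumptions; any stable hash would serve.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "model-b-rollout",
                   treatment_pct: int = 10) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (experiment)."""
    # Salting with the experiment name (a hypothetical label) keeps bucket
    # assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "B" if bucket < treatment_pct else "A"
```

Hashing rather than random sampling means no assignment table has to be stored: the same `user_id` always lands in the same bucket, and raising `treatment_pct` only moves users from A to B, never back.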
What is the difference between batch inference and online inference?
Model Answer
Batch inference: process a large dataset offline, results stored for later use. Pros: maximize throughput, use large batch sizes, cost-effective, GPU utilization can hit 90%+. Cons: latency is hours/days, not real-time. Use for: weekly recommendation updates, offline evaluation, data pipelines. Online inference: real-time, low-latency (<100ms) response to individual requests. Pros: immediate results. Cons: lower GPU utilization (must size for peak load), higher cost per query, need to handle variable load. Use for: chatbots, search, content moderation. Streaming inference: a variant of online inference in which an LLM returns tokens as they are generated; total generation time is unchanged, but perceived latency improves.
How do you detect and handle model drift in production?
Model Answer
Data drift: input distribution changes (feature statistics, text topics). Concept drift: relationship between inputs and outputs changes. Detection methods: statistical tests (KS test, PSI — Population Stability Index), embedding drift (compare distributions in vector space), prediction drift (monitor output distribution), custom metrics (accuracy on labeled samples). Responses: automated retraining pipelines triggered by drift signals, shadow deployment of new model, A/B testing, canary deployment. Tools: Evidently AI, WhyLabs, Arize, Fiddler. Track: data quality metrics, model performance metrics, operational metrics.
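As one example of a statistical drift check, PSI for a single numeric feature can be computed as below. This is a minimal sketch assuming equal-width bins taken from the baseline sample and non-empty inputs; the function name `psi`, the bin count, and the `eps` floor are illustrative choices.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, but these thresholds are conventions to calibrate per feature, not guarantees.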
How do you cost-optimize an LLM service handling 10M requests/day?
Model Answer
Top levers, roughly in order of impact: (1) Cache aggressively: prefix caching for shared system prompts (providers typically discount cached input tokens by 50-90%) and exact-match response caching for FAQs (a cache hit costs orders of magnitude less than re-running the model). (2) Route by complexity: send simple queries to a small, cheap model (Haiku-class) and complex ones to a large model (Opus-class). (3) Quantize self-hosted models to INT4 (AWQ/GPTQ): up to about 4× throughput, often with under 1% quality loss, though this should be verified on your own evals. (4) Use continuous batching (vLLM / TGI): 10-20× throughput vs. naive request-at-a-time serving. (5) Stream responses so generation can stop as soon as a client disconnects, which frees GPU memory sooner. (6) Right-size context: trim history aggressively and summarize old turns. (7) Monitor cost per request as a first-class metric and alert on anomalies.
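Exact-match response caching, the simplest of these levers, might look like the sketch below. The class name `ExactMatchCache`, the `generate` callback, and the normalization rule (lowercasing and collapsing whitespace) are assumptions for illustration; a production cache would also need TTLs and a shared store.

```python
import hashlib
from collections import OrderedDict


class ExactMatchCache:
    """In-memory LRU cache keyed on (model, normalized prompt)."""

    def __init__(self, max_entries=10000):
        self._store = OrderedDict()
        self.max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Assumed normalization: case-insensitive, whitespace-insensitive.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_generate(self, model, prompt, generate):
        """Return a cached response, or call generate(prompt) and cache it."""
        key = self._key(model, prompt)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = generate(prompt)
        self._store[key] = response
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return response
```

The `hits` and `misses` counters give the cache hit rate, which feeds directly into the cost-per-request metric in lever (7).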
What metrics do you track for an LLM in production?
Model Answer
Quality metrics: task-specific accuracy (e.g., BLEU, ROUGE, answer correctness), user satisfaction ratings, hallucination rate, safety filter triggers. Operational metrics: latency (P50, P95, P99 TTFT — Time To First Token, and TPOT — Time Per Output Token), throughput (tokens/second), error rates, retry rates. Cost metrics: cost per query, GPU utilization, tokens per dollar. Business metrics: task completion rate, user engagement, escalation rate. Alert on: latency spikes, accuracy drops, unusual output distribution, cost anomalies.
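The latency metrics can be derived from per-token timestamps. The sketch below assumes each request logs its start time and the arrival time of every output token, in seconds; the function names and the nearest-rank percentile method are illustrative choices.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_and_tpot(request_start, token_times):
    """Return (time to first token, mean time per output token) in seconds."""
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, 0.0
    # TPOT averages the gaps between tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

In use, `ttft_and_tpot` would be applied per request and `percentile` to the collected values, for example `percentile(ttfts, 99)` for P99 TTFT.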
What is the difference between ML training and inference infrastructure?
Model Answer
Training: high-memory GPUs (A100 80GB, H100), fast interconnect (NVLink, InfiniBand) for multi-GPU jobs, high storage I/O for training data, spot instances acceptable (with checkpointing), synchronous batch processing. Inference: optimized for latency and throughput; can use quantized models (INT8, INT4), request batching, and horizontal scaling; needs SLA guarantees, dynamic batching support (vLLM, TGI), and KV-cache management. For a heavily used model, cumulative inference cost can reach 10-100x the amortized training cost, so optimization is critical.
What is shadow deployment and when do you use it?
Model Answer
Shadow deployment runs a new model in parallel with the production model: every request is sent to BOTH, but only the production model's response is returned to the user. The new model's outputs are logged for offline analysis. Use it to: validate a new model's quality on real production traffic without risk, compare latency / cost at realistic load, build a labeled dataset (production response = ground truth proxy). Once you trust the shadow model, ramp via canary (5% → 25% → 100%). Shadow doubles your inference cost during the test, but it's the safest way to validate a model swap.
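The request path for a shadow deployment can be sketched as follows. The class name `ShadowRouter`, the in-memory `comparisons` list (a stand-in for a real log sink), and the thread pool are assumptions; the models are plain callables here.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("shadow")


class ShadowRouter:
    """Serve from the production model; mirror each request to a shadow model."""

    def __init__(self, prod_model, shadow_model, max_workers=4):
        self.prod_model = prod_model
        self.shadow_model = shadow_model
        self.comparisons = []  # stand-in for a real log sink
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _run_shadow(self, request, prod_response):
        try:
            shadow_response = self.shadow_model(request)
        except Exception:
            # Shadow failures are logged and must never affect the user.
            logger.exception("shadow model failed")
            return
        self.comparisons.append(
            {"request": request, "prod": prod_response, "shadow": shadow_response}
        )

    def handle(self, request):
        prod_response = self.prod_model(request)
        # The shadow call runs off the request path, so it adds no user latency.
        self._pool.submit(self._run_shadow, request, prod_response)
        return prod_response

    def close(self):
        self._pool.shutdown(wait=True)  # flush pending shadow calls
```

The key design choice is isolation: the user only ever sees `prod_response`, and a slow or crashing shadow model costs compute but not availability.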