MLOps Interview Questions — Model Serving, Monitoring, CI/CD | AmanAI Lab

mid

What is vLLM and why is it important for LLM serving?

Model Answer

vLLM is an open-source LLM serving framework built around PagedAttention, a memory-management technique inspired by OS virtual-memory paging that sharply improves GPU memory utilization for the KV-cache. Traditional serving reserves one contiguous KV-cache region per request, so variable-length sequences waste memory through fragmentation and over-reservation. PagedAttention stores the KV-cache in fixed-size, non-contiguous blocks allocated on demand, which lets more sequences fit on a GPU at once; the vLLM authors report up to 24x higher throughput than HuggingFace Transformers. It also supports continuous batching, tensor parallelism, and quantization (GPTQ, AWQ). It is widely used in production for cost-effective inference.
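The block-table idea can be illustrated with a toy allocator. This is a minimal sketch of the bookkeeping only, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block sizes are all hypothetical, and no actual key/value tensors are stored.

```python
class PagedKVCache:
    """Toy block-table allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())    # blocks need not be contiguous
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens occupy 2 blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens occupy 1 block
```

Because a sequence only ever holds the blocks it has filled (plus one partially filled block), wasted memory per sequence is bounded by a single block rather than by the maximum sequence length.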

mid

What is A/B testing for ML models and how do you implement it?

Model Answer

Split traffic between model A (control) and model B (experiment) to compare performance under real conditions. Implementation: split traffic at the API gateway (e.g., 10% to B, 90% to A); assign users consistently (hash user_id so the same user always gets the same variant) and track metrics per variant; use a power analysis to fix the sample size in advance and run until it is reached; then apply a hypothesis test (t-test for continuous metrics, chi-square for categorical ones). Pitfalls: novelty effect (users behave differently with new things), leakage between variants, network effects.
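A minimal sketch of the two mechanical pieces, sticky assignment and the significance test, using only the standard library. The function names and the 10% default are illustrative assumptions. The test shown is a two-proportion z-test, which for a binary metric is equivalent to the chi-square test on a 2x2 table (z squared equals the chi-square statistic).

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.10) -> str:
    """Deterministically map a user to 'A' or 'B' so assignment is sticky across requests."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # first 32 bits of the hash, scaled to [0, 1)
    return "B" if bucket < treatment_share else "A"


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value
```

Salting the hash with the experiment name keeps assignments independent across experiments, so a user who lands in B for one test is not systematically in B for the next.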

mid

What is the difference between batch inference and online inference?

Model Answer

Batch inference: process a large dataset offline and store the results for later use. Pros: maximizes throughput, allows large batch sizes, is cost-effective, and GPU utilization can hit 90%+. Cons: results arrive hours or days later, not in real time. Use for: weekly recommendation updates, offline evaluation, data pipelines. Online inference: real-time response to individual requests, typically under a latency budget of tens to hundreds of milliseconds. Pros: immediate results. Cons: lower GPU utilization (capacity must be sized for peak load), higher cost per query, and the need to handle variable load. Use for: chatbots, search, content moderation. Streaming inference: a form of online inference in which an LLM's tokens are sent as they are generated; total generation time is unchanged, but perceived latency improves.

senior

How do you detect and handle model drift in production?

Model Answer

Data drift: input distribution changes (feature statistics, text topics). Concept drift: relationship between inputs and outputs changes. Detection methods: statistical tests (KS test, PSI — Population Stability Index), embedding drift (compare distributions in vector space), prediction drift (monitor output distribution), custom metrics (accuracy on labeled samples). Responses: automated retraining pipelines triggered by drift signals, shadow deployment of new model, A/B testing, canary deployment. Tools: Evidently AI, WhyLabs, Arize, Fiddler. Track: data quality metrics, model performance metrics, operational metrics.
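As one concrete detection method, PSI for a single numeric feature can be sketched in a few lines. This is a minimal version assuming quantile bins derived from the reference sample; the function name, the bin count, and the `eps` floor for empty bins are illustrative choices. A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, but thresholds should be tuned per feature.

```python
import math
from bisect import bisect_right


def psi(reference, current, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a current sample."""
    ref_sorted = sorted(reference)
    # Interior bin edges at reference quantiles, so each bin holds roughly equal reference mass.
    edges = [ref_sorted[int(len(ref_sorted) * i / bins)] for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1  # index of the bin containing v
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    ref_frac, cur_frac = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))
```

In a monitoring job, `reference` would typically be a sample from the training data and `current` a recent window of production inputs.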

senior

How do you cost-optimize an LLM service handling 10M requests/day?

Model Answer

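Top levers, roughly in order of impact: (1) Cache aggressively: prefix caching for shared system prompts (providers typically discount cached input tokens by about 50-90%) and exact-match response caching for FAQs (a cache lookup costs orders of magnitude less than a model call). (2) Route by complexity: send simple queries to a small, cheap model (e.g., a Haiku- or mini-class model) and complex ones to a frontier model (e.g., an Opus-class model). (3) Quantize self-hosted models to INT4 (AWQ/GPTQ): weights shrink about 4× versus FP16, which can raise throughput severalfold, usually with a small quality loss that you should verify on your own evals. (4) Use continuous batching (vLLM / TGI), which can raise throughput by an order of magnitude over serving one request at a time. (5) Stream responses and cancel generation when the client disconnects, so abandoned requests stop holding GPU memory and compute. (6) Right-size context: trim history aggressively and summarize old turns. (7) Monitor cost per request as a first-class metric and alert on anomalies.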
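Levers (1) and (2) can be combined in a thin layer in front of the models. The sketch below is illustrative only: `CachedRouter`, the word-count heuristic, and the in-memory dict are assumptions, and a real service would use a shared cache with expiry and a better complexity signal. Exact-match caching is only safe for responses that do not depend on user-specific context.

```python
import hashlib
from typing import Callable


class CachedRouter:
    """Exact-match response cache in front of a cheap/expensive model pair."""

    def __init__(self, small_model: Callable[[str], str], large_model: Callable[[str], str],
                 max_small_words: int = 30):
        self.small_model = small_model
        self.large_model = large_model
        self.max_small_words = max_small_words  # hypothetical complexity heuristic
        self.cache = {}                         # normalized-prompt hash -> response
        self.stats = {"cache_hits": 0, "small": 0, "large": 0}

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize case and whitespace so trivially different FAQs share one entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self.cache:
            self.stats["cache_hits"] += 1
            return self.cache[key]
        # Short prompts go to the small model, everything else to the large one.
        tier = "small" if len(prompt.split()) <= self.max_small_words else "large"
        model = self.small_model if tier == "small" else self.large_model
        self.stats[tier] += 1
        self.cache[key] = model(prompt)
        return self.cache[key]
```

The `stats` counters give the cache hit rate and the small/large split, which feed directly into the cost-per-request metric from lever (7).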

senior

What metrics do you track for an LLM in production?

Model Answer

Quality metrics: task-specific scores (e.g., BLEU, ROUGE, answer correctness), user satisfaction ratings, hallucination rate, safety filter triggers. Operational metrics: latency percentiles (P50, P95, P99) for TTFT (Time To First Token) and TPOT (Time Per Output Token), throughput (tokens/second), error rates, retry rates. Cost metrics: cost per query, GPU utilization, tokens per dollar. Business metrics: task completion rate, user engagement, escalation rate. Alert on: latency spikes, accuracy drops, unusual output distribution, cost anomalies.
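The latency metrics can be derived from per-token timestamps. A minimal sketch, assuming each request logs its arrival time and the emission time of every output token; the function names are illustrative and the percentile uses the simple nearest-rank definition.

```python
import math


def ttft_and_tpot(request_time: float, token_times: list) -> tuple:
    """TTFT is the delay to the first token; TPOT is the mean gap between later tokens."""
    ttft = token_times[0] - request_time
    if len(token_times) < 2:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One request: arrived at t=0.0 s, tokens emitted at these times.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25])
```

Tracking P95 and P99 rather than the mean matters because a small share of slow requests (long prompts, queueing under load) dominates user-visible latency.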

mid

What is the difference between ML training and inference infrastructure?

Model Answer

Training: high-memory GPUs (A100 80GB, H100), fast interconnect (NVLink, InfiniBand) for multi-GPU jobs, high storage I/O for training data, and synchronous batch processing; spot instances are acceptable if you checkpoint. Inference: optimized for latency and throughput, with quantized models (INT8, INT4), request batching (continuous batching in vLLM, TGI), KV-cache management, horizontal scaling, and SLA guarantees. For a heavily used model, cumulative inference cost over its lifetime often exceeds the one-time training cost, sometimes by an order of magnitude or more, so inference optimization is critical.
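To see why KV-cache management matters on the inference side, it helps to work through the memory arithmetic. The sketch below uses the standard formula for a decoder-only transformer; the configuration in the example (32 layers, 32 KV heads, head dimension 128, FP16) is an illustrative assumption roughly in the shape of a 7B-parameter model, not a figure from the text above.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """KV-cache size: keys and values (the factor of 2) for every layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Illustrative config (assumed): 32 layers, 32 KV heads, head_dim 128, FP16 values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(f"{per_token / 2**20:.2f} MiB per token, {per_request / 2**30:.2f} GiB per 4096-token request")
```

Under these assumptions, a few dozen concurrent long-context requests would need more KV-cache memory than the model weights themselves, which is why batching and cache management dominate inference capacity planning.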

mid

What is shadow deployment and when do you use it?

Model Answer

Shadow deployment runs a new model in parallel with the production model: every request is sent to both, but only the production model's response is returned to the user. The new model's outputs are logged for offline analysis. Use it to: validate a new model's quality on real production traffic without risk, compare latency and cost at realistic load, and build a labeled dataset (using the production response as a ground-truth proxy). Once you trust the shadow model, ramp it up via canary (5% → 25% → 100%). Shadowing roughly doubles your inference cost for the duration of the test, but it is the safest way to validate a model swap.
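The request path can be sketched as follows. This is a simplified, synchronous illustration with hypothetical names; in a real service the shadow call would normally run off the request path (for example on a background worker or queue) so that it cannot add user-facing latency.

```python
import time
from typing import Any, Callable


def serve_with_shadow(request: Any, prod_model: Callable, shadow_model: Callable, log: list) -> Any:
    """Return the production response; run the shadow model only to log its output."""
    start = time.perf_counter()
    prod_response = prod_model(request)
    record = {
        "request": request,
        "prod": prod_response,
        "prod_latency_s": time.perf_counter() - start,
    }
    try:
        start = time.perf_counter()
        record["shadow"] = shadow_model(request)
        record["shadow_latency_s"] = time.perf_counter() - start
    except Exception as exc:  # a shadow failure must never affect the user
        record["shadow_error"] = repr(exc)
    log.append(record)
    return prod_response
```

The logged records pair each production response with the shadow response for the same input, which is the dataset used for the offline quality, latency, and cost comparison.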
