Transformer Architecture Interview Questions | AmanAI Lab

mid

What is Grouped Query Attention (GQA) and why is it used in modern LLMs?

Model Answer

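GQA is a compromise between Multi-Head Attention (MHA) and Multi-Query Attention (MQA). In MHA: each head has its own Q, K, and V projections. In MQA: all query heads share a single K and V head. In GQA: query heads are split into groups, and each group shares one K/V head, e.g., 32 query heads with 8 KV heads (4 query heads per KV head). Benefit: the KV-cache shrinks by the ratio of query heads to KV heads (4× in that example), which matters because cache size and memory bandwidth are the main bottleneck during decoding; quality stays close to MHA. Used in Llama 2 70B, Llama 3, Mistral 7B, and Gemma 2. Typical configurations: 32 query heads with 8 KV heads (4:1, as in Mistral 7B) or 64 query heads with 8 KV heads (8:1, as in Llama 2 70B).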
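The head grouping can be made concrete with a few lines of Python. This is a minimal sketch, assuming query heads are assigned to KV heads in contiguous blocks (a common convention, but implementations may differ); the function names `kv_head_for` and `kv_cache_reduction` are hypothetical.

```python
def kv_head_for(query_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Index of the KV head that a given query head reads from.

    Assumes contiguous grouping: query heads 0..g-1 share KV head 0,
    heads g..2g-1 share KV head 1, and so on.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return query_head // group_size


def kv_cache_reduction(n_q_heads: int, n_kv_heads: int) -> float:
    """Factor by which the KV-cache shrinks relative to MHA."""
    return n_q_heads / n_kv_heads


# 32 query heads, 8 KV heads: each KV head serves 4 query heads.
gqa_mapping = [kv_head_for(h, 32, 8) for h in range(32)]
```

Setting `n_kv_heads` equal to `n_q_heads` recovers MHA, and setting it to 1 recovers MQA, so the same mapping covers all three variants.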

mid

What is the role of positional encoding in transformers?

Model Answer

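Transformers have no inherent notion of token order (unlike RNNs): self-attention treats its input as an unordered set. Positional encodings inject position information, either into the token embeddings or directly into the attention computation. Original paper (Vaswani et al., 2017): sinusoidal encoding using sin/cos at different frequencies, added to the embeddings; the authors hypothesized that it would let the model generalize to sequences longer than those seen in training, though in practice extrapolation is limited. Modern approaches: RoPE (Rotary Position Embedding, used in Llama) encodes relative positions by rotating query/key vectors by a position-dependent angle; ALiBi (Attention with Linear Biases, used in BLOOM) adds a linear bias to attention scores based on token distance. RoPE is now dominant in open LLMs: it captures relative position without extra parameters, and its context window can be extended with scaling methods such as position interpolation or YaRN.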
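The relative-position property of RoPE can be checked numerically. The sketch below is illustrative only: it assumes the interleaved pair layout (x0, x1), (x2, x3), ... and a base of 10000; real implementations may pair dimensions differently. The helper names `rope` and `dot` are hypothetical.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by the angle pos * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # lower pairs rotate faster
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2) at different absolute positions gives the same score.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

Because each pair is rotated by an angle proportional to its position, the query/key dot product depends only on the difference between the two positions, which is why `score_near` and `score_far` agree up to floating-point error.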

mid

Why are residual connections critical in transformers?

Model Answer

Residual connections (output = x + sublayer(x)) enable training of very deep networks by giving gradients a direct path back through the network; without them, gradients tend to vanish across dozens of layers. In transformers specifically: they let each layer learn an increment to the previous representation instead of a full re-encoding, which is empirically much easier to optimize. Pre-norm (x + sublayer(LayerNorm(x)): the sublayer input is normalized and the residual path is left untouched) is now standard because it stabilizes training of 100+ layer models; post-norm (LayerNorm(x + sublayer(x)), the original Vaswani et al. 2017 design) becomes unstable at scale without careful learning-rate warmup.
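The difference between the two normalization placements is easy to see in code. This is a minimal sketch on plain Python lists, assuming a LayerNorm without learnable gain or bias; `pre_norm_block` and `post_norm_block` are hypothetical names, and `sublayer` stands in for attention or the FFN.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, sublayer):
    """x + sublayer(LayerNorm(x)): the residual path carries x unchanged."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x, sublayer):
    """LayerNorm(x + sublayer(x)): the residual sum itself is normalized."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def zero_sublayer(h):
    """A sublayer that contributes nothing, to expose the residual path."""
    return [0.0] * len(h)


x = [1.0, 2.0, 4.0]
pre_out = pre_norm_block(x, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer)
```

With a sublayer that outputs zeros, the pre-norm block is an exact identity, while the post-norm block still rescales its input. That untouched identity path is the usual intuition for why pre-norm is easier to train at depth.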

mid

What is KV-cache and why is it critical for LLM inference performance?

Model Answer

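During autoregressive generation, each new token attends to all previous tokens. Without caching: the whole prefix is re-run at every step, so K and V for each earlier token are recomputed again and again: O(n²) K/V computations in total for n generated tokens. KV-cache: store the K and V vectors already computed for previous tokens and compute them only for the new token: O(n) K/V computations in total (each step still attends over the whole cache, so per-step attention cost grows linearly with sequence length). Memory cost: 2 (K and V) × batch_size × sequence_length × n_layers × n_kv_heads × d_head × bytes per element. At scale: a Llama 2 70B-style model (80 layers, 8 KV heads, d_head = 128, FP16) needs about 0.33 MB per token, so 10K context at batch size 64 is roughly 210 GB for the KV-cache alone, more than the ~140 GB of FP16 weights. Optimizations: PagedAttention (vLLM), KV-cache quantization (INT8/INT4), prefix caching (reuse of shared prompt prefixes), sliding window attention.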
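Both the work saved and the memory spent can be estimated with a few lines of arithmetic. A minimal sketch, assuming a Llama 2 70B-style configuration (80 layers, 8 KV heads, head dimension 128, FP16) and counting both K and V across every KV head; the function names are hypothetical.

```python
def kv_projections(n_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V computations needed to generate n_tokens."""
    computed = 0
    for step in range(1, n_tokens + 1):
        # With a cache only the newest token is projected;
        # without one, all `step` tokens so far are recomputed.
        computed += 1 if use_cache else step
    return computed


def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, d_head,
                   bytes_per_elem=2):
    """KV-cache size in bytes; the leading 2 accounts for K and V."""
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * d_head * bytes_per_elem)


# Assumed configuration: 80 layers, 8 KV heads, d_head = 128, FP16.
per_token = kv_cache_bytes(1, 1, 80, 8, 128)
total = kv_cache_bytes(64, 10_000, 80, 8, 128)

print(kv_projections(100, use_cache=False), kv_projections(100, use_cache=True))
print(f"{per_token / 1e6:.2f} MB per token, {total / 1e9:.1f} GB total")
```

Under these assumptions the cache costs about 0.33 MB per token and roughly 210 GB at 10K context with batch size 64. A model without GQA (64 KV heads instead of 8) would need 8× more.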

senior

Explain Flash Attention and why it matters for training large models.

Model Answer

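Flash Attention rewrites the attention operation to avoid materializing the full N×N attention matrix in GPU HBM (high-bandwidth memory). Instead, it computes attention in tiles that fit in SRAM (fast on-chip memory) and combines them with an online softmax, so the result is exact rather than an approximation. Why it helps: standard attention is memory-bound (limited by HBM reads and writes of the N×N matrix) rather than compute-bound, so cutting memory I/O gives a real speedup even though the FLOP count does not drop. For training, the backward pass recomputes attention tiles instead of storing them, which reduces the extra memory for attention from O(N²) to O(N) in sequence length. Typically 2-4x faster than standard attention, and it enables longer context windows. Flash Attention 2 improved further with better parallelism. Flash Attention 3 is optimized for Hopper GPUs (H100).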
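The tiling idea rests on the online softmax recurrence, which can be shown for a single query in pure Python. This is an illustrative sketch of the recurrence only, not the GPU kernel: it omits SRAM management, batching, and the backward pass, and the function names and `block_size` parameter are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: build all scores, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d_v)]


def tiled_attention(q, keys, values, block_size=2):
    """Process keys/values block by block, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, -0.2, 0.3, 0.4]
keys = [[0.5, 0.1, -0.3, 0.2], [1.0, -1.0, 0.0, 0.5], [0.0, 0.3, 0.3, -0.7],
        [-0.4, 0.9, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 3.0, 0.5],
          [-2.0, 0.5, 0.0], [0.1, 0.2, 0.3]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
```

The tiled version never holds more than one block of scores at a time, yet it matches the reference up to floating-point error. Scaled up, that is what allows the N×N matrix to stay out of HBM.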

fresher

What is the purpose of the feedforward network in each transformer layer?

Model Answer

The FFN in each transformer layer (also called the MLP) is applied to each token position independently; it adds non-linearity and provides additional model capacity. Architecture: two linear projections with a non-linear activation in between: FFN(x) = W2·Activation(W1·x + b1) + b2. The expansion dimension is typically 4× the model dimension (e.g., d_model=1024, d_ff=4096). GELU is standard in BERT- and GPT-style models; many recent LLMs (e.g., Llama, PaLM) use a gated variant such as SwiGLU instead. The FFN is believed to act as a "key-value memory" that stores factual knowledge. Some interpretability research suggests that the FFN layers hold more world knowledge than the attention layers.
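The two-projection structure and its parameter count can be sketched in pure Python. This minimal example assumes a column-vector convention (weights stored as lists of rows, so the output is W2 applied to the activated hidden vector), the exact erf-based GELU, and biases on both projections; all function names are hypothetical.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(W, b, x):
    """Compute W·x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def ffn(x, W1, b1, W2, b2):
    """Expand to d_ff, apply the activation, project back to d_model."""
    hidden = [gelu(h) for h in linear(W1, b1, x)]
    return linear(W2, b2, hidden)


def ffn_param_count(d_model: int, d_ff: int) -> int:
    """Two weight matrices plus two bias vectors."""
    return 2 * d_model * d_ff + d_ff + d_model


# Tiny example: d_model = 2, d_ff = 3.
W1 = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
b2 = [0.5, -0.5]
y = ffn([1.0, 2.0], W1, b1, W2, b2)

params = ffn_param_count(1024, 4096)
```

For d_model=1024 and d_ff=4096 the count comes to about 8.4M parameters per layer, which is twice the roughly 4.2M weights in the four d_model × d_model attention projections of a standard layer.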

senior

What is Mixture of Experts (MoE) and how does it enable efficient scaling?

Model Answer

MoE replaces dense FFN layers with a set of N "expert" FFNs, where only K experts (e.g., 2 of 8 in Mixtral) are activated per token, chosen by a gating (router) network. Benefits: the model has far more total parameters, but compute per token stays comparable to a much smaller dense model (sparse activation). Mixtral 8x7B has 46.7B total params but only activates 12.9B per token. GPT-4 is rumored to use MoE, but this has not been officially confirmed. Challenges: expert load balancing (an auxiliary loss is added to keep the router from collapsing onto a few experts), communication overhead in distributed settings, training instability, and memory (all experts must stay loaded even though only a few are active). Switch Transformer (top-1 routing) showed that MoE scales better than dense models with equal compute.
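Top-k routing for a single token can be sketched as follows. This is a simplified illustration, assuming the gate weights come from a softmax over only the selected experts' logits (one common choice; other schemes exist) and omitting the load-balancing loss; `top_k_route`, `moe_forward`, and the toy experts are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: renormalize over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(x)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Eight toy "experts": expert i just multiplies its input by i.
experts = [lambda x, s=i: [s * v for v in x] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 3.0, 0.0, 0.5, -0.3, 1.0]

routes = top_k_route(router_logits, k=2)
output = moe_forward([1.0], router_logits, experts, k=2)
```

Only two of the eight expert functions are called for this token, which is the source of the compute savings. The other six still occupy memory.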

senior

What is KV-cache quantization and how much memory does it save?

Model Answer

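KV-cache (the stored K and V projections for already-processed tokens) dominates inference memory at long context and large batch: about 2.7 GB per sequence for Llama 3 70B at 8K context in FP16 (80 layers, 8 KV heads, d_head = 128), so a batch of 64 needs roughly 170 GB, more than the ~140 GB of FP16 weights. Quantizing the cache from FP16 to INT8 cuts memory roughly in half, typically with a perplexity hit under 1%; INT4 cuts it by about 4× with a larger hit, often quoted at around 2-3% (exact figures depend on the model and the quantization scheme). Tradeoff: smaller cache = larger batch size = much higher throughput, but quality drops slightly. Serving stacks such as vLLM, TensorRT-LLM, and llama.cpp support quantized caches (FP8, INT8, or lower-bit formats, depending on the framework). Critical for serving long-context models economically.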
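A toy version of cache quantization, together with the memory arithmetic, can be sketched as follows. It assumes symmetric absmax quantization with one scale per vector (production kernels use a variety of schemes and granularities) and a Llama 3 70B-style configuration of 80 layers, 8 KV heads, and head dimension 128; the function names are hypothetical.

```python
def quantize_int8(vec):
    """Symmetric absmax quantization of one K or V vector to [-127, 127]."""
    scale = max(abs(v) for v in vec) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in vec]
    return q, scale


def dequantize_int8(q, scale):
    return [qi * scale for qi in q]


def cache_bytes(seq_len, bytes_per_elem, n_layers=80, n_kv_heads=8, d_head=128):
    """Per-sequence cache size; the leading 2 accounts for K and V.

    Ignores the small overhead of storing the quantization scales.
    """
    return int(2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per_elem)


vec = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(vec)
restored = dequantize_int8(q, scale)

fp16 = cache_bytes(8192, 2)
int8 = cache_bytes(8192, 1)
int4 = cache_bytes(8192, 0.5)
print(f"FP16 {fp16 / 1e9:.2f} GB, INT8 {int8 / 1e9:.2f} GB, INT4 {int4 / 1e9:.2f} GB")
```

The round-trip error of each element is bounded by half the scale, which is why vectors with a few large outliers quantize poorly and why finer-grained scales (per channel or per group) are often used in practice.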
