Scaled Dot-Product Attention (Hard) | AI Code Lab

Scaled Dot-Product AttentionHard

00:00

Python idle

Scaled Dot-Product Attention

The core operation of every Transformer. Computes **weighted sum of values** based on query-key similarity.

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

1. Compute attention scores: Q × Kᵀ

2. Scale by √d_k (prevents vanishing softmax gradients)

3. Apply softmax row-wise → attention weights

4. Multiply by V → weighted context

Q = K = [[1,0],[0,1]]
V = [[1,2],[3,4]]
d_k = 2
→ Output ≈ [[1.66082, 2.66082], [2.33918, 3.33918]]

Round to **5 decimal places**.