The core operation of every Transformer. Computes **weighted sum of values** based on query-key similarity.
Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V1. Compute attention scores: Q × Kᵀ
2. Scale by √d_k (prevents vanishing softmax gradients)
3. Apply softmax row-wise → attention weights
4. Multiply by V → weighted context
Q = K = [[1,0],[0,1]] V = [[1,2],[3,4]] d_k = 2 → Output ≈ [[1.66082, 2.66082], [2.33918, 3.33918]]
Round to **5 decimal places**.
Test Cases (2 visible · 1 hidden)
Case 1: 2×2 identity Q,K
Input: attention([[1,0],[0,1]],[[1,0],[0,1]],[[1,2],[3,4]])
Expected: [[1.66082, 2.66082], [2.33918, 3.33918]]
Case 2: Single query
Input: attention([[1,0]],[[1,0],[0,1]],[[10,20],[30,40]])
Expected: [[16.60882, 26.60882]]
⌘↵ Run · ⌘⇧↵ Submit