Transformer Architecture

2D Card-Based Visualization

Encoder Stack (6 layers)

📥
Input Embedding
d_model = 512
Converts input tokens to dense vectors
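
A minimal PyTorch sketch of this card; vocab_size = 32000 is an assumed placeholder, and the sqrt(d_model) scaling follows the original paper.

  import math
  import torch
  import torch.nn as nn

  class InputEmbedding(nn.Module):
      # Maps token ids to dense d_model-dimensional vectors.
      def __init__(self, vocab_size: int = 32000, d_model: int = 512):
          super().__init__()
          self.d_model = d_model
          self.embed = nn.Embedding(vocab_size, d_model)

      def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
          # (batch, seq_len) token ids -> (batch, seq_len, d_model) vectors
          return self.embed(token_ids) * math.sqrt(self.d_model)
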
🌊
Positional Encoding
Sinusoidal
Adds position information using sine/cosine functions
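
A minimal sketch of the sinusoidal encoding this card refers to, using the sine/cosine formulation from the original paper; the resulting table is added to the token embeddings.

  import math
  import torch

  def sinusoidal_positional_encoding(seq_len: int, d_model: int = 512) -> torch.Tensor:
      # Returns a (seq_len, d_model) table: sine on even feature indices,
      # cosine on odd ones, with wavelengths growing geometrically toward 10000 * 2*pi.
      position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
      div_term = torch.exp(
          torch.arange(0, d_model, 2, dtype=torch.float32)
          * (-math.log(10000.0) / d_model)
      )
      pe = torch.zeros(seq_len, d_model)
      pe[:, 0::2] = torch.sin(position * div_term)
      pe[:, 1::2] = torch.cos(position * div_term)
      return pe

  # Usage: x = token_embeddings + sinusoidal_positional_encoding(seq_len)
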
👁️
Multi-Head Attention
Self-Attention
8 heads attending to all positions
⚖️
Add & Norm
Residual + LayerNorm
Residual connection and normalization
🔄
Feed Forward
2048 hidden dim
Two linear layers with ReLU
⚖️
Add & Norm
Residual + LayerNorm
Residual connection and normalization
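
Taken together, the four cards above form one encoder layer. A minimal PyTorch sketch of that sub-layer sequence in the post-norm order shown here, with dropout omitted for brevity; the class and parameter names are illustrative.

  import torch
  import torch.nn as nn

  class EncoderLayer(nn.Module):
      def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
          super().__init__()
          self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.norm1 = nn.LayerNorm(d_model)
          self.ffn = nn.Sequential(
              nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
          )
          self.norm2 = nn.LayerNorm(d_model)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          attn_out, _ = self.self_attn(x, x, x)   # every position attends to all positions
          x = self.norm1(x + attn_out)            # add & norm
          x = self.norm2(x + self.ffn(x))         # feed forward, then add & norm
          return x

  encoder_stack = nn.Sequential(*[EncoderLayer() for _ in range(6)])
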
The same Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm sequence of cards repeats for each of the 6 encoder layers.

Decoder Stack (6 layers)

📤
Output Embedding
d_model = 512
Embeds previously generated tokens
🌊
Positional Encoding
Sinusoidal
Adds position information to output embeddings
🎭
Masked Multi-Head Attention
Self-Attention
Prevents looking at future tokens
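
A minimal sketch of the causal mask this card describes; True marks the future positions a query is not allowed to attend to, and the mask can be passed as attn_mask to an attention module.

  import torch

  def causal_mask(seq_len: int) -> torch.Tensor:
      # (seq_len, seq_len) boolean mask: True above the diagonal = blocked future positions
      return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

  print(causal_mask(4))   # position i may only attend to positions 0..i
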
⚖️
Add & Norm
Residual + LayerNorm
Residual connection and normalization
🔗
Multi-Head Attention
Cross-Attention
Attends to encoder output
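
A minimal sketch of the cross-attention card: queries come from the decoder states, while keys and values come from the encoder output (the tensor shapes are illustrative).

  import torch
  import torch.nn as nn

  cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
  decoder_states = torch.randn(1, 7, 512)    # (batch, target_len, d_model)
  encoder_output = torch.randn(1, 10, 512)   # (batch, source_len, d_model)
  out, weights = cross_attn(query=decoder_states, key=encoder_output, value=encoder_output)
  print(out.shape)                           # torch.Size([1, 7, 512])
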
⚖️
Add & Norm
Residual + LayerNorm
Residual connection and normalization
🔄
Feed Forward
2048 hidden dim
Two linear layers with ReLU
⚖️
Add & Norm
Residual + LayerNorm
Residual connection and normalization
The same Masked Multi-Head Attention, Add & Norm, Cross-Attention, Add & Norm, Feed Forward, Add & Norm sequence of cards repeats for each of the 6 decoder layers.
🎯
Linear & Softmax
Vocabulary Size
Projects to vocabulary probabilities
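
A minimal sketch of this final step; vocab_size = 32000 is an assumed placeholder.

  import torch
  import torch.nn as nn

  d_model, vocab_size = 512, 32000
  to_vocab = nn.Linear(d_model, vocab_size)                # projection to vocabulary logits
  decoder_output = torch.randn(1, 7, d_model)              # (batch, target_len, d_model)
  probs = torch.softmax(to_vocab(decoder_output), dim=-1)  # (batch, target_len, vocab_size)
  print(probs.sum(dim=-1))                                 # each position sums to ~1.0
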

Multi-Head Attention

The core mechanism that allows the model to attend to information from different positions, and from different representation subspaces, simultaneously.

  • 8 attention heads in parallel
  • 64-dimensional head size
  • Scaled dot-product attention
  • Q, K, V linear projections
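
The bullets above boil down to scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, computed once per head on 64-dimensional projections and then concatenated. A minimal sketch of the per-head computation:

  import math
  import torch

  def scaled_dot_product_attention(q, k, v, mask=None):
      # q, k, v: (..., seq_len, d_k); returns a weighted sum of v per query position
      d_k = q.size(-1)
      scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
      if mask is not None:
          scores = scores.masked_fill(mask, float("-inf"))   # e.g. the decoder's causal mask
      return torch.softmax(scores, dim=-1) @ v

  q = k = v = torch.randn(2, 8, 10, 64)                  # (batch, heads, seq_len, head_dim)
  print(scaled_dot_product_attention(q, k, v).shape)     # torch.Size([2, 8, 10, 64])
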

Feed Forward Network

Two linear transformations with ReLU activation in between.

  • Expansion to 2048 dimensions
  • ReLU activation function
  • Projection back to 512 dimensions
  • Applied to each position separately
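
A minimal sketch of the position-wise feed-forward block described above; the same two linear layers are applied independently at every position.

  import torch
  import torch.nn as nn

  ffn = nn.Sequential(
      nn.Linear(512, 2048),   # expansion to 2048 dimensions
      nn.ReLU(),              # non-linearity
      nn.Linear(2048, 512),   # projection back to 512 dimensions
  )
  x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
  print(ffn(x).shape)           # torch.Size([2, 10, 512]): shape is preserved
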

Layer Normalization

Stabilizes training by normalizing inputs across features.

  • Applied after each sub-layer
  • Includes learnable parameters
  • Helps with gradient flow
  • Enables deeper networks
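
A minimal sketch using PyTorch's nn.LayerNorm: each position's 512 features are normalized to zero mean and unit variance, then scaled and shifted by learnable parameters.

  import torch
  import torch.nn as nn

  norm = nn.LayerNorm(512)
  x = torch.randn(2, 10, 512)                # (batch, seq_len, d_model)
  y = norm(x)
  print(y.mean(dim=-1).abs().max())          # ~0: normalized per position, across features
  print(norm.weight.shape, norm.bias.shape)  # learnable scale and shift, each of size 512
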

Residual Connections

Skip connections that help gradients flow through deep networks.

  • Add input to sub-layer output
  • Prevent vanishing gradients
  • Enable very deep models
  • Dropout applied before addition
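
A minimal sketch of the post-norm residual pattern described above, LayerNorm(x + Dropout(SubLayer(x))); the linear layer stands in for any sub-layer (attention or feed-forward).

  import torch
  import torch.nn as nn

  d_model = 512
  sublayer = nn.Linear(d_model, d_model)   # stand-in for an attention or feed-forward sub-layer
  dropout = nn.Dropout(p=0.1)
  norm = nn.LayerNorm(d_model)

  x = torch.randn(2, 10, d_model)
  out = norm(x + dropout(sublayer(x)))     # dropout on the sub-layer output, residual add, then norm
  print(out.shape)                         # torch.Size([2, 10, 512])
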