Encoder Stack
Positional Encoding (Sinusoidal)
Adds position information to the input embeddings using sine and cosine functions.
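As a concrete illustration, here is a minimal sketch of the sinusoidal encoding, assuming PyTorch (the max_len value and the dummy input are illustrative; d_model = 512 follows the dimensions listed later in this section):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) table of sine/cosine position encodings."""
    position = torch.arange(max_len).unsqueeze(1)          # (max_len, 1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / d_model)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)           # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)           # odd dimensions: cosine
    return pe

# The encoding is added to the input embeddings before the first encoder layer
embeddings = torch.randn(10, 512)                          # (seq_len, d_model), dummy data
x = embeddings + sinusoidal_positional_encoding(10, 512)
```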
Encoder Layer (repeated 6 times)
Each of the 6 identical encoder layers contains, in order:
- Multi-Head Self-Attention: 8 heads attending to all positions
- Add & Norm: residual connection followed by layer normalization
- Feed Forward: two linear layers with a ReLU activation and a 2048-dimensional hidden layer
- Add & Norm: residual connection followed by layer normalization
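Putting these sub-layers together, a minimal sketch of one encoder layer in PyTorch might look like the following (the use of nn.MultiheadAttention and the 0.1 dropout rate are assumptions, not values stated above):

```python
import torch
from torch import nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention -> Add & Norm -> feed-forward -> Add & Norm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and post-norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer with residual connection and post-norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

# Stack of 6 identical layers
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)
```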
Decoder Stack
Positional Encoding (Sinusoidal)
Adds position information to the output (target) embeddings, using the same sinusoidal scheme as the encoder.
Decoder Layer (repeated 6 times)
Each of the 6 identical decoder layers contains, in order:
- Masked Multi-Head Self-Attention: prevents positions from attending to future tokens
- Add & Norm: residual connection followed by layer normalization
- Multi-Head Cross-Attention: attends to the encoder output
- Add & Norm: residual connection followed by layer normalization
- Feed Forward: two linear layers with a ReLU activation and a 2048-dimensional hidden layer
- Add & Norm: residual connection followed by layer normalization
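The masking and cross-attention steps can be sketched as follows (PyTorch assumed; tensor sizes are illustrative). The causal mask places -inf above the diagonal so each position can only attend to itself and earlier positions, and cross-attention takes its queries from the decoder while keys and values come from the encoder output:

```python
import torch
from torch import nn

def causal_mask(size: int) -> torch.Tensor:
    """Mask with -inf strictly above the diagonal: position i may only
    attend to positions <= i."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

d_model, n_heads = 512, 8
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

tgt = torch.randn(2, 7, d_model)      # decoder input (batch, tgt_len, d_model)
memory = torch.randn(2, 10, d_model)  # encoder output (batch, src_len, d_model)

# Masked self-attention: queries, keys, values all come from the decoder input
y, _ = self_attn(tgt, tgt, tgt, attn_mask=causal_mask(tgt.size(1)))

# Cross-attention: queries from the decoder, keys/values from the encoder output
y, _ = cross_attn(y, memory, memory)
```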
Linear & Softmax
A final linear layer projects each decoder output vector to the vocabulary size, and a softmax converts the resulting logits into token probabilities.
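For example (PyTorch assumed; the vocabulary size of 32000 is a placeholder, not a value from the original):

```python
import torch
from torch import nn

d_model, vocab_size = 512, 32000      # vocab_size is an assumed placeholder
proj = nn.Linear(d_model, vocab_size)

decoder_out = torch.randn(2, 7, d_model)   # (batch, tgt_len, d_model)
logits = proj(decoder_out)                 # (batch, tgt_len, vocab_size)
probs = torch.softmax(logits, dim=-1)      # per-position token probabilities
```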
Multi-Head Attention
The core mechanism that allows the model to focus on different positions simultaneously.
- 8 attention heads in parallel
- 64-dimensional head size
- Scaled dot-product attention
- Q, K, V linear projections
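A minimal sketch of the scaled dot-product computation shared by all heads (PyTorch assumed; the Q, K, V projection layers and the final output projection are omitted for brevity):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores + mask                 # -inf entries block attention
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# 8 heads of size 64 (8 * 64 = 512 = d_model); Q, K, V would come from learned
# linear projections of the layer input
q = k = v = torch.randn(2, 8, 10, 64)          # (batch, heads, seq_len, d_head)
out = scaled_dot_product_attention(q, k, v)    # (2, 8, 10, 64)
```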
Feed Forward Network
Two linear transformations with a ReLU activation in between.
- Expansion to 2048 dimensions
- ReLU activation function
- Projection back to 512 dimensions
- Applied to each position separately
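In PyTorch terms (an assumed framework), the whole sub-layer is just two Linear layers around a ReLU; because the same weights are applied at every position, the sequence dimension passes through unchanged:

```python
import torch
from torch import nn

# Position-wise feed-forward: 512 -> 2048 -> ReLU -> 2048 -> 512,
# applied identically and independently at every position
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)

x = torch.randn(2, 10, 512)    # (batch, seq_len, d_model)
out = ffn(x)                   # same shape; each position transformed separately
```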
Layer Normalization
Stabilizes training by normalizing inputs across features.
- Applied after each sub-layer
- Includes learnable gain and bias parameters
- Helps with gradient flow
- Enables deeper networks
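A small sketch (PyTorch assumed) showing that the normalization works over the feature dimension of each position, and matches nn.LayerNorm with its default learnable gain and bias at initialization:

```python
import torch
from torch import nn

x = torch.randn(2, 10, 512)

# Manual layer norm over the feature dimension ...
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + 1e-5)        # gain=1, bias=0 here

# ... compared with nn.LayerNorm, which adds the learnable gain and bias
norm = nn.LayerNorm(512)
print(torch.allclose(manual, norm(x), atol=1e-5))   # should print True at init
```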
Residual Connections
Skip connections that help gradients flow through deep networks.
- Add input to sub-layer output
- Prevent vanishing gradients
- Enable very deep models
- Dropout applied to the sub-layer output before the addition
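As an illustration (PyTorch assumed; the class name SublayerConnection is just a convenient label, not from the original), the residual pattern wraps any sub-layer as norm(x + dropout(sublayer(x))):

```python
import torch
from torch import nn

class SublayerConnection(nn.Module):
    """Residual connection around any sub-layer, with dropout applied to the
    sub-layer output before the addition, followed by layer normalization."""

    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

# Wrapping the feed-forward sub-layer, for example
ff = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = SublayerConnection()
out = block(torch.randn(2, 10, 512), ff)   # (batch, seq_len, d_model)
```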