Architecture Overview
The Transformer architecture revolutionized NLP by relying entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This visualization shows the complete encoder-decoder architecture with all its components.
Key Dimensions
- Model dimension (d_model): 512
- Number of heads: 8
- Head dimension: 64 (d_model / number of heads; see the configuration sketch after this list)
- Feed-forward dimension: 2048
- Number of layers: 6
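These are the base-model hyperparameters from Vaswani et al. (2017); the per-head dimension is simply d_model split evenly across the heads. A small illustrative configuration (the TRANSFORMER_BASE name is hypothetical, not taken from the visualization's code):

```python
# Hypothetical config mirroring the base-model dimensions listed above.
TRANSFORMER_BASE = {
    "d_model": 512,    # model (embedding) dimension
    "num_heads": 8,    # parallel attention heads
    "d_head": 64,      # per-head dimension = d_model // num_heads
    "d_ff": 2048,      # inner feed-forward dimension
    "num_layers": 6,   # encoder layers (the decoder stack also has 6)
}

# The head dimension follows from splitting d_model evenly across the heads.
assert TRANSFORMER_BASE["d_model"] == TRANSFORMER_BASE["num_heads"] * TRANSFORMER_BASE["d_head"]
```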
Multi-Head Attention
The core innovation of the Transformer, allowing the model to jointly attend to information from different representation subspaces at different positions.
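Concretely, each head computes scaled dot-product attention on its own 64-dimensional slice of the 512-dimensional representation, and the per-head results are concatenated and projected back to d_model. A minimal NumPy sketch of that computation, with random weights standing in for learned projections (the function and variable names are illustrative, not taken from the visualization's source):

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=8):
    """Minimal multi-head self-attention over a (seq_len, d_model) input.
    w_q, w_k, w_v, w_o are (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then split the model dimension into (num_heads, d_head).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)         # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)      # softmax over keys
    heads = weights @ v                                           # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Usage with the dimensions above (random weights stand in for learned ones).
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 512))
w = [rng.standard_normal((512, 512)) * 0.02 for _ in range(4)]
out = multi_head_attention(x, *w)   # (10, 512)
```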
Attention Types
- Self-Attention: In encoder layers
- Masked Self-Attention: In decoder layers, preventing positions from attending to future tokens (see the mask sketch after this list)
- Cross-Attention: Decoder attending to encoder
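The three variants differ only in where the queries, keys, and values come from and in whether a causal mask is added to the attention scores. A minimal sketch of that additive mask, with the input sources summarized in comments (assumed, standard usage rather than code from the visualization):

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask for masked self-attention in the decoder:
    position i may attend to positions <= i, never to future tokens."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # add to scores before softmax

# The three attention types differ only in their inputs:
#   self-attention:        Q, K, V all come from the same encoder sequence
#   masked self-attention: Q, K, V from the decoder sequence, plus causal_mask on the scores
#   cross-attention:       Q from the decoder, K and V from the encoder output
print(causal_mask(4))
```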
Positional Encoding
Since the model contains no recurrence or convolution, positional encodings are added to the input embeddings to give the model information about the relative or absolute position of tokens in the sequence.
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
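Even dimensions use the sine form and odd dimensions the cosine form, so each dimension corresponds to a sinusoid of a different wavelength. A minimal NumPy sketch of building the full encoding table (the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model=512):
    """Build the (seq_len, d_model) sinusoidal table added to the token embeddings."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(50)   # (50, 512), added to the embeddings
```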
Layer Components
Sub-layers
- Multi-Head Attention: Parallel attention computations
- Feed Forward Network: Two linear transformations with a ReLU in between
- Layer Normalization: Applied after each sub-layer
- Residual Connections: Around each sub-layer (see the sketch after this list)
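Every sub-layer is wrapped in the same way: its output is added back to its input (the residual connection) and the sum is normalized, i.e. LayerNorm(x + Sublayer(x)). A minimal sketch of that post-norm pattern, with dropout and the learned LayerNorm scale/bias omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Post-norm wrapping used in the original architecture:
    add the sub-layer's output to its input, then normalize."""
    return layer_norm(x + sublayer(x))

# e.g. one encoder layer chains two wrapped sub-layers:
#   x = residual_sublayer(x, self_attention)
#   x = residual_sublayer(x, feed_forward)
```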
Feed Forward Network
Each encoder and decoder layer also contains a position-wise feed-forward network, applied identically at every position: two linear transformations with a ReLU in between, expanding from d_model = 512 to d_ff = 2048 and projecting back:
FFN(x) = max(0, xW1 + b1)W2 + b2
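A minimal NumPy sketch of that computation, with random weights standing in for the learned parameters W1, b1, W2, b2:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: 512 -> 2048 -> 512 with a ReLU."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # first linear layer + ReLU
    return hidden @ w2 + b2                 # second linear layer

# Random weights stand in for learned parameters.
rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((512, 2048)) * 0.02, np.zeros(2048)
w2, b2 = rng.standard_normal((2048, 512)) * 0.02, np.zeros(512)
y = feed_forward(rng.standard_normal((10, 512)), w1, b1, w2, b2)   # (10, 512)
```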
Interactive Features
Navigation
- Orbit: Left mouse button + drag
- Zoom: Mouse wheel or pinch
- Pan: Right mouse button + drag
Visualization Modes
- Full Architecture: Complete encoder-decoder
- Attention Animation: Watch attention flow
- Data Flow: Follow token processing
Implementation Notes
This visualization implements the architecture described in "Attention Is All You Need" (Vaswani et al., 2017). The 3D representation allows for better understanding of the parallel nature of multi-head attention and the flow of information through the network.
Visual Design
- Color coding follows TensorFlow conventions
- Opacity indicates information flow intensity
- Animations reveal computation sequence
- Interactive controls for exploration