Transformer Architecture in 3D

An Interactive Visualization of "Attention is All You Need"

Architecture Overview

The Transformer architecture revolutionized NLP by relying entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This visualization shows the complete encoder-decoder architecture with all its components.

Key Dimensions

  • Model dimension (d_model): 512
  • Number of heads: 8
  • Head dimension: 64
  • Feed-forward dimension: 2048
  • Number of layers: 6
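
As a minimal sketch, these dimensions can be gathered into a single configuration object. The `TransformerConfig` name and its fields below are illustrative choices, not code from the visualization itself:

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Base Transformer hyperparameters from Vaswani et al. (2017)."""
    d_model: int = 512     # model (embedding) dimension
    num_heads: int = 8     # attention heads per layer
    d_ff: int = 2048       # inner feed-forward dimension
    num_layers: int = 6    # encoder layers and decoder layers, each

    @property
    def d_head(self) -> int:
        # head dimension = d_model / num_heads = 64
        return self.d_model // self.num_heads
```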

Multi-Head Attention

Multi-head attention is the core innovation of the Transformer, allowing the model to jointly attend to information from different representation subspaces at different positions.

Attention(Q, K, V) = softmax(QK^T / √d_k) V
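
A minimal NumPy sketch of this formula (not the visualization's own code; the function and argument names are illustrative):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block masked positions
    weights = softmax(scores, axis=-1)               # attention distribution over keys
    return weights @ V, weights

# Example: 4 positions, d_k = 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # (4, 64) (4, 4)
```

In multi-head attention, this computation runs in parallel over 8 heads of dimension 64, and the concatenated head outputs are projected back to d_model = 512.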

Attention Types

  • Self-Attention: In encoder layers
  • Masked Self-Attention: In decoder layers, with a causal mask (see the sketch after this list)
  • Cross-Attention: Decoder attending to encoder
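
One way to build the causal mask used by masked self-attention, together with how each attention type would call the `scaled_dot_product_attention` sketch above; this pairing is illustrative, not the visualization's implementation:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Self-attention:        Q, K, V all come from the same (encoder) sequence.
# Masked self-attention: Q, K, V come from the decoder sequence, with causal_mask applied.
# Cross-attention:       Q comes from the decoder, K and V from the encoder output.
print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```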

Positional Encoding

Since the model contains no recurrence or convolution, positional encodings are added to give the model information about the relative or absolute position of tokens.

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
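
A small NumPy sketch of these sinusoids (illustrative, not the visualization's code):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)       # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512) -- added element-wise to the token embeddings
```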

Layer Components

Sub-layers

  • Multi-Head Attention: Parallel attention computations
  • Feed Forward Network: Two linear transformations with a ReLU in between
  • Layer Normalization: Applied after each sub-layer
  • Residual Connections: Around each sub-layer, combined with layer normalization as in the sketch after this list
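
The residual-plus-normalization pattern around each sub-layer (post-norm, as in the original paper: LayerNorm(x + Sublayer(x))) can be sketched as follows; `layer_norm` and `sublayer_with_residual` are illustrative names, not the visualization's code:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_with_residual(x: np.ndarray, sublayer) -> np.ndarray:
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(10, 512))
y = sublayer_with_residual(x, lambda h: 0.5 * h)   # toy stand-in for an attention or FFN sub-layer
print(y.shape)   # (10, 512)
```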

Feed Forward Network

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2
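
A direct NumPy reading of this formula with the dimensions listed above (d_model = 512, d_ff = 2048); the weights here are randomly initialized placeholders, not trained parameters:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x: np.ndarray) -> np.ndarray:
    """FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(10, d_model))   # 10 token positions
print(feed_forward(x).shape)         # (10, 512)
```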

Interactive Features

Navigation

  • Orbit: Left mouse button + drag
  • Zoom: Mouse wheel or pinch
  • Pan: Right mouse button + drag

Visualization Modes

  • Full Architecture: Complete encoder-decoder
  • Attention Animation: Watch attention flow
  • Data Flow: Follow token processing

Implementation Notes

This visualization implements the architecture described in "Attention is All You Need" (Vaswani et al., 2017). The 3D representation allows for better understanding of the parallel nature of multi-head attention and the flow of information through the network.

Visual Design

  • Color coding follows TensorFlow conventions
  • Opacity indicates information flow intensity
  • Animations reveal computation sequence
  • Interactive controls for exploration