The Transformer, Visualized
An interactive journey through the architecture that powers GPT, BERT, and modern AI. Explore attention mechanisms, embeddings, and more.
1. How Tokens Flow Through a Transformer
Watch as input tokens pass through multiple layers of the transformer, getting transformed at each step. Click play to animate the flow.
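If you prefer code to animation, here is a minimal NumPy sketch of the same idea. The tokens, sizes, and the `toy_layer` function are invented for illustration rather than taken from the demo; the point is simply that every layer takes a (tokens × d_model) matrix and returns another matrix of the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative sizes only): 4 tokens, d_model = 8, 3 layers.
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
d_model, n_layers = 8, 3
embedding_table = rng.normal(size=(len(vocab), d_model))

def toy_layer(x):
    """Stand-in for one transformer block: mix information across tokens, then transform each token."""
    mixed = x + x.mean(axis=0, keepdims=True)                 # crude stand-in for attention (token mixing)
    return mixed + np.tanh(mixed @ rng.normal(size=(d_model, d_model)) * 0.1)  # stand-in for the feed-forward step

tokens = ["the", "cat", "sat", "down"]
x = embedding_table[[vocab[t] for t in tokens]]               # (4, 8): one vector per token

for layer in range(n_layers):
    x = toy_layer(x)
    print(f"after layer {layer + 1}: shape {x.shape}")        # the shape stays (tokens, d_model) at every step
```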
2. The Attention Mechanism
The core innovation of transformers. Each token "attends" to all other tokens, learning which relationships matter most. Hover over cells to see attention weights.
How to read this:
- Rows = source token (the query, where attention comes from)
- Columns = target token (the key, where attention goes to)
- Brighter = stronger attention weight
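For the curious, here is roughly what each cell of the heatmap corresponds to: a minimal scaled dot-product attention sketch in NumPy. The tokens and the dimension d_k are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: 4 tokens with d_k = 8 (the demo's actual values may differ).
tokens = ["the", "cat", "sat", "down"]
d_k = 8
Q = rng.normal(size=(4, d_k))   # queries: one row per token doing the attending
K = rng.normal(size=(4, d_k))   # keys:    one row per token being attended to
V = rng.normal(size=(4, d_k))   # values:  the content that gets mixed together

scores = Q @ K.T / np.sqrt(d_k)                                           # (4, 4) similarity of each query with each key
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)     # softmax over each row
output = weights @ V                                                      # each token's output is a weighted mix of values

# weights[i, j] is the cell the heatmap shows: how strongly token i attends to token j.
print(np.round(weights, 2))
```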
Attention Parameters
Higher values spread attention more uniformly; lower values focus it on fewer tokens (see the sketch below)
Multi-head attention lets the model learn several different attention patterns in parallel
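One simple way to think about the temperature slider (and a common way such a control is implemented) is dividing the attention scores by the temperature before the softmax. The sketch below uses made-up scores for a single query token to show why higher values flatten the weights and lower values sharpen them.

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.5, -1.0])   # made-up attention scores for one query token

def softmax_with_temperature(s, temperature):
    # Dividing by the temperature before the softmax; larger temperature means flatter weights.
    z = s / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

print(np.round(softmax_with_temperature(scores, 0.3), 2))  # low temperature: weight piles onto one token
print(np.round(softmax_with_temperature(scores, 3.0), 2))  # high temperature: weights spread out almost uniformly
```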
Attention Heatmap
3. Exploring the Embedding Space
Words become vectors in high-dimensional space. Similar words cluster together. Drag to rotate, scroll to zoom, click points to explore relationships.
Semantic Similarity
"cat" and "dog" are close because they're both animals. Verbs like "sat" and "ran" form their own cluster.
Vector Arithmetic
Famous example: king - man + woman ≈ queen. Relationships are encoded as directions in space.
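Here is that arithmetic on tiny invented 2-D vectors, where one axis loosely stands for "royalty" and the other for "maleness". Real embeddings behave similarly, but only approximately.

```python
import numpy as np

# Tiny invented 2-D embeddings: dimension 0 ~ "royalty", dimension 1 ~ "maleness".
words = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
    "apple": np.array([0.1, 0.5]),
}

target = words["king"] - words["man"] + words["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Nearest neighbour among words not used in the arithmetic.
best = max((w for w in words if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(words[w], target))
print(best)   # -> queen
```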
Contextual Embeddings
Unlike static embeddings, transformers create context-dependent representations that change based on surrounding words.
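A rough sketch of what "context-dependent" means: the same static vector for "bank" is fed through one attention-style mixing pass in two different sentences and comes out different. The words, dimensions, and mixing step are all invented for illustration; a real transformer stacks many such passes with learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# Invented static vectors: "bank" starts out identical in both sentences.
emb = {w: rng.normal(size=d) for w in ["river", "savings", "bank", "account", "steep"]}

def contextualize(sentence):
    """One crude self-attention-style pass: each word's output mixes in its neighbours."""
    X = np.stack([emb[w] for w in sentence])
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X

out1 = contextualize(["steep", "river", "bank"])
out2 = contextualize(["savings", "bank", "account"])

# The static vector for "bank" was the same; the contextual ones are not.
print(np.round(out1[2], 2))   # "bank" next to a river
print(np.round(out2[1], 2))   # "bank" next to money
```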
4. Putting It All Together
Adjust the parameters below to see how they affect the model architecture. Each change updates the visualization in real-time.
Model Architecture
Deeper models can learn more complex patterns
Width of the hidden representations
Number of attention heads running in parallel
Computed Architecture
GPT-3 has 175 billion parameters across 96 layers with d_model = 12,288. That's over 1000x larger than the configuration above!
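You can sanity-check that number with a back-of-the-envelope estimate. A common approximation is about 12·d_model² weights per transformer block (attention plus the feed-forward network), plus the token-embedding matrix; biases, layer norms, and positional embeddings are ignored here, and the vocabulary size below is GPT-3's.

```python
def approx_params(n_layers, d_model, vocab_size=50257):
    # Rule of thumb: ~4*d^2 attention weights + ~8*d^2 feed-forward weights per block,
    # plus a vocab_size x d_model embedding matrix. Biases and layer norms are ignored.
    per_block = 12 * d_model ** 2
    return n_layers * per_block + vocab_size * d_model

print(f"{approx_params(96, 12288) / 1e9:.0f}B parameters")   # ~175B, matching GPT-3's reported size
```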
Want to dive deeper?
Check out our other interactive explorations and tutorials.