Interactive Exploration

The Transformer, Visualized

An interactive journey through the architecture that powers GPT, BERT, and modern AI. Explore attention mechanisms, embeddings, and more.

6 layers · 8 attention heads · d_model = 512

1. How Tokens Flow Through a Transformer

Watch as input tokens pass through multiple layers of the transformer, getting transformed at each step. Click play to animate the flow.

Input: "The cat sat on the mat"
Layer: 1/6
Input
The
cat
sat
on
the
mat
Embedding
h0
h1
h2
h3
h4
h5
Self-Attention
Feed Forward
Self-Attention
Feed Forward
Output
Output
The
cat
sat
on
the
mat
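
To make the flow concrete, here is a minimal NumPy sketch of that pipeline, under some simplifying assumptions: random stand-in weights instead of trained ones, a single attention head, and no positional encodings, layer norm, or dropout. It only shows the shape of the computation the animation depicts: embeddings go in, each layer applies self-attention followed by a feed-forward block, and contextual vectors come out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 512, 6, 6     # the demo's default configuration

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, Wq, Wk, Wv):
    # Single-head scaled dot-product attention (the multi-head version is sketched later).
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    weights = softmax(q @ k.T / np.sqrt(d_model))      # (seq_len, seq_len)
    return weights @ v

def feed_forward(h, W1, W2):
    # Position-wise FFN: expand to 4 * d_model, apply ReLU, project back down.
    return np.maximum(h @ W1, 0.0) @ W2

# Random stand-in weights for each layer; a trained model learns these.
layers = []
for _ in range(n_layers):
    layers.append({
        "Wq": rng.normal(0, 0.02, (d_model, d_model)),
        "Wk": rng.normal(0, 0.02, (d_model, d_model)),
        "Wv": rng.normal(0, 0.02, (d_model, d_model)),
        "W1": rng.normal(0, 0.02, (d_model, 4 * d_model)),
        "W2": rng.normal(0, 0.02, (4 * d_model, d_model)),
    })

h = rng.normal(size=(seq_len, d_model))    # h0..h5: one embedding per input token
for layer in layers:
    h = h + self_attention(h, layer["Wq"], layer["Wk"], layer["Wv"])  # residual add
    h = h + feed_forward(h, layer["W1"], layer["W2"])                 # residual add
print(h.shape)                             # (6, 512): a contextual vector per token
```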

2. The Attention Mechanism

The core innovation of transformers. Each token "attends" to all other tokens, learning which relationships matter most. Hover over cells to see attention weights.

How to read this:

  • Rows = the attending (query) token, i.e. where attention comes from
  • Columns = the attended-to (key) token, i.e. where attention goes
  • Brighter = stronger attention weight

Attention Parameters

  • Temperature: 1.00 (higher = more uniform attention, lower = more focused)
  • Attention heads: 8 (multi-head attention allows learning different patterns)
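
A quick sketch of what those two knobs control, assuming standard scaled dot-product attention with a softmax temperature applied to the scores (the random query and key vectors below are illustrative, not taken from a trained model):

```python
import numpy as np

def attention_weights(q, k, temperature=1.0):
    """Scaled dot-product scores -> softmax attention weights.

    Dividing the logits by a temperature > 1 flattens the distribution
    (more uniform attention); a temperature < 1 sharpens it (more focused).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / (np.sqrt(d_k) * temperature)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)           # each row sums to 1

tokens = ["The", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(42)
q = rng.normal(size=(len(tokens), 64))   # one query vector per token
k = rng.normal(size=(len(tokens), 64))   # one key vector per token

for t in (0.5, 1.0, 2.0):
    w = attention_weights(q, k, temperature=t)
    print(f"temperature={t}: max weight in row 0 = {w[0].max():.2f}")
```

Note that a true softmax normalizes each row of the weight matrix to sum to 1; interactive heatmaps often rescale the values for visual contrast, so read the colors below as relative strength rather than exact probabilities.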

Attention Heatmap (Layer 0, Head 0)

          The    cat    sat    on     the    mat
   The    0.92   0.37   0.37   0.37   0.36   0.42
   cat    0.28   0.89   0.43   0.42   0.41   0.38
   sat    0.22   0.35   0.88   0.47   0.45   0.42
   on     0.18   0.31   0.42   0.91   0.51   0.48
   the    0.16   0.28   0.39   0.49   0.95   0.54
   mat    0.20   0.25   0.36   0.46   0.55   1.00

Scale: low → high (brighter = stronger attention)
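
The "Head 0" label points at the multi-head part of the story: rather than one attention pattern, the model splits d_model = 512 into 8 heads of 64 dimensions each and computes a separate 6×6 weight matrix per head. A minimal sketch of that split, again with random projections standing in for learned ones:

```python
import numpy as np

d_model, n_heads, seq_len = 512, 8, 6
d_head = d_model // n_heads                        # 64, matching the demo

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))            # hidden states for the 6 tokens
Wq = rng.normal(0, 0.02, (d_model, d_model))       # learned projections in a real model
Wk = rng.normal(0, 0.02, (d_model, d_model))

# Project, then reshape so each head sees its own 64-dimensional slice.
q = (x @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (8, 6, 64)
k = (x @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (8, 6, 64)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)                 # (8, 6, 6)
scores -= scores.max(axis=-1, keepdims=True)                        # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(weights.shape)         # (8, 6, 6): one 6x6 attention map per head
print(weights[0].round(2))   # "Head 0": the kind of matrix shown in the heatmap above
```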

3. Exploring the Embedding Space

Words become vectors in high-dimensional space. Similar words cluster together. Drag to rotate, scroll to zoom, click points to explore relationships.

[Interactive 3D scatter plot: embeddings of "ran", "sat", "on", "The", "the", "mat", "cat", and "dog" projected onto X/Y/Z axes and colored by part of speech (article, noun, verb, preposition). Drag to rotate, scroll to zoom, click points to select.]
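
A real embedding has hundreds of dimensions (512 here), so a plot like this has to project the vectors down to three axes first. One common choice is PCA; below is a small sketch using NumPy's SVD on a toy embedding matrix (random placeholder vectors, not learned embeddings, so the coordinates only show the mechanics):

```python
import numpy as np

words = ["ran", "sat", "on", "The", "the", "mat", "cat", "dog"]
rng = np.random.default_rng(1)
E = rng.normal(size=(len(words), 512))   # toy embedding matrix, d_model = 512

# PCA via SVD: center the vectors, then keep the top three principal directions.
E_centered = E - E.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(E_centered, full_matrices=False)
coords_3d = E_centered @ Vt[:3].T        # (8, 3): X, Y, Z coordinates per word

for word, (x, y, z) in zip(words, coords_3d):
    print(f"{word:>4}: ({x:+.2f}, {y:+.2f}, {z:+.2f})")
```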

Semantic Similarity

"cat" and "dog" are close because they're both animals. Verbs like "sat" and "ran" form their own cluster.

Vector Arithmetic

Famous example: king - man + woman ≈ queen. Relationships are encoded as directions in space.
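
The arithmetic itself is just vector addition and subtraction followed by a nearest-neighbor search. A sketch with toy vectors (not real learned embeddings, so the numbers are illustrative only):

```python
import numpy as np

# Toy 3-d vectors arranged so the "gender" direction is roughly the last axis.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def nearest(vec, exclude=()):
    # Return the vocabulary word whose vector has the highest cosine similarity.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(vec, vocab[w]))

result = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))   # queen
```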

Contextual Embeddings

Unlike static embeddings, transformers create context-dependent representations that change based on surrounding words.
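
You can see this for yourself by comparing the same word in two contexts with a pretrained encoder. A sketch using the Hugging Face transformers library and bert-base-uncased (assumes transformers and torch are installed and downloads the model on first run; the exact similarity value will vary):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector for `word` (assumed to be a single wordpiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = inputs["input_ids"][0].tolist().index(word_id)
    return hidden[position]

river = word_vector("the river bank was muddy", "bank")
money = word_vector("she deposited cash at the bank", "bank")

# Same word, different contexts -> noticeably different vectors.
print(torch.nn.functional.cosine_similarity(river, money, dim=0).item())
```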

4. Putting It All Together

Adjust the parameters below to see how they affect the model architecture. Each change updates the visualization in real-time.

Model Architecture

  • Layers: 6 (deeper models can learn more complex patterns)
  • d_model: 512 (width of the hidden representations)
  • Attention heads: 8 (parallel attention mechanisms)

Computed Architecture

  • Total parameters: 6.3M
  • Head dimension: 64
  • FFN hidden size: 2048
  • Attention FLOPs: O(n²d)
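
The head dimension, FFN width, and attention cost follow directly from the three settings above (the total-parameter figure is the widget's own estimate and is not recomputed here). A quick sketch of the arithmetic:

```python
# How the derived numbers follow from the three architecture settings above.
n_layers, d_model, n_heads = 6, 512, 8

head_dim = d_model // n_heads          # 512 / 8 = 64
ffn_hidden = 4 * d_model               # 4 * 512 = 2048 (the conventional 4x expansion)

def attention_cost(seq_len, d=d_model):
    # Dominant self-attention term: every token scores every other token
    # across d dimensions, hence O(n^2 * d).
    return seq_len ** 2 * d

print(head_dim, ffn_hidden)                        # 64 2048
print(attention_cost(6), attention_cost(512))      # cost grows quadratically with length
```
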
💡 Did you know?

GPT-3 has 175 billion parameters across 96 layers with d_model = 12,288. That's more than 25,000x larger than the configuration above!

Want to dive deeper?

Check out our other interactive explorations and tutorials.