Interactive Exploration

The Transformer, Visualized

An interactive journey through the architecture that powers GPT, BERT, and modern AI. Explore attention mechanisms, embeddings, and more.

6 layers · 8 attention heads · d_model = 512

1. How Tokens Flow Through a Transformer

Watch as input tokens pass through multiple layers of the transformer, getting transformed at each step. Click play to animate the flow.

Input: "The cat sat on the mat"
Layer: 1/6
Input
The
cat
sat
on
the
mat
Embedding
h0
h1
h2
h3
h4
h5
Self-Attention
Feed Forward
Self-Attention
Feed Forward
Output
Output
The
cat
sat
on
the
mat
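
To make the flow concrete, here is a minimal NumPy sketch of that pipeline, under some simplifying assumptions: random stand-in weights instead of trained ones, a single attention head, and no positional encodings, layer norm, or dropout. It only shows the shape of the computation the animation depicts: embeddings go in, each layer applies self-attention followed by a feed-forward block, and contextual vectors come out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 512, 6, 6     # the demo's default configuration

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, Wq, Wk, Wv):
    # Single-head scaled dot-product attention (the multi-head version is sketched later).
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    weights = softmax(q @ k.T / np.sqrt(d_model))      # (seq_len, seq_len)
    return weights @ v

def feed_forward(h, W1, W2):
    # Position-wise FFN: expand to 4 * d_model, apply ReLU, project back down.
    return np.maximum(h @ W1, 0.0) @ W2

# Random stand-in weights for each layer; a trained model learns these.
layers = []
for _ in range(n_layers):
    layers.append({
        "Wq": rng.normal(0, 0.02, (d_model, d_model)),
        "Wk": rng.normal(0, 0.02, (d_model, d_model)),
        "Wv": rng.normal(0, 0.02, (d_model, d_model)),
        "W1": rng.normal(0, 0.02, (d_model, 4 * d_model)),
        "W2": rng.normal(0, 0.02, (4 * d_model, d_model)),
    })

h = rng.normal(size=(seq_len, d_model))    # h0..h5: one embedding per input token
for layer in layers:
    h = h + self_attention(h, layer["Wq"], layer["Wk"], layer["Wv"])  # residual add
    h = h + feed_forward(h, layer["W1"], layer["W2"])                 # residual add
print(h.shape)                             # (6, 512): a contextual vector per token
```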

2. The Attention Mechanism

The core innovation of transformers. Each token "attends" to all other tokens, learning which relationships matter most. Hover over cells to see attention weights.

How to read this:

  • Rows = the attending (query) token, i.e. where attention comes from
  • Columns = the attended-to (key) token, i.e. where attention goes
  • Brighter = stronger attention weight

Attention Parameters

  • Temperature: 1.00 (higher = more uniform attention, lower = more focused)
  • Attention heads: 8 (multi-head attention allows learning different patterns)
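
A quick sketch of what those two knobs control, assuming standard scaled dot-product attention with a softmax temperature applied to the scores (the random query and key vectors below are illustrative, not taken from a trained model):

```python
import numpy as np

def attention_weights(q, k, temperature=1.0):
    """Scaled dot-product scores -> softmax attention weights.

    Dividing the logits by a temperature > 1 flattens the distribution
    (more uniform attention); a temperature < 1 sharpens it (more focused).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / (np.sqrt(d_k) * temperature)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)           # each row sums to 1

tokens = ["The", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(42)
q = rng.normal(size=(len(tokens), 64))   # one query vector per token
k = rng.normal(size=(len(tokens), 64))   # one key vector per token

for t in (0.5, 1.0, 2.0):
    w = attention_weights(q, k, temperature=t)
    print(f"temperature={t}: max weight in row 0 = {w[0].max():.2f}")
```

Note that a true softmax normalizes each row of the weight matrix to sum to 1; interactive heatmaps often rescale the values for visual contrast, so read the colors below as relative strength rather than exact probabilities.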

Attention Heatmap (Layer 0, Head 0)

          The    cat    sat    on     the    mat
   The    0.92   0.37   0.37   0.37   0.36   0.42
   cat    0.28   0.89   0.43   0.42   0.41   0.38
   sat    0.22   0.35   0.88   0.47   0.45   0.42
   on     0.18   0.31   0.42   0.91   0.51   0.48
   the    0.16   0.28   0.39   0.49   0.95   0.54
   mat    0.20   0.25   0.36   0.46   0.55   1.00

Scale: low → high (brighter = stronger attention)
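
The "Head 0" label points at the multi-head part of the story: rather than one attention pattern, the model splits d_model = 512 into 8 heads of 64 dimensions each and computes a separate 6×6 weight matrix per head. A minimal sketch of that split, again with random projections standing in for learned ones:

```python
import numpy as np

d_model, n_heads, seq_len = 512, 8, 6
d_head = d_model // n_heads                        # 64, matching the demo

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))            # hidden states for the 6 tokens
Wq = rng.normal(0, 0.02, (d_model, d_model))       # learned projections in a real model
Wk = rng.normal(0, 0.02, (d_model, d_model))

# Project, then reshape so each head sees its own 64-dimensional slice.
q = (x @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (8, 6, 64)
k = (x @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (8, 6, 64)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)                 # (8, 6, 6)
scores -= scores.max(axis=-1, keepdims=True)                        # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(weights.shape)         # (8, 6, 6): one 6x6 attention map per head
print(weights[0].round(2))   # "Head 0": the kind of matrix shown in the heatmap above
```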

3. Exploring the Embedding Space

Words become vectors in high-dimensional space. Similar words cluster together. Drag to rotate, scroll to zoom, click points to explore relationships.

[Interactive 3D scatter plot: embeddings of "ran", "sat", "on", "The", "the", "mat", "cat", and "dog" projected onto X/Y/Z axes and colored by part of speech (article, noun, verb, preposition). Drag to rotate, scroll to zoom, click points to select.]
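
A real embedding has hundreds of dimensions (512 here), so a plot like this has to project the vectors down to three axes first. One common choice is PCA; below is a small sketch using NumPy's SVD on a toy embedding matrix (random placeholder vectors, not learned embeddings, so the coordinates only show the mechanics):

```python
import numpy as np

words = ["ran", "sat", "on", "The", "the", "mat", "cat", "dog"]
rng = np.random.default_rng(1)
E = rng.normal(size=(len(words), 512))   # toy embedding matrix, d_model = 512

# PCA via SVD: center the vectors, then keep the top three principal directions.
E_centered = E - E.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(E_centered, full_matrices=False)
coords_3d = E_centered @ Vt[:3].T        # (8, 3): X, Y, Z coordinates per word

for word, (x, y, z) in zip(words, coords_3d):
    print(f"{word:>4}: ({x:+.2f}, {y:+.2f}, {z:+.2f})")
```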

Semantic Similarity

"cat" and "dog" are close because they're both animals. Verbs like "sat" and "ran" form their own cluster.

Vector Arithmetic

Famous example: king - man + woman ≈ queen. Relationships are encoded as directions in space.
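
The arithmetic itself is just vector addition and subtraction followed by a nearest-neighbor search. A sketch with toy vectors (not real learned embeddings, so the numbers are illustrative only):

```python
import numpy as np

# Toy 3-d vectors arranged so the "gender" direction is roughly the last axis.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def nearest(vec, exclude=()):
    # Return the vocabulary word whose vector has the highest cosine similarity.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(vec, vocab[w]))

result = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))   # queen
```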

Contextual Embeddings

Unlike static embeddings, transformers create context-dependent representations that change based on surrounding words.
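
You can see this for yourself by comparing the same word in two contexts with a pretrained encoder. A sketch using the Hugging Face transformers library and bert-base-uncased (assumes transformers and torch are installed and downloads the model on first run; the exact similarity value will vary):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector for `word` (assumed to be a single wordpiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = inputs["input_ids"][0].tolist().index(word_id)
    return hidden[position]

river = word_vector("the river bank was muddy", "bank")
money = word_vector("she deposited cash at the bank", "bank")

# Same word, different contexts -> noticeably different vectors.
print(torch.nn.functional.cosine_similarity(river, money, dim=0).item())
```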

4. Putting It All Together

Adjust the parameters below to see how they affect the model architecture. Each change updates the visualization in real-time.

Model Architecture

  • Layers: 6 (deeper models can learn more complex patterns)
  • d_model: 512 (width of the hidden representations)
  • Attention heads: 8 (parallel attention mechanisms)

Computed Architecture

  • Total parameters: 6.3M
  • Head dimension: 64
  • FFN hidden size: 2048
  • Attention FLOPs: O(n²d)
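
The head dimension, FFN width, and attention cost follow directly from the three settings above (the total-parameter figure is the widget's own estimate and is not recomputed here). A quick sketch of the arithmetic:

```python
# How the derived numbers follow from the three architecture settings above.
n_layers, d_model, n_heads = 6, 512, 8

head_dim = d_model // n_heads          # 512 / 8 = 64
ffn_hidden = 4 * d_model               # 4 * 512 = 2048 (the conventional 4x expansion)

def attention_cost(seq_len, d=d_model):
    # Dominant self-attention term: every token scores every other token
    # across d dimensions, hence O(n^2 * d).
    return seq_len ** 2 * d

print(head_dim, ffn_hidden)                        # 64 2048
print(attention_cost(6), attention_cost(512))      # cost grows quadratically with length
```
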
💡 Did you know?

GPT-3 has 175 billion parameters across 96 layers with d_model = 12,288. That's more than 25,000x larger than the configuration above!

Want to dive deeper?

Check out our other interactive explorations and tutorials.