Tracing the thoughts of an LLM

Vinod
Generative AI Expert

How does a Large Language Model produce one token?
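
In one decoding step the model turns the prompt into a score (logit) for every token in its vocabulary, converts those scores into probabilities, and picks one token; the chosen token is appended to the prompt and the loop repeats. A minimal numpy sketch of that last step, with an invented five-token vocabulary and made-up logits:

```python
import numpy as np

# Toy vocabulary and invented logits the model might assign after the prompt
# "The Golden Gate ..." (all names and numbers here are made up).
vocab = ["Bridge", "Park", "retriever", "gate", "banana"]
logits = np.array([6.1, 2.3, 1.9, 0.4, -3.0])

def softmax(x, temperature=1.0):
    """Turn raw scores into a probability distribution over the vocabulary."""
    z = (x - x.max()) / temperature   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits, temperature=0.8)
for token, p in zip(vocab, probs):
    print(f"{token:10s} {p:.3f}")

# Sample one token from the distribution (greedy decoding would just take the
# argmax); the chosen token is appended and the whole loop runs again.
rng = np.random.default_rng(0)
print("next token:", rng.choice(vocab, p=probs))
```

Everything discussed in the rest of this deck is about what happens inside the step that produces those logits.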

[Figure: how an LLM works]

[Figure: LLM interpretability]

Language model neurons are polysemantic

But combinations of neurons can be interpretable
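
One way to make those combinations visible is sparse dictionary learning, the idea behind sparse autoencoders: approximate each activation vector as a sparse, non-negative mix of learned feature directions, so that each direction can be inspected and named on its own. The sketch below is a deliberately tiny stand-in: the dictionary is random and the activation is hand-built from two known features, whereas the real work learns the dictionary from millions of activations of an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented dictionary: 16 candidate feature directions in an 8-neuron space.
n_features, n_neurons = 16, 8
D = rng.normal(size=(n_features, n_neurons))
D /= np.linalg.norm(D, axis=1, keepdims=True)     # unit-norm feature directions

# An activation vector that is secretly a mix of just two features (1 and 4).
true_coeffs = np.zeros(n_features)
true_coeffs[1], true_coeffs[4] = 1.5, 0.8
activation = true_coeffs @ D

# Recover a sparse, non-negative explanation with projected gradient descent on
# a least-squares + L1 objective (a crude stand-in for a trained sparse encoder).
coeffs = np.zeros(n_features)
lr, l1 = 0.05, 0.01
for _ in range(3000):
    grad = (coeffs @ D - activation) @ D.T + l1   # reconstruction + sparsity
    coeffs = np.maximum(coeffs - lr * grad, 0.0)  # project onto coeffs >= 0

print("recovered feature activations:", np.round(coeffs, 2))
# The two planted features should come back strongly non-zero while the rest
# stay near zero: individual neurons mix concepts, but the sparse code separates them.
```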

Feature: Golden Gate Bridge

[Figure: Golden Gate Bridge feature example]

Influence on Behavior

[Figure: feature influence on behavior example]
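
The "Golden Gate Claude" demo worked by clamping that feature to a high value during generation, which pushed the model to talk about (and even identify as) the bridge. The toy below imitates only the shape of that intervention: add a multiple of a feature direction to an activation and watch the next-token distribution shift. The vectors, vocabulary, and coefficients are all invented; just the mechanics (add or subtract a feature direction, re-read the logits) mirror the real experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model = 16
vocab = ["bridge", "fog", "suspension", "dog", "piano"]

# Invented stand-ins for a residual-stream activation, an unembedding matrix,
# and a unit-norm "Golden Gate Bridge" feature direction. In the real experiment
# these come from the model itself and from a learned feature dictionary.
resid = rng.normal(size=d_model)
W_unembed = rng.normal(size=(d_model, len(vocab)))
gg_feature = rng.normal(size=d_model)
gg_feature /= np.linalg.norm(gg_feature)

# Make the toy unembedding genuinely associate the feature with bridge-y tokens.
W_unembed[:, vocab.index("bridge")] += 4.0 * gg_feature
W_unembed[:, vocab.index("suspension")] += 2.5 * gg_feature

def next_token_probs(residual):
    logits = residual @ W_unembed
    e = np.exp(logits - logits.max())
    return e / e.sum()

print("baseline:", dict(zip(vocab, np.round(next_token_probs(resid), 3))))

# "Clamp" the feature: add a large multiple of its direction to the residual
# stream (subtracting it instead would inhibit the concept).
steered = resid + 8.0 * gg_feature
print("steered: ", dict(zip(vocab, np.round(next_token_probs(steered), 3))))
# The steered distribution should shift sharply toward the bridge-related tokens.
```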

Abstract Features

  • Sycophantic praise
  • Secrecy
  • Code error
  • Bias
  • Deception
  • Power-seeking
  • Criminal
  • ...

LLM: What’s Misunderstood vs. What’s True

Misconception #1: LLMs Simply Predict the Next Word

Reality: LLMs Plan Ahead

Misconception #2: LLMs Process Different Languages Separately

Reality: LLMs Use Universal Concepts
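
Claude's internals are not publicly inspectable, but the underlying claim, that the same meaning in different languages maps to shared internal representations, is easy to observe in an open multilingual model. A rough illustration (it assumes the sentence-transformers package and the public paraphrase-multilingual-MiniLM-L12-v2 model; this is not Anthropic's method, just the same phenomenon in a much smaller model):

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

# A public multilingual encoder (not Claude) used only to illustrate the point:
# sentences that mean the same thing land close together regardless of language.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The opposite of small is big.",             # English
    "Le contraire de petit est grand.",          # French
    "小的反义词是大。",                            # Chinese
    "My dog chased the mail carrier yesterday."  # unrelated control sentence
]
embeddings = model.encode(sentences)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("English vs French :", round(cosine(embeddings[0], embeddings[1]), 3))
print("English vs Chinese:", round(cosine(embeddings[0], embeddings[2]), 3))
print("English vs control:", round(cosine(embeddings[0], embeddings[3]), 3))
# The translations should score far above the unrelated sentence: the model is
# tracking the shared concept, not the surface language.
```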

Misconception #3: LLM Reasoning Matches Its Explanations

Reality: An LLM’s Internal Process Differs from Its Explanations

Misconception #4: LLMs Just Memorize Answers

Reality: LLMs Use Multi-Step Reasoning
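
Anthropic's example prompt asks for the capital of the state containing Dallas: an internal "Texas" step activates before "Austin" is written, and forcing that intermediate step to "California" makes the model answer "Sacramento". The toy below contrasts a memorized lookup with a composed two-step answer and shows why that intervention result points to the second:

```python
# Toy contrast between "memorized answer" and "multi-step reasoning" for the
# question: what is the capital of the state containing Dallas?
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

# Hypothesis 1: one memorized question -> answer pair, no intermediate step.
MEMORIZED = {"capital of the state containing Dallas": "Austin"}

# Hypothesis 2: compose two facts through an intermediate "state" concept.
def answer_multistep(city, forced_state=None):
    state = CITY_TO_STATE[city]      # step 1: Dallas -> Texas
    if forced_state is not None:     # analogue of swapping the internal feature
        state = forced_state
    return STATE_TO_CAPITAL[state]   # step 2: Texas -> Austin

print(MEMORIZED["capital of the state containing Dallas"])       # Austin
print(answer_multistep("Dallas"))                                 # Austin
# Intervening on the intermediate step changes the answer, which is what the
# feature-swapping experiment observed in the real model:
print(answer_multistep("Dallas", forced_state="California"))      # Sacramento
```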

Misconception #5: Hallucinations and Jailbreaks Are Random Failures

Reality: They’re the Result of Specific, Understandable Mechanisms

How Researchers Proved These Findings

  • Attribution Graphs: Mapping computational pathways in Claude’s reasoning processes by grouping related neural features into interpretable steps.

  • Intervention Experiments: Measuring output changes when specific features were inhibited or activated (a toy version of this loop is sketched below).

  • Cross-layer Transcoders: Decomposing neural activity into sparse features to link concepts across model layers.

Read the full paper: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
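
Of the three methods, the intervention experiments are the simplest to sketch on their own: turn one feature off, re-run the forward pass, and measure how much the probability of the behavior of interest moves. Below is a toy version of that loop; the features, activations, and readout are random stand-ins, not anything taken from a real model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: a hidden activation built from feature directions, plus a readout
# standing in for "probability of the output we care about". Everything here is
# invented; in the real experiments the features come from cross-layer
# transcoders and the readout is the model's own next-token distribution.
d_model, n_features = 64, 10
features = rng.normal(size=(n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)
coeffs = rng.uniform(0.8, 1.2, size=n_features)   # how active each feature is
readout = 1.2 * features[3] + 0.7 * features[7]   # only features 3 and 7 matter

def target_prob(feature_activations):
    """Sigmoid readout standing in for P(target behavior | activation)."""
    activation = feature_activations @ features
    return 1.0 / (1.0 + np.exp(-(activation @ readout)))

baseline = target_prob(coeffs)

# Intervention loop: inhibit (zero out) one feature at a time, re-run, and
# record how far the target probability drops. Big drops mark causally
# important features.
effects = []
for i in range(n_features):
    ablated = coeffs.copy()
    ablated[i] = 0.0
    effects.append(baseline - target_prob(ablated))

print("baseline P(target) =", round(baseline, 3))
for i in np.argsort(effects)[::-1][:3]:
    print(f"inhibit feature {i}: effect {effects[i]:+.3f}")
# Features 3 and 7, the ones the readout actually depends on, should top the list.
```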

Final Thoughts

  • Interpretable features inside LLMs can be highly abstract, and activating or inhibiting them measurably changes model behavior.

  • LLMs can plan ahead, share concepts across languages, and sometimes give explanations that diverge from their actual reasoning.

  • Understanding how LLMs work internally is foundational for safety, trust, and impactful applications.

People understand very little about how LLMs actually work, so they still think LLMs are very different from us. But actually, it's very important for people to understand that they're very like us.

References

  1. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
  2. Mapping the Mind of a Large Language Model | Anthropic
  3. Tracing the thoughts of a large language model | Anthropic
  4. Mechanistic Interpretability: A Look Inside an AI's Mind + The Latest AI Research from Anthropic
  5. Mechanistic Interpretability explained | Chris Olah and Lex Fridman
  6. Inside the Mind of Claude: How Large Language Models Actually "Think"