Tracing the thoughts of an LLM

Vinod
Generative AI Expert

How does a Large Language Model produce one token?
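
In one decoding step the model turns the prompt into a score (logit) for every token in its vocabulary, converts those scores into probabilities, and picks one token; the chosen token is appended to the prompt and the loop repeats. A minimal numpy sketch of that last step, with an invented five-token vocabulary and made-up logits:

```python
import numpy as np

# Toy vocabulary and invented logits the model might assign after the prompt
# "The Golden Gate ..." (all names and numbers here are made up).
vocab = ["Bridge", "Park", "retriever", "gate", "banana"]
logits = np.array([6.1, 2.3, 1.9, 0.4, -3.0])

def softmax(x, temperature=1.0):
    """Turn raw scores into a probability distribution over the vocabulary."""
    z = (x - x.max()) / temperature   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits, temperature=0.8)
for token, p in zip(vocab, probs):
    print(f"{token:10s} {p:.3f}")

# Sample one token from the distribution (greedy decoding would just take the
# argmax); the chosen token is appended and the whole loop runs again.
rng = np.random.default_rng(0)
print("next token:", rng.choice(vocab, p=probs))
```

Everything discussed in the rest of this deck is about what happens inside the step that produces those logits.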

[Figure: how an LLM works]

[Figure: LLM interpretability]

Language model neurons are polysemantic

But combinations of neurons can be interpretable
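
One way to make those combinations visible is sparse dictionary learning, the idea behind sparse autoencoders: approximate each activation vector as a sparse, non-negative mix of learned feature directions, so that each direction can be inspected and named on its own. The sketch below is a deliberately tiny stand-in: the dictionary is random and the activation is hand-built from two known features, whereas the real work learns the dictionary from millions of activations of an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented dictionary: 16 candidate feature directions in an 8-neuron space.
n_features, n_neurons = 16, 8
D = rng.normal(size=(n_features, n_neurons))
D /= np.linalg.norm(D, axis=1, keepdims=True)     # unit-norm feature directions

# An activation vector that is secretly a mix of just two features (1 and 4).
true_coeffs = np.zeros(n_features)
true_coeffs[1], true_coeffs[4] = 1.5, 0.8
activation = true_coeffs @ D

# Recover a sparse, non-negative explanation with projected gradient descent on
# a least-squares + L1 objective (a crude stand-in for a trained sparse encoder).
coeffs = np.zeros(n_features)
lr, l1 = 0.05, 0.01
for _ in range(3000):
    grad = (coeffs @ D - activation) @ D.T + l1   # reconstruction + sparsity
    coeffs = np.maximum(coeffs - lr * grad, 0.0)  # project onto coeffs >= 0

print("recovered feature activations:", np.round(coeffs, 2))
# The two planted features should come back strongly non-zero while the rest
# stay near zero: individual neurons mix concepts, but the sparse code separates them.
```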

Feature: Golden Gate Bridge

[Figure: Golden Gate Bridge feature example]

Influence on Behavior

[Figure: feature influence on behavior example]
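
The "Golden Gate Claude" demo worked by clamping that feature to a high value during generation, which pushed the model to talk about (and even identify as) the bridge. The toy below imitates only the shape of that intervention: add a multiple of a feature direction to an activation and watch the next-token distribution shift. The vectors, vocabulary, and coefficients are all invented; just the mechanics (add or subtract a feature direction, re-read the logits) mirror the real experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model = 16
vocab = ["bridge", "fog", "suspension", "dog", "piano"]

# Invented stand-ins for a residual-stream activation, an unembedding matrix,
# and a unit-norm "Golden Gate Bridge" feature direction. In the real experiment
# these come from the model itself and from a learned feature dictionary.
resid = rng.normal(size=d_model)
W_unembed = rng.normal(size=(d_model, len(vocab)))
gg_feature = rng.normal(size=d_model)
gg_feature /= np.linalg.norm(gg_feature)

# Make the toy unembedding genuinely associate the feature with bridge-y tokens.
W_unembed[:, vocab.index("bridge")] += 4.0 * gg_feature
W_unembed[:, vocab.index("suspension")] += 2.5 * gg_feature

def next_token_probs(residual):
    logits = residual @ W_unembed
    e = np.exp(logits - logits.max())
    return e / e.sum()

print("baseline:", dict(zip(vocab, np.round(next_token_probs(resid), 3))))

# "Clamp" the feature: add a large multiple of its direction to the residual
# stream (subtracting it instead would inhibit the concept).
steered = resid + 8.0 * gg_feature
print("steered: ", dict(zip(vocab, np.round(next_token_probs(steered), 3))))
# The steered distribution should shift sharply toward the bridge-related tokens.
```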

Abstract Features

  • Sycophantic praise
  • Secrecy
  • Code error
  • Bias
  • Deception
  • Power-seeking
  • Criminal
  • ...

LLM: What’s Misunderstood vs. What’s True

Misconception #1: LLMs Simply Predict the Next Word

Reality: LLMs Plan Ahead

Misconception #2: LLMs Process Different Languages Separately

Reality: LLMs Use Universal Concepts
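
Claude's internals are not publicly inspectable, but the underlying claim, that the same meaning in different languages maps to shared internal representations, is easy to observe in an open multilingual model. A rough illustration (it assumes the sentence-transformers package and the public paraphrase-multilingual-MiniLM-L12-v2 model; this is not Anthropic's method, just the same phenomenon in a much smaller model):

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

# A public multilingual encoder (not Claude) used only to illustrate the point:
# sentences that mean the same thing land close together regardless of language.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The opposite of small is big.",             # English
    "Le contraire de petit est grand.",          # French
    "小的反义词是大。",                            # Chinese
    "My dog chased the mail carrier yesterday."  # unrelated control sentence
]
embeddings = model.encode(sentences)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("English vs French :", round(cosine(embeddings[0], embeddings[1]), 3))
print("English vs Chinese:", round(cosine(embeddings[0], embeddings[2]), 3))
print("English vs control:", round(cosine(embeddings[0], embeddings[3]), 3))
# The translations should score far above the unrelated sentence: the model is
# tracking the shared concept, not the surface language.
```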

Misconception #3: LLM Reasoning Matches Its Explanations

Reality: An LLM’s Internal Process Differs from Its Explanations

Misconception #4: LLMs Just Memorize Answers

Reality: LLMs Use Multi-Step Reasoning
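
Anthropic's example prompt asks for the capital of the state containing Dallas: an internal "Texas" step activates before "Austin" is written, and forcing that intermediate step to "California" makes the model answer "Sacramento". The toy below contrasts a memorized lookup with a composed two-step answer and shows why that intervention result points to the second:

```python
# Toy contrast between "memorized answer" and "multi-step reasoning" for the
# question: what is the capital of the state containing Dallas?
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

# Hypothesis 1: one memorized question -> answer pair, no intermediate step.
MEMORIZED = {"capital of the state containing Dallas": "Austin"}

# Hypothesis 2: compose two facts through an intermediate "state" concept.
def answer_multistep(city, forced_state=None):
    state = CITY_TO_STATE[city]      # step 1: Dallas -> Texas
    if forced_state is not None:     # analogue of swapping the internal feature
        state = forced_state
    return STATE_TO_CAPITAL[state]   # step 2: Texas -> Austin

print(MEMORIZED["capital of the state containing Dallas"])       # Austin
print(answer_multistep("Dallas"))                                 # Austin
# Intervening on the intermediate step changes the answer, which is what the
# feature-swapping experiment observed in the real model:
print(answer_multistep("Dallas", forced_state="California"))      # Sacramento
```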

Misconception #5: Hallucinations and Jailbreaks Are Random Failures

Reality: They’re the Result of Specific, Understandable Mechanisms

How Researchers Proved These Findings

  • Attribution Graphs: Mapping computational pathways in Claude’s reasoning processes by grouping related neural features into interpretable steps.

  • Intervention Experiments: Measuring output changes when specific features were inhibited or activated (a toy version of this loop is sketched below).

  • Cross-layer Transcoders: Decomposing neural activity into sparse features to link concepts across model layers.

Read the full paper: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
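
Of the three methods, the intervention experiments are the simplest to sketch on their own: turn one feature off, re-run the forward pass, and measure how much the probability of the behavior of interest moves. Below is a toy version of that loop; the features, activations, and readout are random stand-ins, not anything taken from a real model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: a hidden activation built from feature directions, plus a readout
# standing in for "probability of the output we care about". Everything here is
# invented; in the real experiments the features come from cross-layer
# transcoders and the readout is the model's own next-token distribution.
d_model, n_features = 64, 10
features = rng.normal(size=(n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)
coeffs = rng.uniform(0.8, 1.2, size=n_features)   # how active each feature is
readout = 1.2 * features[3] + 0.7 * features[7]   # only features 3 and 7 matter

def target_prob(feature_activations):
    """Sigmoid readout standing in for P(target behavior | activation)."""
    activation = feature_activations @ features
    return 1.0 / (1.0 + np.exp(-(activation @ readout)))

baseline = target_prob(coeffs)

# Intervention loop: inhibit (zero out) one feature at a time, re-run, and
# record how far the target probability drops. Big drops mark causally
# important features.
effects = []
for i in range(n_features):
    ablated = coeffs.copy()
    ablated[i] = 0.0
    effects.append(baseline - target_prob(ablated))

print("baseline P(target) =", round(baseline, 3))
for i in np.argsort(effects)[::-1][:3]:
    print(f"inhibit feature {i}: effect {effects[i]:+.3f}")
# Features 3 and 7, the ones the readout actually depends on, should top the list.
```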

Final Thoughts

  • Interpretable features inside LLMs can be highly abstract, and activating or inhibiting them measurably changes model behavior.

  • LLMs can plan ahead, share concepts across languages, and sometimes give explanations that diverge from their actual reasoning.

  • Understanding how LLMs work internally is foundational for safety, trust, and impactful applications.

People understand very little about how LLMs actually work, so they still think LLMs are very different from us. But actually, it's very important for people to understand that they're very like us.

References

  1. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
  2. Mapping the Mind of a Large Language Model | Anthropic
  3. Tracing the thoughts of a large language model | Anthropic
  4. Mechanistic Interpretability: A Look Inside an AI's Mind + The Latest AI Research from Anthropic
  5. Mechanistic Interpretability explained | Chris Olah and Lex Fridman
  6. Inside the Mind of Claude: How Large Language Models Actually "Think"