Vinod · Generative AI Expert
Attribution Graphs: Mapping computational pathways in Claude’s reasoning processes by grouping related neural features into interpretable steps.
Intervention Experiments: Measuring how the model’s outputs change when specific features are inhibited or activated (a toy sketch appears below the paper link).
Cross-layer Transcoders: Decomposing neural activity into sparse features to link concepts across model layers.
Read the full paper: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
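To make the intervention idea concrete, here is a minimal, self-contained sketch in PyTorch. Everything in it is a stand-in: the transcoder weights are random rather than trained, the layer sizes are toy values, and the readout layer is a hypothetical proxy for the rest of the model; the paper’s actual cross-layer transcoders are trained on Claude’s activations, and interventions are traced through attribution graphs. The point is only the mechanics: encode an activation into sparse features, inhibit one feature, and measure how the downstream output changes.

```python
# Minimal sketch of a feature-intervention experiment.
# Toy, untrained stand-ins only -- not the paper's actual cross-layer transcoders.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_features = 64, 256  # toy sizes; real models are far larger

# Toy "transcoder": encode a hidden activation into sparse (ReLU) features,
# then decode back to the residual stream. In the paper this is trained.
encoder = nn.Sequential(nn.Linear(d_model, n_features), nn.ReLU())
decoder = nn.Linear(n_features, d_model, bias=False)

# Hypothetical downstream readout standing in for the rest of the model.
readout = nn.Linear(d_model, 10)

activation = torch.randn(1, d_model)   # pretend hidden state at some layer
features = encoder(activation)         # sparse feature activations

# Baseline: decode the features and read out logits.
baseline_logits = readout(decoder(features))

# Intervention: inhibit (zero out) one feature and re-run the downstream pass.
feature_to_ablate = int(features.argmax())  # pick the most active feature
ablated = features.clone()
ablated[0, feature_to_ablate] = 0.0
intervened_logits = readout(decoder(ablated))

# "Measuring output changes": how far the logits moved under the intervention.
delta = (intervened_logits - baseline_logits).abs().max().item()
print(f"ablated feature {feature_to_ablate}, max logit change = {delta:.4f}")
```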
Interpretable features in LLMs are often highly abstract, and intervening on them can measurably steer the behavior of even very large models.
LLMs can plan ahead, navigate meaning across languages, and sometimes generate explanations that diverge from their actual reasoning.
Understanding how LLMs work is foundational for safety, trust, and impactful applications.
People still understand very little about how LLMs actually work, so they tend to assume these models are nothing like us. In fact, it’s important for people to recognize how much LLM behavior resembles our own.