
Mechanistic interpretability

Definition: Mechanistic interpretability seeks to understand a model's inner workings by identifying the circuits and representations that produce its behaviors.

The goal is to open the 'black box' so that what the model does can be explained, predicted, and corrected. It is an important research direction in AI safety, and one to which Anthropic actively contributes.
