What is mechanistic interpretability?

Question

Accepted Answer

Mechanistic interpretability seeks to understand a model's inner workings by identifying the circuits and representations that produce its behaviors. The goal is to open the "black box" in order to explain, predict, and correct what the model does. It is an important research direction for AI safety, and one to which Anthropic actively contributes.
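
To make "identifying the circuits that produce a behavior" a little more concrete, here is a minimal toy sketch. It is not taken from any real model or library: the two-unit network is hand-wired to compute XOR, and the names `forward` and `ablation_effects` are hypothetical, chosen only for this illustration. It uses ablation (zeroing one hidden unit and watching how the output changes), which is just one of several techniques used in this field.

```python
from itertools import product


def relu(x):
    return max(0.0, x)


def forward(x1, x2, ablate=None):
    """Hand-wired toy network computing XOR (an assumption for illustration).

    `ablate` names a hidden unit to zero out before the output is computed.
    """
    hidden = {
        "h1": relu(x1 + x2),        # active when at least one input is on
        "h2": relu(x1 + x2 - 1.0),  # active only when both inputs are on
    }
    if ablate is not None:
        hidden[ablate] = 0.0
    return hidden["h1"] - 2.0 * hidden["h2"]


def ablation_effects(unit):
    """Return, for each input pair, how much the output changes when `unit` is zeroed."""
    return {
        (x1, x2): forward(x1, x2, ablate=unit) - forward(x1, x2)
        for x1, x2 in product((0, 1), repeat=2)
    }


if __name__ == "__main__":
    for unit in ("h1", "h2"):
        print(unit, ablation_effects(unit))
```

In this toy, zeroing `h2` changes the output only for the input (1, 1), which suggests that `h2` is the part of the "circuit" that suppresses the output when both inputs are on. Real models are far larger and their units are rarely this clean, so treat this as an intuition aid rather than a description of how research is actually carried out.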

Mechanistic interpretability
