RLHF (reinforcement learning from human feedback)

Definition: RLHF trains a model from human preferences: humans rate answers, and the model learns to produce the preferred ones.

It's a key step in aligning assistants. Anthropic complements it with Constitutional AI.

See also

← Full AI glossary · AI news