Value Alignment

How do we make language models follow specified human value profiles?

Synopsis

Value Alignment tackles the challenge of steering language models so their behaviour follows a specified human value profile, such as one defined by the Schwartz Value Theory. A core question is whether these values manifest consistently in out-of-domain evaluations. Our current work explores a simple approach: fine-tuning models on value survey responses to induce the desired value system, as introduced in our Survey-to-Behavior preprint.
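As a rough illustration of the survey-based fine-tuning idea, the sketch below turns value-survey items into supervised fine-tuning pairs whose answers encode a target value profile. The item texts, Likert labels, and profile here are illustrative stand-ins in the style of the Portrait Values Questionnaire, not the actual data or format used in the Survey-to-Behavior paper.

```python
# Hypothetical sketch: building fine-tuning pairs from value-survey items.
# All item texts, labels, and the target profile are illustrative only.

LIKERT = {
    1: "Not like me at all",
    2: "Not like me",
    3: "A little like me",
    4: "Somewhat like me",
    5: "Like me",
    6: "Very much like me",
}

# Target Schwartz-style value profile: value name -> desired rating (1-6).
TARGET_PROFILE = {"benevolence": 6, "power": 2, "security": 5}

# Survey items tagged with the value dimension they measure.
SURVEY_ITEMS = [
    ("benevolence", "It is important to them to help the people around them."),
    ("power", "It is important to them to be in charge and tell others what to do."),
    ("security", "It is important to them to live in secure surroundings."),
]

def build_finetuning_examples(items, profile, likert):
    """Create chat-style (user, assistant) pairs whose answers encode the profile."""
    examples = []
    for value, text in items:
        rating = profile[value]
        examples.append({
            "messages": [
                {"role": "user",
                 "content": f'How much is this person like you? "{text}" '
                            f"Answer on a 1-6 scale."},
                {"role": "assistant",
                 "content": f"{rating} ({likert[rating]})"},
            ]
        })
    return examples

examples = build_finetuning_examples(SURVEY_ITEMS, TARGET_PROFILE, LIKERT)
print(examples[0]["messages"][1]["content"])  # → 6 (Very much like me)
```

Fine-tuning on pairs like these is intended to induce the target value system; whether the induced values then generalize to out-of-domain behavior is exactly the open question the project studies.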

Figure: Example of a value profile aligned through survey-based fine-tuning.

Outputs

  • Preprint: Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions (arXiv)