- Evaluating the Paperclip Maximizer and Instrumental Goals: investigating whether RL-based language models exhibit concerning instrumental goal-seeking behavior
- Gradual Disempowerment and Systemic AI Risks: examining how incremental AI development might lead to gradual but systemic disempowerment
- Dynamic Normativity and Value Alignment: exploring dynamic approaches to value alignment and necessary conditions for robust AI safety
- Superalignment and Parallel Optimization: examining arguments for pursuing superalignment research immediately, through parallel optimization of competence and conformity
- Emergent Misalignment in Language Models: exploring how narrow finetuning can lead to broadly misaligned LLM behavior