mAI alignment lab
Junior Research Group at University of Bonn focusing on AI alignment and safety issues
About
Welcome to the mAI alignment lab, a Junior Research Group at the University of Bonn led by Dr. Florian Mai.
Our research focuses on AI alignment and safety issues, exploring how to ensure that current and future advanced AI systems act reliably in accordance with human values.
Research Interests
- Scalable Oversight
- Value Alignment
- Emergent Misalignment
- Reasoning Models
- LLM Training
Current Projects
- Scalable Oversight by Learning to Decompose Tasks: Exploring how AI systems can learn to break complex tasks into manageable subtasks for reliable human oversight, advancing the frontier of superalignment research.
- Emergent Misalignment: Investigating how narrow finetuning can produce broadly misaligned language models and developing methods to prevent such misalignment.
- Value Alignment: Researching methods to ensure AI systems align with human values and preferences.
Join Us
We currently have no open positions. However, if you are interested in collaborating with our research group, please email Dr. Florian Mai at fmai@uni-bonn.de.
news
| Date | News |
|---|---|
| Jan 8, 2026 | Our paper “AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?” will be presented at IASEAI’26. Read the paper on arXiv. |
| Dec 14, 2025 | Our workshop paper “Pluralistic AI Alignment: A Cross-Cultural Pilot Survey” will be presented at the Second Workshop on Language Models for Underserved Communities (LM4UC). |
| Oct 27, 2025 | The AI alignment lab at Uni Bonn has started! Learn more on the course page. |
| Oct 13, 2025 | New preprint! Leonard Dung and Florian Mai analyze AI alignment strategies from a risk perspective and compare overlaps in failure modes across alignment techniques. Read the preprint on arXiv. |
| Aug 21, 2025 | Our JQL paper has been accepted at EMNLP 2025! Read the preprint on arXiv. |
| Aug 15, 2025 | New preprint! Survey-to-Behavior aligns language models with human values using survey questions. Check it out on arXiv. |
| Aug 8, 2025 | New preprint! We explore in-training defenses against emergent misalignment in language models. Check it out on arXiv. |
| May 31, 2025 | Our conference on AI risk has successfully concluded! The event featured insightful discussions, brilliant keynote speakers including Yoshua Bengio and Iason Gabriel, and engaging talks on critical AI safety topics. The conference received coverage from Belgian media, including De Standaard and De Tijd. Hope to see you next year! |
| May 28, 2025 | New preprint! We introduce JQL, a systematic approach for multilingual data curation that outperforms existing filtering methods. Check it out on arXiv. |
| Mar 30, 2025 | We’re giving a seminar course about the ethics of Artificial General Intelligence in the summer semester! The course covers AGI basics, alignment and value specification, control and autonomy, systemic risks, and global governance. Learn more and register here. |
| Mar 23, 2025 | Registrations are now open for the International Conference on Large-Scale AI Risks, 26-28 May 2025 in Leuven, Belgium. Dr. Florian Mai helped organize this event and we look forward to seeing you there! |
| Mar 20, 2025 | Dr. Florian Mai is participating in a panel discussion on trustworthy AI at the Deutsches Museum Bonn. |
| Mar 6, 2025 | Great news! Dr. Florian Mai and collaborators’ paper “Superalignment with Dynamic Human Values” was accepted at the BiAlign Workshop at ICLR 2025! |
| Feb 19, 2025 | The mAI alignment lab started an AI safety reading group at the University of Bonn, discussing recent papers on alignment and more! Interested in joining? Subscribe to our mailing list! |
| Jan 1, 2025 | The mAI alignment lab is founded! Dr. Florian Mai started as a Junior Research Group Leader at the University of Bonn as part of the CAISA lab headed by Prof. Lucie Flek. The lab’s research will focus on AI safety topics like value alignment, and on reasoning and planning approaches for LLMs. |
safety reading group
| Date | Topic |
|---|---|
| May 21, 2025 | Evaluating the Paperclip Maximizer and Instrumental Goals |
| May 7, 2025 | Gradual Disempowerment and Systemic AI Risks |
| Apr 9, 2025 | Dynamic Normativity and Value Alignment |