publications | mAI alignment lab

2026

IASEAI
AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?

Leonard Dung and Florian Mai

In IASEAI’26: International Association for Safe and Ethical AI Conference, Feb 2026

AI alignment research aims to develop techniques to ensure that AI systems do not cause harm. However, every alignment technique has failure modes, which are conditions in which there is a non-negligible chance that the technique fails to provide safety. As a strategy for risk mitigation, the AI safety community has increasingly adopted a defense-in-depth framework: Conceding that there is no single technique which guarantees safety, defense-in-depth consists in having multiple redundant protections against safety failure, such that safety can be maintained even if some protections fail. However, the success of defense-in-depth depends on how (un)correlated failure modes are across alignment techniques. For example, if all techniques had the exact same failure modes, the defense-in-depth approach would provide no additional protection at all. In this paper, we analyze 7 representative alignment techniques and 7 failure modes to understand the extent to which they overlap. We then discuss our results’ implications for understanding the current level of risk and how to prioritize AI alignment research in the future.
@inproceedings{dung2026alignmentrisk,
  title = {AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?},
  author = {Dung, Leonard and Mai, Florian},
  booktitle = {IASEAI'26: International Association for Safe and Ethical AI Conference},
  year = {2026},
  address = {Paris, France},
  month = feb,
  eprint = {2510.11235},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
}

2025

LM4UC
Pluralistic AI Alignment: A Cross-Cultural Pilot Survey

Khashayar Alavi, Lucie Flek, and Florian Mai

In Second Workshop on Language Models for Underserved Communities (LM4UC), Feb 2025

Large Language Models are used globally but are often aligned to primarily Western values. To better understand the need for pluralistic alignment methods, this paper presents a pilot survey that investigates how end users from diverse cultural contexts perceive the representation of their values in AIs, their demand for models better aligned to their own values, and what tradeoffs they would accept for better alignment. Our study reveals clear cross-cultural variation, strong interest in culturally aware assistants, higher marginalization fears in some groups, and wide willingness to trade slight accuracy losses for better alignment. Our findings provide a foundation for a more comprehensive global survey.
@inproceedings{alavi2025pluralistic,
  title = {Pluralistic {AI} Alignment: A Cross-Cultural Pilot Survey},
  author = {Alavi, Khashayar and Flek, Lucie and Mai, Florian},
  booktitle = {Second Workshop on Language Models for Underserved Communities (LM4UC)},
  year = {2025},
  url = {https://openreview.net/forum?id=A9oz6qFlQ4},
}
EMNLP
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Mehdi Ali, Manuel Brack, Max Lübbering, and 16 more authors

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
@inproceedings{ali2025jql,
  title = {Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models},
  author = {Ali, Mehdi and Brack, Manuel and L{\"u}bbering, Max and Wendt, Elias and Khan, Abbas Goher and Rutmann, Richard and Jude, Alex and Kraus, Maurice and Weber, Alexander Arno and Stollenwerk, Felix and Kacz{\'e}r, David and Mai, Florian and Flek, Lucie and Sifa, Rafet and Flores-Herr, Nicolas and K{\"o}hler, Joachim and Schramowski, Patrick and Fromm, Michael and Kersting, Kristian},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2025},
  eprint = {2505.22232},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}
BiAlign
Superalignment with Dynamic Human Values

Florian Mai, David Kaczér, Nicholas Kluge Corrêa, and 1 more author

In ICLR 2025 Workshop on Bidirectional Human-AI Alignment, 2025

Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate for the need to measure this generalization and propose ways to improve it in the future.
@inproceedings{mai2025superalignment,
  title = {Superalignment with Dynamic Human Values},
  author = {Mai, Florian and Kacz{\'e}r, David and Corr{\^e}a, Nicholas Kluge and Flek, Lucie},
  booktitle = {ICLR 2025 Workshop on Bidirectional Human-AI Alignment},
  year = {2025},
  url = {https://openreview.net/forum?id=WvB9hKKjSc},
  eprint = {2503.13621},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
}
arXiv
In-Training Defenses against Emergent Misalignment in Language Models

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, and 2 more authors

2025

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving of a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate the methods’ emergent misalignment effect across four malicious, EMA-inducing tasks. Second, we assess the methods’ impacts on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
@misc{kaczer2025intraining,
  title = {In-Training Defenses against Emergent Misalignment in Language Models},
  author = {Kacz{\'e}r, David and J{\o}rgenv{\aa}g, Magnus and Vetter, Clemens and Flek, Lucie and Mai, Florian},
  year = {2025},
  eprint = {2508.06249},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
}
arXiv
Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions

Shangrui Nie, Florian Mai, David Kaczér, and 3 more authors

2025

Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model’s value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of fine-tuning on the model’s behavior in two ways. First, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model’s behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model’s behavior in text-based adventure games. We demonstrate that our simple approach can not only change the model’s answers to in-domain survey questions, but also produce substantial shifts (value alignment) in implicit downstream task behavior.
@misc{nie2025surveybehavior,
  title = {Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions},
  author = {Nie, Shangrui and Mai, Florian and Kacz{\'e}r, David and Welch, Charles and Zhao, Zhixue and Flek, Lucie},
  year = {2025},
  eprint = {2508.11414},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}

2024

COLM
Learning to Plan for Language Modeling from Unlabeled Data

Nathan Cornille, Marie-Francine Moens, and Florian Mai

In First Conference on Language Modeling, 2024

By training to predict the next token in an unlabeled corpus, large language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require planning, such as writing a coherent article. In this paper, we train a module for planning the future writing process via a self-supervised learning objective. Given the textual context, this planning module learns to predict future abstract writing actions, which correspond to centroids in a clustered text embedding space. By conditioning on these actions, our model extends the successful language model formula to more abstract planning in an unsupervised way. Empirically, we demonstrate that our method improves language modeling performance in general, particularly with respect to the text structure. Because our framework uses a planner module that is unsupervised and external to the language model, new planner modules can be trained at large scale and easily be shared with the community.
@inproceedings{cornille2024learning,
  title = {Learning to Plan for Language Modeling from Unlabeled Data},
  author = {Cornille, Nathan and Moens, Marie-Francine and Mai, Florian},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=nT6fQIidrQ},
  eprint = {2404.00614},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}
WiNLP
Improving Language Modeling by Increasing Test-time Planning Compute

Florian Mai, Nathan Cornille, and Marie-Francine Moens

In Eighth Widening NLP Workshop (WiNLP 2024) Phase II, Feb 2024

Modern language models predict the next token in the sequence by considering the past text through a powerful function. However, language models have no explicit mechanism that allows them to spend computation time for planning long-distance future text, leading to a suboptimal token prediction. In this paper, we propose a planner that predicts a latent plan for many sentences into the future. By sampling multiple plans at once, we condition the language model on an accurate approximation of the distribution of text continuations, which leads to better next token prediction accuracy. In effect, this allows trading computation time for prediction accuracy.
@inproceedings{mai2024improving,
  title = {Improving Language Modeling by Increasing Test-time Planning Compute},
  author = {Mai, Florian and Cornille, Nathan and Moens, Marie-Francine},
  booktitle = {Eighth Widening NLP Workshop (WiNLP 2024) Phase II},
  year = {2024},
  url = {https://openreview.net/forum?id=S3yyjW9OSY},
}