Dara

Luc Raszewski

Mode Collapse Robust LLMs

What:

This project aims to reduce the susceptibility of LLMs to mode collapse, the narrowing of output diversity across semantic, stylistic and epistemic dimensions due to alignment post-training. We will develop new methods to detect when mode collapse occurs, introduce a new alignment approach that focuses on preserving diversity, and define a framework for balancing safety and diversity.

Why:

As more people rely on LLMs, the declining diversity of model outputs risks knowledge collapse: the information different users receive converges, eroding the depth and diversity of human thought. The problem compounds across successive LLM generations. Because LLMs shape the ideas of their users, they must reflect a diversity of opinion if quieter perspectives are not to be left behind.

How:

First, we will investigate whether the entropy of a model's hidden states can be used to monitor and diagnose mode collapse during training. Then, we will design a post-training method that adversarially identifies collapsing regions and targets them. Finally, we will quantify the trade-off between diversity and safety against existing benchmarks to find an informed balance.
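
As a rough illustration of the first step, the sketch below scores a batch of hidden-state vectors with one possible entropy proxy. The project text does not say which estimator will be used, so the diagonal-Gaussian estimate, the function names, and the `eps` variance floor are all assumptions made for illustration. The synthetic "diverse" and "collapsed" batches stand in for hidden states that a real model would produce.

```python
import math
import random
from statistics import pvariance


def diagonal_gaussian_entropy(hidden_states, eps=1e-8):
    """Entropy (in nats) of a diagonal Gaussian fitted to hidden states.

    hidden_states: list of equal-length vectors, one per sampled response.
    eps: hypothetical variance floor that keeps the logarithm finite.
    This is an assumed proxy, not the estimator named by the project.
    """
    if len(hidden_states) < 2:
        raise ValueError("need at least two hidden-state vectors")
    # Transpose so each tuple holds one dimension across the batch.
    columns = list(zip(*hidden_states))
    return sum(
        0.5 * math.log(2 * math.pi * math.e * (pvariance(col) + eps))
        for col in columns
    )


def sample_states(n, dim, scale, rng):
    """Synthetic stand-in for hidden states: n Gaussian vectors of size dim."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
diverse = sample_states(200, 16, 1.0, rng)     # widely spread states
collapsed = sample_states(200, 16, 0.05, rng)  # states squeezed near one point

entropy_diverse = diagonal_gaussian_entropy(diverse)
entropy_collapsed = diagonal_gaussian_entropy(collapsed)
print(entropy_diverse > entropy_collapsed)
```

Tracked across training checkpoints, a sustained drop in such a score could serve as one warning sign of collapse. Whether it actually tracks the semantic, stylistic and epistemic narrowing described above is the open question the project sets out to investigate.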
