Date: December 17, 2025
The Machine Learning Security Laboratory (MLSec) is hosting its next seminar!
Join the online event on “A mechanistic view of refusal in language models” with Andy Arditi from Northeastern University.

How to Register
Sign up via the official landing page.
Abstract
Refusal is currently the primary mechanism by which LLM developers prevent misuse: models are trained to decline harmful or inappropriate requests. We study this critical behavior through a mechanistic lens, asking how refusal is implemented internally, and we show that it is largely mediated by a single direction in activation space. This simple observation enables several applications, including “weight orthogonalization,” a cheap and effective method for disabling refusal guardrails in open-source models. It has also proven useful for building more efficient red-teaming and adversarial-training pipelines. In this talk, I will present these findings, discuss follow-up work by other groups, and outline the broader implications for open-source model development.
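For readers curious about the mechanics, the core operation behind weight orthogonalization is a simple linear-algebra step: each weight matrix that writes into the residual stream is projected so its output can no longer have a component along the refusal direction. The sketch below is illustrative only, not the speaker's implementation; the function name and tensor shapes are our assumptions, and we assume the refusal direction has already been extracted from the model's activations.

```python
import torch

def orthogonalize_weight(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction from a weight matrix that writes into the
    residual stream (each column of W is a residual-stream vector).

    W:           (d_model, d_in) tensor, e.g. an attention output projection
                 or an MLP down-projection.
    refusal_dir: (d_model,) vector identified as mediating refusal.
    """
    r = refusal_dir / refusal_dir.norm()   # unit-normalize the direction
    return W - torch.outer(r, r) @ W       # apply (I - r r^T) on the left
```

Because the edit is applied once, directly to the weights, refusal is disabled without any fine-tuning or inference-time intervention, which is what makes the method cheap.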
The Speaker: Andy Arditi
Andy is a PhD student at Northeastern University advised by David Bau. He researches mechanistic interpretability and the internal structure of large language models. His recent work focuses on applying mechanistic interpretability to practical problems, such as understanding model refusals, detecting hallucinations, and uncovering linear representations of persona in chat assistants. Before his PhD, he earned bachelor’s and master’s degrees from Columbia University and worked as a software engineer at Microsoft.
More About the MLSec Lab Series
The MLSec Laboratory is a research branch of the Pattern Recognition and Applications Laboratory (PRALab) at the University of Cagliari (Italy). Our research lies at the intersection of machine learning and computer security.