Workshop on Actionable Interpretability@ICML 2025

Actionable Interpretability

July 19 - ICML 2025 - Vancouver

The workshop on Actionable Interpretability@ICML 2025 aims to foster discussions on leveraging interpretability insights to drive tangible advancements in AI across diverse domains. We welcome contributions that move beyond theoretical analysis, demonstrating concrete improvements in model alignment, robustness, and real-world applications. Additionally, we seek to explore the challenges inherent in translating interpretability research into actionable impact.

Outstanding Papers

We are happy to announce that the Outstanding Paper Awards go to:

  • Detecting High-Stakes Interactions with Activation Probes by Alex McKenzie, Phil Blandfort, Urja Pawar, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov, and
  • Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations by Pedro Lobato Ferreira, Wilker Aziz, and Ivan Titov.

Congratulations to the authors for their exceptional work!

News

  • July 03 2025: Schedule for the workshop published!
  • July 03 2025: Information for authors added.
  • June 19 2025: Acceptance notifications have been sent out!
  • May 12 2025: The workshop is scheduled for July 19!
  • May 12 2025: Clarification: Submissions to the conference track may include the camera-ready version of the accepted paper (up to 9 pages; it does not need to be anonymized).
  • May 3 2025: The submission deadline has been extended to May 19.
  • April 15 2025: Our submissions page on OpenReview is open!
  • March 31 2025: Call for papers published!
  • March 19 2025: Our Workshop was accepted to ICML!

Information for Authors

Since the AIW workshop is non-archival, there is no need to submit a camera-ready version. For the same reason, papers and reviews on OpenReview will not be made public. We will list all accepted papers’ titles and authors (alongside your poster PDFs and optional video recordings) on our website, but we will not link the paper PDFs. If you would like your paper to be public, we recommend hosting it on your personal website or on arXiv.

Important Dates

May 19 - Submissions due (extended)

June 19 - Acceptance notification

July 19 - Workshop day

(All dates are Anywhere On Earth.)

Schedule

From    Until   Event
08:00   09:00   Poster Setup 1
09:00   09:10   Opening Remarks
09:10   09:40   Keynote - Been Kim - Agentic Interpretability and Neologism: what LLMs can offer us
09:40   10:10   Keynote - Sarah Schwettmann - AI Investigators for Understanding AI Systems
10:10   10:25   Talk - Detecting High-Stakes Interactions with Activation Probes
10:25   10:40   Talk - Actionable Interpretability with NDIF and NNsight
10:40   11:40   Poster Session 1
11:40   13:00   Lunch + Poster Setup 2
13:00   14:00   Poster Session 2
14:00   14:30   Keynote - Byron Wallace - What (if anything) can interpretability do for healthcare?
14:30   14:45   Talk - Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
14:45   15:00   Talk - Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
15:00   15:30   Coffee Break
15:30   16:00   Keynote - Eric Wong - Explanations for Experts via Guarantees and Domain Knowledge: From Attributions to Reasoning
16:00   16:45   Panel Discussion
16:45   17:00   Closing Remarks

Invited Speakers

Been Kim

Senior Staff Research Scientist, Google DeepMind

Agentic Interpretability and Neologism: what LLMs can offer us

Sarah Schwettmann

Co-Founder, Transluce; Research Scientist, MIT CSAIL

AI Investigators for Understanding AI Systems

Byron Wallace

Associate Professor, Northeastern University

What (if anything) can interpretability do for healthcare?

LLMs are poised to reshape healthcare, offering the possibility of delivering better care at scale. But the black-box nature of such models brings real risks. Deployed naively, such models may worsen existing biases and exploit spurious correlations; ultimately, this may harm patient care. Emerging (“mechanistic”) interpretability methods promise to make such models more transparent, but the degree to which such methods might offer actionable insights in realistic, domain-specific tasks is unclear. In this talk I’ll discuss some applications of recent interpretability techniques in the context of healthcare, highlighting their potential as well as some current limitations.

Eric Wong

Assistant Professor, University of Pennsylvania

Explanations for Experts via Guarantees and Domain Knowledge: From Attributions to Reasoning

“Build it and they will come.” After years of research on interpreting ML models, why have domain experts largely stayed away? A major obstacle is one of translation: experts don’t understand what to do with ML explanations, as the exact interpretation is often unclear and fails to align with how an expert thinks. This talk introduces two lines of research to make explanations accessible to experts. First, we introduce explanations with certified guarantees for mathematically precise and unambiguous interpretations. Second, we develop benchmarks to quantify the alignment of these explanations with expert knowledge, creating a way to evaluate if they make sense in an expert’s domain language. We demonstrate our techniques across applications in healthcare, astrophysics, and psychology, for explanations ranging from classic feature attributions to LLM chain-of-thought reasoning.

Panelists

Naomi Saphra

Research Fellow, Kempner Institute, Harvard University

Sam Marks

Technical Staff, Anthropic

Kyle Lo

Research Scientist, Ai2

Fazl Barez

Senior Research Fellow, University of Oxford

Organizers

Tal Haklay

PhD student, Technion

Hadas Orgad

PhD student, Technion

Anja Reusch

Postdoc, Technion

Marius Mosbach

Postdoc, McGill University and Mila – Quebec AI Institute

Sarah Wiegreffe

Postdoc, Allen Institute for AI (Ai2) and University of Washington

Ian Tenney

Staff Research Scientist, Google DeepMind

Mor Geva

Assistant Professor, Tel Aviv University; Research Scientist, Google Research

Reviewers

We would like to thank our reviewers for their valuable insights and support, which made this workshop possible. Your contributions are greatly appreciated:

Aaron Mueller, Aaron T Parisi, Ada Defne Tur, Adi Simhi, Adir Rahamim, Agam Goyal, Alessandro Stolfo, Andrew Parry, Arian Khorasani, Aruna Sankaranarayanan, Benno Krojer, Bitya Neuhof, Canyu Chen, Catherine Chen, Changhun Kim, Chunyuan Deng, Churan Zhi, Clément Dumas, Dana Arad, Daniela Gottesman, Danil Akhtiamov, Dheeraj Rajagopal, Di Wu, Duncan McClements, Emily Reif, Eric Todd, EunJeong Hwang, Gabriel Kasmi, Gintare Karolina Dziugaite, Gurmeet Saran, Guy Kaplan, Hakaze Cho, Hosein Mohebbi, Hubert Baniecki, Isha Chaudhary, Ishaan Malhi, Itay Itzhak, Itay Yona, Iván Arcuschin, Jakob Krebs, James Wexler, Jannik Brinkmann, Javier Ferrando, Jing Huang, Joachim Studnia, Julian Rodemann, Julius Gonsior, Koyena Pal, Krishna Kanth Nakka, Kushal Thakkar, Liang Yan, Maheep Chaudhary, Mani Malek, Marius Mosbach, Martin Tutek, Md Abrar Jahin, Mehrsa Pourya, Michael A. Hedderich, Michael Hanna, Michael Toker, Mohammad Jalali, Mohammad Taufeeque, Myungjoon Kim, Natalie Shapira, Naveen Janaki Raman, Neta Glazer, Nikhil Prakash, Niloofar Azizi, Nils Palumbo, Nischal Reddy Chandra, Nishant Suresh Aswani, Nishit Anand, Pattarawat Chormai, Pepa Atanasova, Peter Chen, Pratinav Seth, Renjie Cao, Riccardo Renzulli, Rishikesh Jha, Rynaa Grover, Sahiti Yerramilli, Samuel Pfrommer, Sanghamitra Dutta, Sewoong Lee, Sheridan Feucht, Shiqi Chen, Siddharth Mishra-Sharma, Sigurd Schacht, Simon Ostermann, Simone Piaggesi, Sohee Yang, Steve Azzolin, Susanne Dandl, Tamar Rott Shaham, Tanja Baeumel, Ting-Yun Chang, Tomás Vergara Browne, Varun Gumma, William Saunders, Xavier Thomas, Xiang Pan, Xiantao Zhang, Yaniv Nikankin, Yanna Ding, Yisong Miao, Yoav Gur-Arieh, Zecheng Zhang

Sponsors

© Workshop on Actionable Interpretability@ICML 2025