Mechanistic Interpretability


CLUE: Conflict-guided Localization for LLM Unlearning Framework

LLM unlearning aims to eliminate the influence of undesirable data without affecting causally unrelated information. This process typically involves using a forget set to …

Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Skill Path: Unveiling Language Skills from Circuit Graphs

Circuit graph discovery has emerged as a fundamental approach to elucidating the skill mechanisms of language models. Despite the output faithfulness of circuit graphs, they …

Hang Chen, Xinyu Yang, Jiaying Zhu, Wenya Wang

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. …

Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang