Qwen 3.5 censorship identified as small circuit in weights

Analysis · AI Models

28 days ago

vas-blog.pages.dev

A mechanistic interpretability study of Qwen 3.5-9B finds that political censorship is encoded in a small, identifiable circuit spanning layers 11-31. The base model retains factual knowledge on sensitive topics; refusal behavior is layered on top via specific internal directions.
