Analysis · AI Models
28 days ago
Qwen 3.5 censorship identified as small circuit in weights
A mechanistic interpretability study of Qwen 3.5-9B finds that political censorship is encoded in a small, identifiable circuit spanning layers 11-31. The base model retains factual knowledge on sensitive topics; the refusal behavior is layered on top via specific internal directions.
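
To illustrate what an "internal direction" means here, the following is a minimal sketch of one common interpretability technique: estimating a direction as the difference of mean activations between two prompt sets, then projecting it out of the hidden states. It is not the study's actual method or code. The helper names (`refusal_direction`, `ablate_direction`), the 16-dimensional toy activations, and the planted direction are all assumptions made for illustration; real activations would come from the model's residual stream.

```python
import numpy as np

def refusal_direction(refused_acts, answered_acts):
    """Hypothetical helper: unit vector along the difference of mean activations."""
    diff = refused_acts.mean(axis=0) - answered_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(hidden, direction):
    """Hypothetical helper: remove each hidden state's component along `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy stand-in for activations at one layer (assumed data, not real model output).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0  # planted "refusal" axis for the toy example

answered = rng.normal(size=(200, d_model))
refused = rng.normal(size=(200, d_model)) + 4.0 * true_dir

direction = refusal_direction(refused, answered)
cleaned = ablate_direction(refused, direction)

# After ablation, the hidden states have no component along the direction.
print(np.abs(cleaned @ direction).max())
```

In this toy setup the estimated direction closely matches the planted one, and ablation leaves every other component of the hidden states unchanged, which mirrors the idea that knowledge can remain intact while a separate refusal signal is removed.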