
Qwen 3.5 censorship identified as small circuit in weights

A mechanistic interpretability study of Qwen 3.5-9B finds that political censorship is encoded in a small, identifiable circuit spanning layers 11-31. The base model retains factual knowledge on sensitive topics; the refusal behavior is layered on top via specific internal directions.
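
For readers unfamiliar with what an "internal direction" means here, the sketch below illustrates one common interpretability technique: estimate a direction as the difference of mean activations between two prompt sets, then project it out of an activation. This is an assumption for illustration only; the brief does not say which method the study used. The activations are synthetic, and `refusal_direction`, `ablate`, and the planted direction are hypothetical names, not anything taken from Qwen or the study.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused, answered):
    """Unit difference-of-means direction between two sets of activations."""
    mu_refused = mean_vector(refused)
    mu_answered = mean_vector(answered)
    return unit([r - a for r, a in zip(mu_refused, mu_answered)])


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = dot(activation, direction)
    return [x - coeff * d for x, d in zip(activation, direction)]


# Synthetic stand-in for hidden states (NOT real model activations):
# "refused" examples are noise plus a shift along a planted direction.
random.seed(0)
dim = 8
planted = unit([1.0] + [0.0] * (dim - 1))


def noise():
    return [random.gauss(0.0, 0.1) for _ in range(dim)]


answered = [noise() for _ in range(50)]
refused = [[n + 2.0 * p for n, p in zip(noise(), planted)] for _ in range(50)]

direction = refusal_direction(refused, answered)
cleaned = ablate(refused[0], direction)

print("cosine with planted direction:", dot(direction, planted))
print("component left after ablation:", dot(cleaned, direction))
```

In a real experiment the vectors would be hidden states captured at a chosen layer, and the ablation would be applied during the forward pass. The toy version only shows the arithmetic.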

28 days ago