Analysis · AI Models
8 days ago
Paper proposes better Activation Oracles for LLM interpretability
Activation Oracles (AOs) for interpreting residual stream activations suffer from hallucinations and vagueness. The paper also identifies text-inversion confounds that complicate evaluation.