Researcher questions Anthropic verbalization paper's faithfulness

26 days ago

@nsaphra.bsky.social

Waiting on a robot body. All opinions are universal and held by both employers and family. ML/NLP professor. nsaphra.net

Naomi Saphra

I have been thinking about this in light of Anthropic’s recent verbalization interp paper. It had no evidence convincing me that their verbalizations are faithful, but they are convincingly useful. Even wrong output can stimulate human creativity and increase the entropy of exploration.
