26 days ago
Researcher questions the faithfulness of verbalizations in Anthropic's interpretability paper
Naomi Saphra
@nsaphra.bsky.social
Waiting on a robot body. All opinions are universal and held by both employers and family. ML/NLP professor. nsaphra.net
I have been thinking about this in light of Anthropic’s recent verbalization interp paper. It had no evidence convincing me that their verbalizations are faithful, but they are convincingly useful. Even wrong output can stimulate human creativity and increase the entropy of exploration.