Anthropic’s natural language autoencoders reveal hidden knowledge in Claude

2 hours ago

A new paper introduces natural language autoencoders that decode Claude’s internal representations into human-readable text. The technique shows that Claude often holds accurate knowledge even when its output is evasive, offering a window into model honesty.