Analysis, Policy, AI Models
23 hours ago
Paper probes misaligned thinking in language models
The paper explores detection of strategic deception, sandbagging, and self-preservation in LLMs. It aims to improve reliability in high-stakes deployments.