Analysis · Policy
Jun 17, 4:00 AM
New AI safety papers propose methods to detect LLM deception and bias
The paper 'Rift' proposes a deception-detection signature for language models, targeting the eliciting latent knowledge (ELK) problem. 'Actionable Activation Directions' identifies internal directions associated with emergent misalignment that are shared across model families. Several other papers address bias, reward hacking, and cultural context.