New AI safety papers propose methods to detect LLM deception and bias

Analysis, Policy

Jun 17, 4:00 AM

The paper 'Rift' proposes a conflict signature for detecting deception in language models, aimed at the eliciting latent knowledge (ELK) problem. 'Actionable Activation Directions' identifies internal directions for emergent misalignment that are shared across model families. Several other papers address bias, reward hacking, and cultural context.

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation | 3 days ago | Matteo Pelossi, Rita Sevastjanova, Thilo Spinner, Mennatallah El-Assady

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs | 4 days ago | Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning | 4 days ago | Bo Su, Ankit Shah, Thai Le

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection | 4 days ago | Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement | 5 days ago | Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience | 5 days ago | Xinyang Liao, Lingyu Li, Huacan Liu, Tianle Gu, Yang Yao, Tong Zhu, Yan Teng, Yingchun Wang

LLMs Infer Cultural Context but Fail to Apply It When Responding | 5 days ago | Yisong Miao, Jian Zhu, Vered Shwartz

Unintended Effects of Geographic Conditioning in Large Language Models | 5 days ago | Naz Col, David M. Chan

Rift: A Conflict Signature for Deception in Language Models | 5 days ago | Petr Nyoma

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds | 6 days ago | Ömer Veysel Çağatan, Xuandong Zhao

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning | 6 days ago | Zhenyu Yu

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation | 6 days ago | Xian Sun, Wei Gao, Yingshuo Wang, Lingdong Kong, Yanhang Li, Zhichao Fan, Zexin Zhuang, Wenlong Dong, Zhiyuan Zheng, Hrishikesh Paranjape, Abhishek Mandal, Johnny R. Zhang
