Unfireable Safety Kernel proposes execution-time AI alignment for agents — AIBriefs

Analysis · Policy · AI Agents

Jun 25, 4:00 AM

Unfireable Safety Kernel proposes execution-time AI alignment for agents

arXiv cs.AI (top papers)

The paper introduces a safety kernel that runs outside the agent's runtime, so the agent cannot remove it. The aim is to prevent AI agents from bypassing safety controls in tools and APIs.
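To make the idea concrete, here is a minimal sketch of a gate that mediates every tool call. It is not the paper's implementation: `SafetyKernel`, `ToolCall`, `agent_step`, the stub tools, and the allowlist policy are all hypothetical names chosen for illustration. The boundary is also modeled inside a single Python process for brevity, whereas the property described above depends on the kernel running outside the agent's runtime (for example as a separate process or service).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ToolCall:
    """A request from the agent to run one tool (hypothetical shape)."""
    tool: str
    args: Dict[str, Any]


class SafetyKernel:
    """Hypothetical gate that owns the tools; the agent only submits requests.

    Assumption: in a real deployment this would live in a separate process,
    so the agent could not reach or modify it. Here it is an in-process model.
    """

    def __init__(self, tools: Dict[str, Callable[..., Any]], allowed: Iterable[str]):
        self._tools = dict(tools)
        self._allowed = frozenset(allowed)
        self.audit_log: List[Tuple[str, str]] = []

    def execute(self, call: ToolCall) -> Any:
        # The policy check happens at execution time, before the tool runs.
        if call.tool not in self._allowed or call.tool not in self._tools:
            self.audit_log.append((call.tool, "denied"))
            raise PermissionError(f"tool {call.tool!r} denied by kernel policy")
        self.audit_log.append((call.tool, "allowed"))
        return self._tools[call.tool](**call.args)


def read_file_stub(path: str) -> str:
    # Stand-in for a real tool; it touches nothing on disk.
    return f"contents of {path}"


def delete_file_stub(path: str) -> str:
    # Stand-in for a destructive tool that the policy does not allow.
    return f"deleted {path}"


def agent_step(submit: Callable[[ToolCall], Any], call: ToolCall) -> Any:
    """The agent's only route to a tool is the submit function it was given."""
    try:
        return submit(call)
    except PermissionError as exc:
        return f"blocked: {exc}"


kernel = SafetyKernel(
    tools={"read_file": read_file_stub, "delete_file": delete_file_stub},
    allowed={"read_file"},
)
```

With this setup, `agent_step(kernel.execute, ToolCall("read_file", {"path": "a.txt"}))` returns the stub's result, while the same call with `"delete_file"` comes back as a `blocked:` message and is recorded in `kernel.audit_log`.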

A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation · 5 days ago · Abrar Alotaibi, Raed Mughus, Moataz Ahmed

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring · 5 days ago · Yang Gao (Veyon Solutions)

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation · 5 days ago · Han Jeon, Shiv Medler, Joseph Voyles, Matt Wood

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics · 5 days ago · Sofiia Nikolenko, Michele Papucci, Mina Rezaei, Shireen Kudukkil Manchingal

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models · 5 days ago · Chang Wu, Junfeng Fang, Houcheng Jiang, Kai Tang, Pengyu Cheng, Xiaoxi Jiang, Guanjun Jiang, Xiang Wang

To Isolate or to Score? Model-Adaptive Assessment for Cost-Efficient Multi-Agent RAG · 5 days ago · Jungseob Lee, Chanjun Park, Heuiseok Lim

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges · 5 days ago · Thi Huyen Nguyen, Zahra Ahmadi

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety · 5 days ago · Shikai Qiu, Xiaowen Xu, Benlei Cui, Ting Ma, Xiufeng Huang, Wenjing Jiang, Shaoxuan He, Haolei Xu, Chunyang Chai, Yujian Li, Yiliang Zhang, Guanghui Wang, Ziheng Wang, Ziwen Xu, Zhaoyu Fan, Jinhao Chen, Ruijie Jian, Hongxing Li, Chuxi Xiao, Xinyue Chen, We
