Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
This work tackles the gap between chat-model alignment and agent-model safety: existing RLHF/DPO methods optimize for helpful responses, but they offer no protection in sequential decision-making, where a single bad tool call can cause irreversible harm.
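To make the "act or refuse" idea concrete, here is a minimal, purely illustrative sketch of a per-step guard that vets each proposed tool call before execution. All names (ToolCall, GuardedAgent, risk_score, IRREVERSIBLE_TOOLS) and the scoring rule are hypothetical stand-ins, not the method described by the title; a learned risk model would replace the hand-written heuristic.

```python
# Illustrative sketch only: a per-step guard that decides whether an agent's
# proposed tool call is executed or refused. All identifiers are hypothetical;
# risk_score is a toy stand-in for a learned (state, action) risk model.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ToolCall:
    name: str             # tool identifier, e.g. "delete_file"
    args: Dict[str, str]  # arguments proposed by the agent

# Tools whose effects cannot be undone; a single bad call here is the
# irreversible-harm failure mode the guard is meant to prevent.
IRREVERSIBLE_TOOLS = {"delete_file", "send_payment", "shell_exec"}

def risk_score(call: ToolCall, history: List[ToolCall]) -> float:
    """Toy heuristic standing in for a learned risk model."""
    score = 0.0
    if call.name in IRREVERSIBLE_TOOLS:
        score += 0.7  # irreversible side effect
    if any(c.name == call.name for c in history):
        score += 0.1  # repeated use of a risky tool in this trajectory
    if any("*" in v or v == "/" for v in call.args.values()):
        score += 0.3  # overly broad arguments (wildcards, filesystem root)
    return min(score, 1.0)

@dataclass
class GuardedAgent:
    tools: Dict[str, Callable[..., str]]
    refuse_threshold: float = 0.6
    history: List[ToolCall] = field(default_factory=list)

    def step(self, call: ToolCall) -> str:
        """Execute the call only if the guard judges it safe; refuse otherwise."""
        risk = risk_score(call, self.history)
        if risk >= self.refuse_threshold:
            return f"REFUSED {call.name} (risk={risk:.2f}): needs confirmation"
        self.history.append(call)
        return self.tools[call.name](**call.args)

if __name__ == "__main__":
    agent = GuardedAgent(tools={
        "read_file": lambda path: f"contents of {path}",
        "delete_file": lambda path: f"deleted {path}",
    })
    print(agent.step(ToolCall("read_file", {"path": "notes.txt"})))  # executed
    print(agent.step(ToolCall("delete_file", {"path": "/"})))        # refused
```

The design point the sketch illustrates is that the guard sits inside the multi-step loop, scoring every individual tool call against the trajectory so far, rather than judging only the final response the way single-turn alignment does.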