Tool-Call Safety Is Not Text Safety: Why Coding Agents Need Action-Time Authorization

- Share:





2938 Members
LLM coding agents are useful because they can act, not just talk.
That’s also exactly why text safety alone fails.
Recent work shows a consistent pattern: models that refuse harmful requests in conversation can still attempt or execute harmful tool calls under the same policy constraints. See the GAP benchmark in Mind the GAP, the affordance findings in The Causal Impact of Tool Affordance on Safety Alignment in LLM Agents, and high-privilege attack scenarios in ClawSafety. A practical engineering framing of this appears in Daniel Vaughan’s Codex-focused writeup on the tool affordance safety gap.
Text alignment evaluates what the model says. Tool safety evaluates what the system does.
Those are different control planes.
When an agent has a tool interface, it can produce two parallel outputs in the same turn:
If your controls only inspect text, you can pass safety evals while still leaking data, mutating state, or running risky commands. That is exactly the gap these papers quantify.
Because “refusal” is not the same object as “authorization.”
A refusal is model output. Authorization is a runtime decision at execution time.
Common failure path:
call_tool/command invocation due to latent objective pressure, prompt injection, parser mismatch, or tool-policy mismatch.So yes: an agent can look safe in chat and still be unsafe in side effects. This is not paradoxical; it is an architecture bug.
These controls are complementary, but they solve different problems:
| Control | What it does | What it does not do | Typical failure mode |
|---|---|---|---|
| Guardrails (prompt/rules) | Influences model behavior | Cannot enforce execution denial alone | Model finds alternative tool/path |
| Sandbox | Constrains OS/files/network blast radius | Doesn’t decide business legitimacy | Safe sandboxed action can still be unauthorized |
| Approval modes | Adds user gate for risky actions | Human can over-approve; defaults vary | Fatigue-click approvals |
| Hooks (e.g., PreToolUse) | Intercepts calls at runtime for local policy checks | Usually local and bypassable via alternate paths if incomplete | Coverage gaps across tools/events |
| Central authorization (PDP/policy service) | Makes final allow/deny using identity + context + resource policy | Needs tight integration with tool runtime | “Advisory-only” deployment that logs but doesn’t block |
Key point: guardrails and hooks are not a substitute for authorization. They are inputs and enforcement points.

At minimum, each tool invocation should be authorized with these inputs:
read_file, run_shell, db.query, mcp.tools/call).tools/call authorizationUsing the MCP protocol boundary (tools, schema):
tools/call with {name, arguments}.principal, action, resource, context).allow-with-approval, run human confirmation flow.allow.isError, output class, side-effect metadata).This is where prompt injection belongs: not just “prompt hygiene,” but authorization context that can downgrade trust and require stricter policy.

After the product-agnostic model, here is the practical mapping.
Use Codex hooks and config as enforcement surfaces, not just observability:
Practical guidance:
PreToolUse for high-risk operations.PostToolUse for immutable audit records.enabled_tools, per-tool approval modes) for least privilege.PreToolUse is a guardrail, not a complete enforcement boundary; maintain server-side authorization too.Use both hooks and permission system together:
PreToolUse / PermissionRequest: Claude Code hooks referencePractical guidance:
PreToolUse should enforce deterministic deny rules for destructive actions.PermissionRequest patterns should route uncertain/high-impact actions to human approval.Use approval defaults and permission tokens, but assume bypass pressure exists:
Practical guidance:
MCP defines the protocol, not your enterprise authorization policy.
The protocol supports tool discovery and invocation, and explicitly emphasizes human-in-the-loop expectations, but your runtime must still enforce policy at tools/call.
So combine:
Permit (or any equivalent centralized authorization control plane) should sit as the authoritative decision point for agent actions at runtime, not just app-level RBAC.
In practice: hooks capture context, MCP allowlists reduce exposed attack surface, and centralized policy decides whether the action is allowed now, for this agent, on this resource, under this risk state. That gives you consistency across IDE agent, MCP server, API, and data layer instead of fragmented local rules.
Think in layers, each with a distinct job:
If any layer is missing, blind spots appear:
Logging only executed calls hides intent and pressure.
Logging only attempted calls hides realized harm.
You need both streams:
Why both are required:
It is partly a model issue, but operationally it is an authorization issue. Injection changes the agent’s decision context and trust assumptions, which should trigger stricter action policy. If your system treats injected context the same as trusted instructions, the policy layer is under-specified.
Yes. Sandboxing limits technical blast radius, but it does not decide whether a specific action is legitimate for a given user/task/resource. You still need runtime allow/deny based on identity, delegation, and business policy.
No, not by themselves. Hooks are great interception points, but they are typically local and can have coverage limitations or configuration drift. Use them as part of a layered design with MCP scoping and centralized policy enforcement.
Because blocked attempts show what the agent wanted to do under current incentives. Without that signal, you can’t detect persistent unsafe behavior or tune policy intelligently. It also helps prove your controls are actively preventing harmful actions.
Start minimal and add tools only when a concrete workflow requires them. Broad tool exposure increases both prompt-injection blast radius and accidental misuse probability. Least-privilege tool catalogs are one of the highest-leverage safety controls.
If every action needs approval, yes. The fix is risk-tiered policy: auto-allow low-risk reads, require approvals for state-changing/high-impact actions, and deny forbidden classes outright. Good policy design improves both safety and flow.
First, add runtime interception (PreToolUse-style) and strict MCP allowlists. Second, route each tool call through centralized policy with identity/context-aware decisions. Third, instrument attempted vs executed audit streams and tune from real data.

Co-Founder / CEO at Permit.io