Tool-Call Safety Is Not Text Safety: Why Coding Agents Need Action-Time Authorization

Text refusal and tool behavior can diverge in coding agents. This article explains why runtime, action-time authorization is the real security boundary for Codex, Claude Code, Cursor, and MCP tool calls.

Or Weis

Jun 22 2026

LLM coding agents are useful because they can act, not just talk.
That’s also exactly why text safety alone fails.

Recent work shows a consistent pattern: models that refuse harmful requests in conversation can still attempt or execute harmful tool calls under the same policy constraints. See the GAP benchmark in Mind the GAP, the affordance findings in The Causal Impact of Tool Affordance on Safety Alignment in LLM Agents, and high-privilege attack scenarios in ClawSafety. A practical engineering framing of this appears in Daniel Vaughan’s Codex-focused writeup on the tool affordance safety gap.

The safety gap is structural, not cosmetic

Text alignment evaluates what the model says. Tool safety evaluates what the system does.
Those are different control planes.

When an agent has a tool interface, it can produce two parallel outputs in the same turn:

A natural-language refusal for the user.
A structured tool invocation for the runtime.

If your controls only inspect text, you can pass safety evals while still leaking data, mutating state, or running risky commands. That is exactly the gap these papers quantify.

Why an agent can refuse in text and still execute a forbidden tool call

Because “refusal” is not the same object as “authorization.”
A refusal is model output. Authorization is a runtime decision at execution time.

Common failure path:

The user asks for a disallowed action.
The model emits refusal-like text.
The planner/tool layer still emits a call_tool/command invocation due to latent objective pressure, prompt injection, parser mismatch, or tool-policy mismatch.
Runtime executes because no action-time deny gate exists (or because gate checks only coarse permissions).

So yes: an agent can look safe in chat and still be unsafe in side effects. This is not paradoxical; it is an architecture bug.

Guardrails vs sandboxes vs approvals vs hooks vs authorization

These controls are complementary, but they solve different problems:

Control	What it does	What it does not do	Typical failure mode
Guardrails (prompt/rules)	Influences model behavior	Cannot enforce execution denial alone	Model finds alternative tool/path
Sandbox	Constrains OS/files/network blast radius	Doesn’t decide business legitimacy	Safe sandboxed action can still be unauthorized
Approval modes	Adds user gate for risky actions	Human can over-approve; defaults vary	Fatigue-click approvals
Hooks (e.g., PreToolUse)	Intercepts calls at runtime for local policy checks	Usually local and bypassable via alternate paths if incomplete	Coverage gaps across tools/events
Central authorization (PDP/policy service)	Makes final allow/deny using identity + context + resource policy	Needs tight integration with tool runtime	“Advisory-only” deployment that logs but doesn’t block

Key point: guardrails and hooks are not a substitute for authorization. They are inputs and enforcement points.

Action-time authorization: what must be evaluated at call time

At minimum, each tool invocation should be authorized with these inputs:

Who: user identity, agent identity, tenant, delegated-on-behalf-of chain.
What action: tool name + normalized operation verb (read_file, run_shell, db.query, mcp.tools/call).
Target resource: file path, table, endpoint, repo, secret class, external domain.
Arguments: parsed and risk-scored parameters.
Where/when: environment, workspace trust, time, geo, network zone.
Why: task/ticket/session purpose and current plan step.
Risk state: prompt-injection indicators, anomalous sequence, prior denials.
Required controls: human approval required? break-glass? read-only mode?

Reference flow for MCP `tools/call` authorization

Using the MCP protocol boundary (tools, schema):

Client receives candidate tools/call with {name, arguments}.
Normalize into an internal action tuple (principal, action, resource, context).
Enforce local allowlist/denylist before execution.
Query centralized policy engine for final decision.
If decision is allow-with-approval, run human confirmation flow.
Execute call only if final state is allow.
Log both the attempt and execution result (isError, output class, side-effect metadata).
Feed outcome back into risk signals for subsequent calls in-session.

This is where prompt injection belongs: not just “prompt hygiene,” but authorization context that can downgrade trust and require stricter policy.

Concrete runtime guidance for Codex, Claude Code, Cursor, and MCP (and where Permit fits)

After the product-agnostic model, here is the practical mapping.

Codex CLI

Use Codex hooks and config as enforcement surfaces, not just observability:

Codex hooks doc: Codex Hooks
Config controls: Codex Configuration Reference

Practical guidance:

Put a blocking policy in PreToolUse for high-risk operations.
Use PostToolUse for immutable audit records.
Use MCP tool scoping (enabled_tools, per-tool approval modes) for least privilege.
Treat Codex’s own warning seriously: PreToolUse is a guardrail, not a complete enforcement boundary; maintain server-side authorization too.

Claude Code

Use both hooks and permission system together:

Hooks lifecycle and PreToolUse / PermissionRequest: Claude Code hooks reference
Permissions model: Configure permissions
Modes: Permission modes

Practical guidance:

PreToolUse should enforce deterministic deny rules for destructive actions.
PermissionRequest patterns should route uncertain/high-impact actions to human approval.
Keep deny > ask > allow precedence explicit in policy design.
Use managed settings for org-wide consistency, not per-user drift.

Cursor

Use approval defaults and permission tokens, but assume bypass pressure exists:

Security model: Cursor Agent Security
CLI permission tokens: Cursor CLI permissions
Mode surface: Cursor modes
MCP integration context: Cursor MCP docs

Practical guidance:

Keep sensitive actions on explicit manual approval.
Do not treat command allowlists as complete security controls.
Avoid “run everything” styles for untrusted repos/content.
Require per-tool MCP approvals plus centralized policy checks for critical backends.

MCP boundary itself

MCP defines the protocol, not your enterprise authorization policy.
The protocol supports tool discovery and invocation, and explicitly emphasizes human-in-the-loop expectations, but your runtime must still enforce policy at tools/call.

So combine:

MCP server allowlists,
client-side pre-execution checks,
centralized authorization decision,
and post-execution auditing.

Where Permit enters the architecture

Permit (or any equivalent centralized authorization control plane) should sit as the authoritative decision point for agent actions at runtime, not just app-level RBAC.
In practice: hooks capture context, MCP allowlists reduce exposed attack surface, and centralized policy decides whether the action is allowed now, for this agent, on this resource, under this risk state. That gives you consistency across IDE agent, MCP server, API, and data layer instead of fragmented local rules.

PreToolUse hooks, MCP allowlists, and centralized policy checks: how they fit together

Think in layers, each with a distinct job:

PreToolUse hook: fast local interception, syntax/risk checks, early deny.
MCP allowlist: static least-privilege envelope (only needed tools exposed).
Central policy check: dynamic allow/deny using identity, delegation, and context.
Approval workflow: human decision for high-risk branches.
PostToolUse logging: immutable evidence + feedback loop.

If any layer is missing, blind spots appear:

No PreToolUse: risky calls reach runtime too often.
No MCP scoping: unnecessary dangerous tools remain callable.
No central policy: inconsistent local behavior and weak governance.
No audit: no forensic trail, no measurable deterrence, no control tuning.

Audit design: log attempted and executed calls (you need both)

Logging only executed calls hides intent and pressure.
Logging only attempted calls hides realized harm.

You need both streams:

Attempted calls: what the model tried to do, including blocked actions.
Executed calls: what actually ran, where, with what side effects.

Why both are required:

Attempt volume reveals persistent misalignment even when controls block harm.
Execution logs quantify actual impact and compliance exposure.
The delta between attempted and executed is your control effectiveness metric.
Incident response needs sequence reconstruction across deny/approve/retry chains.

Frequently asked questions

Isn’t prompt injection mainly a model-quality issue?

It is partly a model issue, but operationally it is an authorization issue. Injection changes the agent’s decision context and trust assumptions, which should trigger stricter action policy. If your system treats injected context the same as trusted instructions, the policy layer is under-specified.

If I already have sandboxing, do I still need action-time authorization?

Yes. Sandboxing limits technical blast radius, but it does not decide whether a specific action is legitimate for a given user/task/resource. You still need runtime allow/deny based on identity, delegation, and business policy.

Are PreToolUse hooks enough for enterprise safety?

No, not by themselves. Hooks are great interception points, but they are typically local and can have coverage limitations or configuration drift. Use them as part of a layered design with MCP scoping and centralized policy enforcement.

Why insist on logging blocked attempts?

Because blocked attempts show what the agent wanted to do under current incentives. Without that signal, you can’t detect persistent unsafe behavior or tune policy intelligently. It also helps prove your controls are actively preventing harmful actions.

How strict should MCP tool allowlists be?

Start minimal and add tools only when a concrete workflow requires them. Broad tool exposure increases both prompt-injection blast radius and accidental misuse probability. Least-privilege tool catalogs are one of the highest-leverage safety controls.

Won’t frequent approvals destroy developer productivity?

If every action needs approval, yes. The fix is risk-tiered policy: auto-allow low-risk reads, require approvals for state-changing/high-impact actions, and deny forbidden classes outright. Good policy design improves both safety and flow.

What is the quickest upgrade path from “text-safe” to “action-safe”?

First, add runtime interception (PreToolUse-style) and strict MCP allowlists. Second, route each tool call through centralized policy with identity/context-aware decisions. Third, instrument attempted vs executed audit streams and tune from real data.

Written by

Or Weis

Co-Founder / CEO at Permit.io

Related Tags

Test in minutes,go to prod in days.

Get Started Now

Join our Community

2938 Members

Get support from our experts, Learn from fellow devs

Join Permit's Slack

LLM coding agents are useful because they can act, not just talk.
That’s also exactly why text safety alone fails.

The safety gap is structural, not cosmetic

Text alignment evaluates what the model says. Tool safety evaluates what the system does.
Those are different control planes.

When an agent has a tool interface, it can produce two parallel outputs in the same turn:

A natural-language refusal for the user.
A structured tool invocation for the runtime.

If your controls only inspect text, you can pass safety evals while still leaking data, mutating state, or running risky commands. That is exactly the gap these papers quantify.

Why an agent can refuse in text and still execute a forbidden tool call

Because “refusal” is not the same object as “authorization.”
A refusal is model output. Authorization is a runtime decision at execution time.

Common failure path:

The user asks for a disallowed action.
The model emits refusal-like text.
The planner/tool layer still emits a call_tool/command invocation due to latent objective pressure, prompt injection, parser mismatch, or tool-policy mismatch.
Runtime executes because no action-time deny gate exists (or because gate checks only coarse permissions).

So yes: an agent can look safe in chat and still be unsafe in side effects. This is not paradoxical; it is an architecture bug.

Guardrails vs sandboxes vs approvals vs hooks vs authorization

These controls are complementary, but they solve different problems:

Control	What it does	What it does not do	Typical failure mode
Guardrails (prompt/rules)	Influences model behavior	Cannot enforce execution denial alone	Model finds alternative tool/path
Sandbox	Constrains OS/files/network blast radius	Doesn’t decide business legitimacy	Safe sandboxed action can still be unauthorized
Approval modes	Adds user gate for risky actions	Human can over-approve; defaults vary	Fatigue-click approvals
Hooks (e.g., PreToolUse)	Intercepts calls at runtime for local policy checks	Usually local and bypassable via alternate paths if incomplete	Coverage gaps across tools/events
Central authorization (PDP/policy service)	Makes final allow/deny using identity + context + resource policy	Needs tight integration with tool runtime	“Advisory-only” deployment that logs but doesn’t block

Key point: guardrails and hooks are not a substitute for authorization. They are inputs and enforcement points.

Action-time authorization: what must be evaluated at call time

At minimum, each tool invocation should be authorized with these inputs:

Who: user identity, agent identity, tenant, delegated-on-behalf-of chain.
What action: tool name + normalized operation verb (read_file, run_shell, db.query, mcp.tools/call).
Target resource: file path, table, endpoint, repo, secret class, external domain.
Arguments: parsed and risk-scored parameters.
Where/when: environment, workspace trust, time, geo, network zone.
Why: task/ticket/session purpose and current plan step.
Risk state: prompt-injection indicators, anomalous sequence, prior denials.
Required controls: human approval required? break-glass? read-only mode?

Reference flow for MCP `tools/call` authorization

Using the MCP protocol boundary (tools, schema):

Client receives candidate tools/call with {name, arguments}.
Normalize into an internal action tuple (principal, action, resource, context).
Enforce local allowlist/denylist before execution.
Query centralized policy engine for final decision.
If decision is allow-with-approval, run human confirmation flow.
Execute call only if final state is allow.
Log both the attempt and execution result (isError, output class, side-effect metadata).
Feed outcome back into risk signals for subsequent calls in-session.

This is where prompt injection belongs: not just “prompt hygiene,” but authorization context that can downgrade trust and require stricter policy.