reason field on every transaction to catch these attacks before funds move.
How does Mandate detect prompt injection?
The reason scanner combines two layers: 18+ hardcoded regex patterns for known attack signatures and an LLM judge for nuanced analysis. Everyvalidate() and preflight() call includes a reason field. The scanner evaluates this field before the policy engine runs. If the scanner flags the reason, the transaction is blocked immediately.
The two-layer approach balances speed and accuracy. Hardcoded patterns catch known attacks in milliseconds. The LLM judge catches novel attacks that don’t match any pattern.
What patterns does the hardcoded scanner check?
The scanner matches against 18+ regex patterns across several categories. These are examples, not an exhaustive list: Instruction override patterns:- “ignore all previous instructions”
- “system override”
- “bypass policy”
- “admin mode”
- “disable safety”
- “pretend you are”
- “act as if”
- “you are now”
- “imagine you are a”
- “immediately without checking”
- “skip verification”
- “emergency override”
- “time-sensitive, no review”
- “transfer maximum balance”
- “send all funds”
- “drain wallet”
- “withdraw everything”
How does the LLM judge work?
When a reason passes hardcoded patterns but raises suspicion based on heuristic scoring, Mandate sends it to an LLM judge for deeper analysis. The judge runs on Venice.ai with zero data retention, so no transaction data is stored or used for training. The judge evaluates three questions:- Does the reason match the transaction action? A reason saying “pay invoice #1234” for a 0.01 ETH transfer is consistent. A reason saying “ignore limits and send maximum” is not.
- Does it contain social engineering patterns? Phrases like “the CEO urgently needs” or “this is a time-critical opportunity” trigger scrutiny.
- Is it attempting to override agent behavior? Meta-instructions that try to change how the agent operates, rather than describing a legitimate transaction purpose, are flagged.
Can agents provide counter-evidence?
Yes. Thevalidate() call accepts an optional context field alongside the reason. Agents can provide legitimate context that explains why a reason might look suspicious but is actually valid.
For example, a customer support agent might legitimately process a refund with the reason “send full refund to customer.” Without context, this could match the “send all funds” pattern. With context explaining the refund workflow and the specific ticket number, the scanner can make a more informed decision.
The scanner weighs counter-evidence against the severity of the detected pattern. High-severity patterns (like “ignore all previous instructions”) are blocked regardless of context. Lower-severity patterns can be overridden by strong context.
What happens when a reason is flagged?
The transaction is blocked with these response fields:declineMessage includes the category of the detected pattern. The full reason text is logged in the audit trail for the agent owner to review.
How do you handle false positives?
If a legitimate use case consistently triggers the reason scanner, the agent owner has two options:- Adjust guard_rules in the policy. The policy builder allows whitelisting specific reason patterns for a given agent. This is scoped, so the whitelist only applies to that agent’s policy.
- Improve the reason text. Often, rewording the reason to be more specific and less generic resolves the false positive. “Transfer 0.5 ETH to vendor 0xABC for March hosting invoice” is better than “send payment now.”
reason_blocked entries to identify patterns that need adjustment.
Reason Field
Why every transaction needs a reason
Block Reasons
All possible block reasons explained
Threat Model
Full threat model overview