Reliability Layer

Agents are not deterministic. The same prompt can take five different tool paths on five consecutive runs. The Reliability Layer measures that variance — and gives you a score you can act on.

Path Entropy

Path entropy measures how consistently an agent reaches its goal via the same sequence of tool calls. An agent with low entropy takes roughly the same steps every run. An agent with high entropy wanders — picking different tools, reordering steps, or arriving at the answer via routes that vary dramatically between runs.

Why it matters: high path entropy makes it hard to reason about what your agent will do in production. A support agent that sometimes uses search_knowledge_base and sometimes uses call_external_api for the same query is a reliability risk — and potentially a security risk if those paths have different permission requirements.

Run 1: search_kb → summarize → respond           (path A)
Run 2: call_api → parse → search_kb → respond    (path B)
Run 3: search_kb → call_api → respond            (path C)

Path entropy = HIGH — three distinct paths for the same input.
A well-tuned agent would converge on path A every time.

Path entropy is calculated as the Shannon entropy of observed tool-call sequences across the last N runs, normalized to 0–1. A score of 0 means every run followed the identical path. A score of 1 means no two runs followed the same path.
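A minimal sketch of that calculation, not the Reliability Layer's actual code. The function name `path_entropy` is hypothetical, and dividing by log2(N) is an assumed normalization, chosen because it yields 1 exactly when all N runs take distinct paths.

```python
import math
from collections import Counter

def path_entropy(runs):
    """Normalized Shannon entropy of tool-call paths (0 = identical, 1 = all distinct)."""
    n = len(runs)
    if n < 2:
        return 0.0  # a single run cannot show any variance
    # Treat each ordered tool-call sequence as one outcome.
    counts = Counter(tuple(path) for path in runs)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Assumption: normalize by log2(n), the entropy of n all-distinct paths.
    return entropy / math.log2(n)

# The three runs from the example above: three distinct paths.
runs = [
    ["search_kb", "summarize", "respond"],
    ["call_api", "parse", "search_kb", "respond"],
    ["search_kb", "call_api", "respond"],
]
score = path_entropy(runs)  # at the top of the range, since no path repeats
```

Because the paths are compared as ordered tuples, reordering the same tools counts as a different path, which matches the definition above.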

Tool Variance

Tool variance measures how stable your agent's tool selection is across runs with equivalent inputs. It differs from path entropy in that it looks at which tools were called, not in what order. An agent with low tool variance calls roughly the same set of tools for the same class of task.

Concrete example: a research agent should call web_search and summarize for every research query. If it occasionally also calls send_email or write_to_database without a clear reason, tool variance is high — and those unexpected calls may represent a security boundary being crossed unpredictably.

Expected tool set: {web_search, summarize}
Observed across 20 runs:
  - web_search called: 20/20 runs    ← stable
  - summarize called: 18/20 runs     ← mostly stable
  - send_email called: 3/20 runs     ← unexpected variance
  - write_to_database called: 1/20 runs  ← high-severity variance

Tool variance score: 0.31  (0 = perfectly stable, 1 = fully random)
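The per-tool tallies above can be reproduced with a short sketch. The helper names `tool_call_rates` and `unexpected_tools` are hypothetical. The formula behind the 0.31 score is not given here, so this sketch only computes the call rates and the unexpected tools, not the score itself.

```python
from collections import Counter

def tool_call_rates(runs):
    """Fraction of runs in which each tool was called at least once (order ignored)."""
    if not runs:
        return {}
    seen = Counter()
    for tools in runs:
        seen.update(set(tools))  # count each tool once per run
    return {tool: count / len(runs) for tool, count in seen.items()}

def unexpected_tools(runs, expected):
    """Call rates for tools outside the expected tool set."""
    rates = tool_call_rates(runs)
    return {tool: rate for tool, rate in rates.items() if tool not in expected}

# Rebuild 20 runs that match the tallies in the example above.
runs = [["web_search", "summarize"] for _ in range(20)]
runs[0] = ["web_search"]            # summarize skipped
runs[1] = ["web_search"]            # summarize skipped
for i in (2, 3, 4):
    runs[i] = runs[i] + ["send_email"]
runs[5] = runs[5] + ["write_to_database"]

rates = tool_call_rates(runs)
surprises = unexpected_tools(runs, expected={"web_search", "summarize"})
```

Here `surprises` contains only `send_email` and `write_to_database`, the two calls worth investigating first.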

Retry Explosion

Retry explosion tracks how often your agent re-attempts a tool call after a failure — and whether those retries are bounded. An agent that retries a failing API call 40 times is not a reliable agent. It's a runaway process that wastes tokens, burns rate limits, and may trigger downstream alerts.

Concrete example: a data pipeline agent calls fetch_record(id=42). The record doesn't exist. The agent retries. And retries. And retries. Without an explicit retry limit, this loop can run until the token budget is exhausted or the process is killed externally. The Retry Explosion detector flags any run where a single tool call is retried more than a configurable threshold (default: 3).

tool_call: fetch_record(id=42) → FAILED (not found)
tool_call: fetch_record(id=42) → FAILED (not found)
tool_call: fetch_record(id=42) → FAILED (not found)
tool_call: fetch_record(id=42) → FAILED (not found)  ← threshold exceeded
tool_call: fetch_record(id=42) → FAILED (not found)
...

retry_explosion = TRUE  (5 retries on same call, threshold = 3)
Reliability penalty: −25 points
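A detector along these lines can be sketched as follows. The `ToolCall` record and `has_retry_explosion` function are hypothetical names. The sketch assumes every consecutive failed attempt of the same call counts toward the threshold, which is how the trace above appears to count (the fourth attempt is the one marked as exceeding a threshold of 3).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # hashable snapshot of the arguments, e.g. (("id", 42),)
    ok: bool

def has_retry_explosion(calls, threshold=3):
    """True if any identical call fails more than `threshold` times in a row."""
    streak = 0
    previous_failure = None
    for call in calls:
        key = (call.name, call.args)
        if call.ok:
            streak, previous_failure = 0, None  # a success ends the streak
            continue
        # Extend the streak only when the same call failed just before.
        streak = streak + 1 if key == previous_failure else 1
        previous_failure = key
        if streak > threshold:
            return True
    return False

# Five identical failed attempts, as in the trace above.
trace = [ToolCall("fetch_record", (("id", 42),), ok=False)] * 5
flagged = has_retry_explosion(trace)
```

Keying on both the tool name and its arguments means that `fetch_record(id=42)` and `fetch_record(id=43)` are tracked as separate calls.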

Branch Instability

Branch instability measures how often your agent takes conditional branches that differ across runs with the same input. A stable agent makes the same decisions given the same context. An unstable agent flips between branches — sometimes taking the safe path, sometimes the expensive path, sometimes erroring out — with no clear reason.

Concrete example: a customer support agent is given the same complaint. On some runs it classifies it as "routine" and routes to the knowledge base. On other runs it classifies it as "escalation" and drafts an email to a human. The input is identical; the outcome is not. Branch instability is measured by tracking decision-point outcomes across runs with semantically equivalent inputs.

Input: "My order hasn't arrived after 5 days."
Run 1 → branch: "routine"   → lookup_order_status
Run 2 → branch: "routine"   → lookup_order_status
Run 3 → branch: "escalate"  → draft_escalation_email   ← divergence
Run 4 → branch: "routine"   → lookup_order_status
Run 5 → branch: "escalate"  → draft_escalation_email   ← divergence

Branch instability = 0.4 (40% of runs took an unexpected branch)
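One way to arrive at the 0.4 above is sketched below. It assumes the "expected" branch is simply the most common one across the runs; how the expected branch is actually determined is not specified here, and `branch_instability` is a hypothetical name.

```python
from collections import Counter

def branch_instability(branches):
    """Fraction of runs that diverged from the most common branch."""
    if not branches:
        return 0.0
    # Assumption: the majority branch is treated as the expected one.
    _, majority_count = Counter(branches).most_common(1)[0]
    return (len(branches) - majority_count) / len(branches)

# Decision outcomes for the five runs in the example above.
observed = ["routine", "routine", "escalate", "routine", "escalate"]
instability = branch_instability(observed)  # 2 of 5 runs diverged
```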

Token Budget Pressure

Token budget pressure measures how close your agent runs to its context window limit. An agent consistently operating at 90%+ of its token budget is fragile: one slightly longer input, one verbose tool response, and the run fails or truncates. Budget pressure is measured as the ratio of tokens consumed to the model's context limit, averaged across runs.

Concrete example: a summarization agent is given documents of varying length. Short documents use 20% of the token budget — fine. But long documents push to 95%, and at that point the model starts truncating its own reasoning, producing incomplete outputs. Budget pressure above 80% on average is flagged as UNSTABLE.

Context window: 128,000 tokens
Run 1: 24,300 tokens used  (19%)  ← comfortable
Run 2: 61,200 tokens used  (48%)  ← comfortable
Run 3: 98,400 tokens used  (77%)  ← elevated
Run 4: 121,000 tokens used (95%)  ← CRITICAL, near truncation
Run 5: 115,000 tokens used (90%)  ← CRITICAL

Average budget pressure: 65.6%  → VARIABLE tier
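The average above can be checked with a few lines. This is a sketch with a hypothetical helper name, `budget_pressure`; the only threshold it applies is the 80% average stated earlier in this section.

```python
def budget_pressure(tokens_used, context_limit):
    """Return (average ratio, per-run ratios) of tokens consumed to the context limit."""
    ratios = [used / context_limit for used in tokens_used]
    return sum(ratios) / len(ratios), ratios

# Token counts for the five runs in the example above.
average, per_run = budget_pressure(
    [24_300, 61_200, 98_400, 121_000, 115_000],
    context_limit=128_000,
)
print(f"Average budget pressure: {average:.1%}")  # 65.6%
flagged_unstable = average > 0.80  # threshold stated in this section
```

An average can look comfortable while individual runs sit near the limit, so it is worth inspecting `per_run` as well as `average`.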

Reliability Score

The composite Reliability Score is a weighted penalty model that combines all five metrics into a single 0–100 value. Higher is more reliable.

Reliability Score = 100
  − (path_entropy        × 15)   # max −15
  − (tool_variance       × 20)   # max −20
  − (retry_explosion     × 25)   # max −25 (binary: 0 or 1)
  − (branch_instability  × 20)   # max −20
  − (token_budget_avg    × 20)   # max −20

All metrics normalized to 0–1 before multiplication.
Score clamped to 0–100.

Example:
  path_entropy = 0.2   → −3
  tool_variance = 0.1  → −2
  retry_explosion = 0  → −0
  branch_instability = 0.15 → −3
  token_budget_avg = 0.65   → −13

  Score = 100 − 3 − 2 − 0 − 3 − 13 = 79  → VARIABLE
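The penalty model translates directly into code. The sketch below uses the weights from the formula above; the names `WEIGHTS` and `reliability_score` are illustrative, not part of any published API.

```python
# Penalty weights taken from the formula above.
WEIGHTS = {
    "path_entropy": 15,
    "tool_variance": 20,
    "retry_explosion": 25,   # binary metric: 0 or 1
    "branch_instability": 20,
    "token_budget_avg": 20,
}

def reliability_score(metrics):
    """Composite score: 100 minus weighted penalties, clamped to 0-100."""
    penalty = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(float(metrics[name]), 0.0), 1.0)  # keep each metric in 0-1
        penalty += weight * value
    return max(0.0, min(100.0, 100.0 - penalty))

# The worked example above.
score = reliability_score({
    "path_entropy": 0.2,
    "tool_variance": 0.1,
    "retry_explosion": 0,
    "branch_instability": 0.15,
    "token_budget_avg": 0.65,
})
print(round(score))  # 79
```

The weights sum to 100, so the clamp only takes effect if a metric arrives outside the 0–1 range.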

Reliability Tiers

Every agent run is assigned a reliability tier based on its composite score.

Tier	Score Range	Badge	What it means
STABLE	80–100	🟢 STABLE	Agent behavior is consistent and predictable. Low variance across all metrics.
VARIABLE	60–79	🟡 VARIABLE	Some variance detected. Investigate the dominant penalty metric before shipping to production.
UNSTABLE	40–59	🟠 UNSTABLE	High variance. Agent behavior is unpredictable. Tune the agent before relying on it for production workloads.
CRITICAL	0–39	🔴 CRITICAL	Severe instability. One or more metrics are in a failure state. Do not run in production.
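Mapping a score to a tier is a threshold lookup. In this sketch, `reliability_tier` is a hypothetical helper, and the handling of fractional scores is an assumption: the table lists integer ranges only, so a score such as 79.5 is placed in the lower tier.

```python
# (lower bound, tier) pairs from the table above, highest first.
TIERS = [
    (80, "STABLE"),
    (60, "VARIABLE"),
    (40, "UNSTABLE"),
    (0, "CRITICAL"),
]

def reliability_tier(score):
    """Return the tier label for a composite score in the 0-100 range."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, tier in TIERS:
        if score >= lower_bound:
            return tier
    return "CRITICAL"  # unreachable for valid scores; kept as a safe default

tier = reliability_tier(79)  # "VARIABLE", as in the worked example
```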

Badge Format

The Reliability Layer adds a second dimension to the existing AgentCop trust badge. The combined badge shows both the security trust score and the reliability score in a single line.

✅ SECURED 94/100 | 🟢 STABLE 87/100

Format breakdown:
  ✅  = trust score above 80 (SECURED)
  94/100 = security trust score
  |   = separator
  🟢  = reliability tier indicator
  STABLE = reliability tier label
  87/100 = reliability composite score

Other examples:
  ✅ SECURED 88/100 | 🟡 VARIABLE 71/100
  ⚠️  MODERATE 62/100 | 🟠 UNSTABLE 55/100
  ❌ CRITICAL 31/100 | 🔴 CRITICAL 22/100
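If you need to read the badge programmatically, for example to gate a deployment on the reliability score, the format can be parsed with a regular expression. This is a sketch only: `parse_badge` and the field names are hypothetical, and the pattern assumes both labels are single uppercase words, as in every example above.

```python
import re

# Assumed layout: <icon> <LABEL> <n>/100 | <icon> <LABEL> <n>/100
BADGE_PATTERN = re.compile(
    r"^(?P<trust_icon>\S+)\s+(?P<trust_label>[A-Z]+)\s+(?P<trust_score>\d{1,3})/100"
    r"\s+\|\s+"
    r"(?P<tier_icon>\S+)\s+(?P<tier_label>[A-Z]+)\s+(?P<reliability_score>\d{1,3})/100$"
)

def parse_badge(line):
    """Split a combined badge line into its trust and reliability fields."""
    match = BADGE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a combined badge: {line!r}")
    fields = match.groupdict()
    fields["trust_score"] = int(fields["trust_score"])
    fields["reliability_score"] = int(fields["reliability_score"])
    return fields

badge = parse_badge("✅ SECURED 94/100 | 🟢 STABLE 87/100")
ship_it = badge["trust_score"] > 80 and badge["tier_label"] == "STABLE"
```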

A high security trust score does not imply a high reliability score, and vice versa. An agent can be free of known vulnerabilities and still be wildly unpredictable at runtime. Both dimensions matter.