Guides

Reliability Guide

A complete practical guide to measuring and improving agent reliability — from first instrumentation to reading reports and fixing the specific metrics that are hurting your score.

Instrument your agent with ReliabilityTracer

The ReliabilityTracer is a context manager that wraps any agent invocation. It intercepts tool calls, records timing and outcomes, and writes a structured run record to the local ReliabilityStore. No network calls. No external dependencies. Just a local SQLite store under ~/.agentcop/reliability.db by default.

python
from agentcop.reliability import ReliabilityTracer

# Basic usage — wrap any agent invocation
with ReliabilityTracer(agent_id="my-agent") as tracer:
    result = agent.run("What are the top 5 results for 'AI security'?")

# The tracer automatically records:
#   - tool call sequence and timing
#   - which branches were taken
#   - retry counts per tool
#   - token usage
#   - final output and exit status

# Access the run record directly after the block
run = tracer.run_record
print(f"Reliability score: {run.reliability_score}/100")
print(f"Tier: {run.tier}")
print(f"Dominant penalty: {run.dominant_penalty}")

The tracer can also receive callbacks for real-time monitoring:

python
from agentcop.reliability import ReliabilityTracer

def on_instability(event):
    print(f"[RELIABILITY] {event.metric}: {event.value:.2f} (threshold {event.threshold})")
    # Send to Slack, PagerDuty, etc.

with ReliabilityTracer(
    agent_id="my-agent",
    on_instability=on_instability,
    retry_threshold=3,          # flag retry explosion at 3+ retries
    budget_pressure_warn=0.80,  # flag token budget above 80%
) as tracer:
    result = agent.run(user_input)

Wrap an existing adapter

If you are already using an AgentCop adapter, use wrap_for_reliability to add reliability tracking with one line. This preserves all existing security instrumentation and adds the reliability layer on top.

python
from agentcop import AgentCop
from agentcop.reliability import wrap_for_reliability

# Existing setup
cop = AgentCop(agent_id="my-agent", monitor=True, gate=True)
protected_agent = cop.wrap(my_agent)

# Add reliability tracking on top
reliable_agent = wrap_for_reliability(
    protected_agent,
    agent_id="my-agent",     # must match the AgentCop agent_id
    store_path="~/.agentcop/reliability.db",  # optional, default shown
)

# Use as normal — both security enforcement and reliability tracking active
result = reliable_agent.run(user_input)

Read the reliability report

After running your agent a few times, pull the reliability report from the CLI. You need at least 5 runs before the report is statistically meaningful; the CLI will warn you if you have fewer.

bash
# Print a summary report for a specific agent
agentcop reliability report my-agent

# Example output:
# ─────────────────────────────────────────────────────────
# AgentCop Reliability Report — my-agent
# Runs analyzed: 47  |  Window: last 7 days
# ─────────────────────────────────────────────────────────
# Composite score:      79/100  🟡 VARIABLE
#
# Path entropy:         0.21    (−3.2 pts)
# Tool variance:        0.14    (−2.8 pts)
# Retry explosion:      0.00    (−0.0 pts)  ✓
# Branch instability:   0.18    (−3.6 pts)
# Token budget avg:     0.57    (−11.4 pts) ← dominant penalty
#
# Dominant issue: token_budget_pressure
# Recommendation: reduce average prompt size or switch to a
#   larger context model. 8 of 47 runs exceeded 80% budget.
# ─────────────────────────────────────────────────────────

# Filter by date range
agentcop reliability report my-agent --since 2026-03-01 --until 2026-04-01

# Show individual run breakdown
agentcop reliability report my-agent --runs

Fix instability — what to do for each metric

High path entropy

The agent is taking different routes to the same answer. This usually means the system prompt is too vague, the tool descriptions overlap, or the model is being given too many tools.

  • Reduce the tool set — remove tools the agent does not need for this task.
  • Make tool descriptions unambiguous — if search_knowledge_base and call_external_api could both plausibly answer the same query, one of them needs a tighter description.
  • Add a step-by-step instruction to the system prompt that specifies the expected path for common query types.

High tool variance

The agent is calling tools it shouldn't need for some fraction of runs. Often caused by the model "helpfully" adding an extra step — like sending a confirmation email after completing a task — when the task doesn't require it.

  • Add explicit negative constraints to the system prompt: "Do not call send_email unless explicitly asked."
  • Use the AgentCop Gate to block unexpected tools outright — if a tool should never be called for a given agent type, block it at the policy level rather than hoping the model stays on task.
  • Review run traces for the outlier calls to understand what the model was reasoning about when it made them.

Retry explosion

A tool call is failing repeatedly and the agent has no exit strategy. Add an explicit retry limit in your tool implementation, and instruct the agent to give up and report the failure after N retries.

  • Set max_retries in your tool implementation (most frameworks support this).
  • Add to the system prompt: "If a tool call fails 3 times in a row, stop and report the failure to the user."
  • Use the retry_threshold parameter in ReliabilityTracer to get an alert before the explosion becomes expensive.

Branch instability

The agent is making different classification or routing decisions for equivalent inputs. This is the hardest metric to fix because it often reflects genuine model non-determinism at decision boundaries.

  • Set temperature=0 on the model for classification steps.
  • Use structured output (JSON mode or function calling) for branching decisions — force the model to choose from a fixed set of options rather than freeform reasoning.
  • Add worked examples to the system prompt for the boundary cases where the model is inconsistent.

Token budget pressure

The agent is regularly consuming most of its context window. Sustained high budget pressure leads to truncation, incomplete reasoning, and degraded output quality.

  • Summarize long tool responses before passing them back to the model — raw API responses are often 10x longer than necessary.
  • Use a model with a larger context window for agents that handle long documents.
  • Implement context pruning: drop early turns from the conversation history once they are no longer relevant.
  • Set a hard limit on tool response size in your tool implementation.

LangChain example

A fully instrumented LangChain ReAct agent with both security enforcement and reliability tracking active.

python
from agentcop import AgentCop, GatePolicy
from agentcop.reliability import wrap_for_reliability
from langchain.agents import initialize_agent, AgentType
from langchain.tools import DuckDuckGoSearchRun
from langchain_openai import ChatOpenAI

AGENT_ID = "langchain-search-agent"

# 1. Configure security layer
cop = AgentCop(
    agent_id=AGENT_ID,
    policy=GatePolicy(
        allow=["duckduckgo_search"],
        block=["shell_execute", "file_write"],
    ),
    monitor=True,
)

# 2. Build the LangChain agent
llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [DuckDuckGoSearchRun()]
agent = initialize_agent(
    tools, llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=5,  # explicit bound prevents retry explosion
)

# 3. Wrap with security enforcement
protected_agent = cop.wrap(agent)

# 4. Add reliability tracking on top
reliable_agent = wrap_for_reliability(protected_agent, agent_id=AGENT_ID)

# 5. Run — both layers active
with reliable_agent.tracer as t:
    result = reliable_agent.run("What are the latest AI safety research papers?")
    print(f"Score: {t.run_record.reliability_score}/100 ({t.run_record.tier})")

CrewAI example

Wrapping a CrewAI crew with reliability tracking. Because CrewAI runs multiple agents in a pipeline, use a separate ReliabilityTracer per agent identity to get per-agent scores.

python
from crewai import Agent, Task, Crew
from agentcop.reliability import ReliabilityTracer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

researcher = Agent(
    role="Researcher",
    goal="Find and summarize relevant information",
    llm=llm,
    tools=[search_tool],
)
writer = Agent(
    role="Writer",
    goal="Draft a clear summary from the research",
    llm=llm,
)

task1 = Task(description="Research AI security trends", agent=researcher)
task2 = Task(description="Write a 3-paragraph summary", agent=writer)
crew = Crew(agents=[researcher, writer], tasks=[task1, task2])

# Wrap the full crew execution in a single tracer for pipeline-level reliability
with ReliabilityTracer(agent_id="crewai-research-pipeline") as tracer:
    result = crew.kickoff()

print(f"Pipeline reliability: {tracer.run_record.reliability_score}/100")
print(f"Tier: {tracer.run_record.tier}")

# For per-agent scores, use separate tracers per agent via crew callbacks:
# researcher.step_callback = lambda step: tracer.record_step(step)

Export an HTML report

For sharing with stakeholders or storing as an artifact in CI, export the reliability report as a self-contained HTML file.

bash
# Export HTML report to current directory
agentcop reliability export my-agent --format html

# Outputs: my-agent_reliability_2026-04-08.html
# Self-contained — no external CSS/JS dependencies.

# Specify output path
agentcop reliability export my-agent --format html --output reports/weekly.html

# Export JSON for programmatic use
agentcop reliability export my-agent --format json --output reports/weekly.json

# Export and open in browser immediately (macOS)
agentcop reliability export my-agent --format html --open

The HTML report includes a score summary, trend charts for each metric over time, a per-run breakdown table, and inline recommendations for the highest-penalty metrics.

reliability is not a vanity metric. an agent with a 40/100 reliability score is a liability in production — it will behave differently tomorrow than it did today, and you won't know why until a user reports a bad outcome.