Reliability API
Full reference for the AgentCop Reliability Layer — all classes, methods, fields, and calculators.
ReliabilityTracer
agentcop.reliability.ReliabilityTracer
Context manager that intercepts tool calls during an agent run, computes reliability metrics, and writes a RunRecord to the ReliabilityStore. It can be used standalone or layered on top of an agent wrapped with wrap_for_reliability.
Constructor
ReliabilityTracer(
    agent_id: str,
    store_path: str = "~/.agentcop/reliability.db",
    retry_threshold: int = 3,
    budget_pressure_warn: float = 0.80,
    on_instability: Callable[[InstabilityEvent], None] | None = None,
    labels: dict[str, str] | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| agent_id | str | required | Identifier for the agent. Used to group runs in the store and CLI reports. |
| store_path | str | ~/.agentcop/reliability.db | Path to the SQLite store. Use an absolute path for multi-process setups. |
| retry_threshold | int | 3 | Number of consecutive retries on the same tool call before retry_explosion is flagged. |
| budget_pressure_warn | float | 0.80 | Token budget fraction (0–1) at which the on_instability callback fires mid-run. |
| on_instability | callable | None | Optional callback invoked immediately when a metric exceeds its threshold during a live run. |
| labels | dict | None | Arbitrary key/value metadata stored with the run record. Useful for environment, version, etc. |
Context manager protocol
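from agentcop.reliability import ReliabilityTracer

# `agent` and `user_input` stand for your own agent object and its input.
with ReliabilityTracer(agent_id="my-agent") as tracer:
    result = agent.run(user_input)

# After the block:
tracer.run_record  # RunRecord: populated after __exit__
tracer.elapsed_ms  # int: total run duration in milliseconds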
Methods
| Method | Description |
|---|---|
| record_step(step) | Manually record a tool step. Use with frameworks that expose a step callback instead of supporting context managers directly. |
| record_token_usage(prompt_tokens, completion_tokens, model_limit) | Manually supply token counts when automatic interception is not available. |
| finalize(output, exit_status) | Explicitly close the run record. Called automatically by __exit__; only call manually if not using the context manager. |
ReliabilityStore
agentcop.reliability.ReliabilityStore
SQLite-backed store for run records. Manages persistence, querying, and report generation. The ReliabilityTracer writes to the store automatically; use the store directly only when you need to query historical data or build custom reports.
Constructor
ReliabilityStore(path: str = "~/.agentcop/reliability.db")
record_run
store.record_run(run: RunRecord) -> str
# Returns the run_id assigned to the stored record.
Persists a RunRecord to the store. Called automatically by ReliabilityTracer.__exit__.
get_runs
store.get_runs(
    agent_id: str,
    since: datetime | None = None,
    until: datetime | None = None,
    limit: int = 200,
) -> list[RunRecord]
Returns stored run records for the given agent, ordered by timestamp descending. Optionally filtered by date range. limit caps the result count.
get_report
store.get_report(
    agent_id: str,
    since: datetime | None = None,
    until: datetime | None = None,
    min_runs: int = 5,
) -> ReliabilityReport
Computes and returns a ReliabilityReport aggregated from all matching runs. Raises InsufficientDataError if fewer than min_runs records are found.
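A common use of the date filters is a rolling-window report. The sketch below is one way to build the since/until pair; last_n_days_report is a hypothetical helper, not a library function, and it relies only on the get_report signature shown above. Whether the store expects timezone-aware or naive datetimes is not stated here, so the use of UTC-aware values is an assumption.

```python
from datetime import datetime, timedelta, timezone


def last_n_days_report(store, agent_id, days=7, min_runs=5, now=None):
    """Request a report covering the most recent `days` days.

    `store` is any object exposing get_report(agent_id, since, until, min_runs)
    with the signature documented above. Hypothetical helper for illustration.
    """
    # Assumption: UTC-aware datetimes are acceptable to the store.
    until = now if now is not None else datetime.now(timezone.utc)
    since = until - timedelta(days=days)
    return store.get_report(agent_id, since=since, until=until, min_runs=min_runs)
```

Because the helper does not catch anything, an InsufficientDataError raised by get_report for a sparse window propagates to the caller.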
ReliabilityReport
agentcop.reliability.ReliabilityReport
Dataclass representing the aggregated reliability analysis for an agent across a set of runs. Returned by ReliabilityStore.get_report and by the agentcop reliability report CLI command.
Fields
| Field | Type | Description |
|---|---|---|
| agent_id | str | The agent identifier this report covers. |
| run_count | int | Number of runs included in this report. |
| since | datetime | Earliest run timestamp included. |
| until | datetime | Latest run timestamp included. |
| composite_score | float | Weighted composite reliability score, 0–100. |
| tier | str | STABLE, VARIABLE, UNSTABLE, or CRITICAL. |
| path_entropy_avg | float | Mean path entropy across runs, 0–1. |
| tool_variance_avg | float | Mean tool variance across runs, 0–1. |
| retry_explosion_rate | float | Fraction of runs that triggered retry explosion, 0–1. |
| branch_instability_avg | float | Mean branch instability across runs, 0–1. |
| token_budget_avg | float | Mean token budget pressure across runs, 0–1. |
| dominant_penalty | str | Name of the metric contributing the largest penalty to the composite score. |
| recommendations | list[str] | Human-readable recommendations for the top-penalty metrics. |
| score_trend | list[float] | Composite score per run in chronological order. Useful for trend charts. |
| badge_text | str | Formatted badge string, e.g. "🟢 STABLE 87/100". |
| generated_at | datetime | Timestamp when this report object was created. |
Metric Calculators
Each reliability metric is computed by a dedicated calculator class. The ReliabilityTracer instantiates and runs all calculators automatically. You can also instantiate them directly to compute metrics on your own run data.
PathEntropyCalculator
agentcop.reliability.PathEntropyCalculator
from agentcop.reliability import PathEntropyCalculator
calc = PathEntropyCalculator()
# Feed sequences of tool call names
calc.observe(["search_kb", "summarize", "respond"])
calc.observe(["call_api", "parse", "search_kb", "respond"])
calc.observe(["search_kb", "call_api", "respond"])
entropy = calc.compute() # float 0–1
# entropy = 0.97 in this example (three distinct paths)
Uses Shannon entropy over normalized path frequencies. When every run follows the same path, entropy is 0. When every run follows a different path, entropy approaches 1.
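To make the idea concrete, here is a minimal standalone sketch of normalized Shannon entropy over path frequencies. It does not use the library; normalized_path_entropy is a hypothetical name, and dividing by log2 of the number of observed runs is an assumption. The library's exact normalization is not stated here, so its output may differ from this sketch.

```python
import math
from collections import Counter


def normalized_path_entropy(paths):
    """Shannon entropy of path frequencies, scaled to the range 0-1.

    Assumption: entropy is divided by log2(number of observed runs).
    """
    counts = Counter(tuple(path) for path in paths)
    total = sum(counts.values())
    if total < 2 or len(counts) == 1:
        return 0.0  # no runs, a single run, or one path repeated
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(total)


same = [["search_kb", "respond"]] * 4
mixed = [["a"], ["a"], ["b"], ["c"]]
print(normalized_path_entropy(same))   # 0.0
print(normalized_path_entropy(mixed))  # 0.75
```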
ToolVarianceCalculator
agentcop.reliability.ToolVarianceCalculator
from agentcop.reliability import ToolVarianceCalculator
calc = ToolVarianceCalculator(expected_tools=["web_search", "summarize"])
calc.observe(["web_search", "summarize"])
calc.observe(["web_search", "summarize", "send_email"])
calc.observe(["web_search", "summarize"])
variance = calc.compute() # float 0–1
# variance ≈ 0.17 (one run had an unexpected tool)
Measures the fraction of runs that called one or more tools outside the expected set, weighted by how far outside the norm each unexpected call is. If expected_tools is not provided, the calculator infers the expected set from the first 10 observed runs.
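The unweighted part of this definition can be sketched without the library. The function below, with the hypothetical name unexpected_tool_run_fraction, only computes the fraction of runs containing a tool outside the expected set. It deliberately omits the distance weighting, whose formula is not given here, so its result is not expected to match the calculator's output.

```python
def unexpected_tool_run_fraction(runs, expected_tools):
    """Fraction of runs that called at least one tool outside expected_tools.

    Simplification: no weighting by how unusual each unexpected call is.
    """
    if not runs:
        return 0.0
    expected = set(expected_tools)
    flagged = sum(1 for tools in runs if set(tools) - expected)
    return flagged / len(runs)


runs = [
    ["web_search", "summarize"],
    ["web_search", "summarize", "send_email"],  # send_email is unexpected
    ["web_search", "summarize"],
]
fraction = unexpected_tool_run_fraction(runs, ["web_search", "summarize"])
print(round(fraction, 2))  # 0.33
```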
RetryExplosionDetector
agentcop.reliability.RetryExplosionDetector
from agentcop.reliability import RetryExplosionDetector
detector = RetryExplosionDetector(threshold=3)
# Record individual tool call outcomes during a run
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False) # threshold exceeded
exploded = detector.check() # bool — True if any tool exceeded threshold
details = detector.details() # dict[tool_name, retry_count]
check() returns a boolean rather than a float. A run that triggers retry explosion receives the full −25 penalty; a run that does not receives 0 penalty. The threshold is the number of consecutive failures on the same (tool_name, input_hash) pair before the detector flags the run.
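The consecutive-failure logic and the all-or-nothing penalty can be sketched as follows. This is a standalone illustration, not the library's implementation: ConsecutiveFailureCounter and retry_penalty are hypothetical names, the explicit input_hash argument is an assumption, and flagging only once a streak goes past the threshold is inferred from the example above, where the fourth failure trips a threshold of 3.

```python
from collections import defaultdict


class ConsecutiveFailureCounter:
    """Counts consecutive failures per (tool_name, input_hash) pair."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._streak = defaultdict(int)  # current consecutive failures
        self._worst = defaultdict(int)   # longest streak seen per pair

    def record_call(self, tool_name, input_hash, success):
        key = (tool_name, input_hash)
        if success:
            self._streak[key] = 0  # a success breaks the streak
            return
        self._streak[key] += 1
        self._worst[key] = max(self._worst[key], self._streak[key])

    def exploded(self):
        # Assumption: flag once any streak goes past the threshold.
        return any(n > self.threshold for n in self._worst.values())


def retry_penalty(exploded, full_penalty=25):
    """All-or-nothing penalty: the full amount if flagged, otherwise 0."""
    return full_penalty if exploded else 0


counter = ConsecutiveFailureCounter(threshold=3)
for _ in range(4):
    counter.record_call("fetch_record", "abc123", success=False)
print(counter.exploded(), retry_penalty(counter.exploded()))  # True 25
```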
BranchInstabilityCalculator
agentcop.reliability.BranchInstabilityCalculator
from agentcop.reliability import BranchInstabilityCalculator
calc = BranchInstabilityCalculator()
# Record (input_hash, branch_taken) pairs across runs
calc.observe(input_hash="h1a2b3", branch="routine")
calc.observe(input_hash="h1a2b3", branch="routine")
calc.observe(input_hash="h1a2b3", branch="escalate") # divergence
calc.observe(input_hash="h4c5d6", branch="standard") # different input
instability = calc.compute() # float 0–1
Groups observations by input_hash and measures the fraction of input groups where more than one distinct branch was observed. Input hashing is done automatically by the tracer (semantic similarity bucketing); when using the calculator directly, provide your own stable hash.
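The grouping described above can be sketched in a few lines. This is a standalone illustration that follows the stated definition; branch_instability is a hypothetical function name, and the input hashes are taken as given rather than derived by any similarity bucketing.

```python
from collections import defaultdict


def branch_instability(observations):
    """Fraction of input groups with more than one distinct branch.

    observations: iterable of (input_hash, branch) pairs.
    """
    branches_by_input = defaultdict(set)
    for input_hash, branch in observations:
        branches_by_input[input_hash].add(branch)
    if not branches_by_input:
        return 0.0
    unstable = sum(1 for b in branches_by_input.values() if len(b) > 1)
    return unstable / len(branches_by_input)


observations = [
    ("h1a2b3", "routine"),
    ("h1a2b3", "routine"),
    ("h1a2b3", "escalate"),  # same input, different branch
    ("h4c5d6", "standard"),
]
print(branch_instability(observations))  # 0.5
```

Here one of the two input groups saw more than one branch, so the result is 0.5.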
TokenBudgetCalculator
agentcop.reliability.TokenBudgetCalculator
from agentcop.reliability import TokenBudgetCalculator
calc = TokenBudgetCalculator(model_limit=128_000)
# Record token usage per run
calc.observe(prompt_tokens=24300, completion_tokens=1200)
calc.observe(prompt_tokens=61200, completion_tokens=3400)
calc.observe(prompt_tokens=98400, completion_tokens=5100)
pressure_avg = calc.compute() # float 0–1 (mean budget fraction)
runs_above_warn = calc.above_threshold(warn=0.80) # int
Computes the average of (prompt_tokens + completion_tokens) / model_limit across all observed runs. above_threshold returns the count of runs that exceeded the specified fraction. The model_limit defaults to 128,000 if not specified; pass the actual limit for your model for accurate results.
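The arithmetic can be checked by hand with the three runs from the example. The sketch below is standalone and does not call the library; budget_fractions is a hypothetical helper, and treating "exceeded" as a strict greater-than comparison is an assumption.

```python
def budget_fractions(runs, model_limit=128_000):
    """Per-run (prompt_tokens + completion_tokens) / model_limit."""
    return [(prompt + completion) / model_limit for prompt, completion in runs]


# (prompt_tokens, completion_tokens) for each run in the example above
runs = [(24_300, 1_200), (61_200, 3_400), (98_400, 5_100)]
fractions = budget_fractions(runs)

pressure_avg = sum(fractions) / len(fractions)
# Assumption: a run counts only when it is strictly above the fraction.
runs_above_warn = sum(1 for f in fractions if f > 0.80)

print(round(pressure_avg, 3))  # 0.504
print(runs_above_warn)         # 1
```

Only the third run (103,500 of 128,000 tokens, about 0.81) is above the 0.80 mark.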