API Reference

Reliability API

Full reference for the AgentCop Reliability Layer — all classes, methods, fields, and calculators.

ReliabilityTracer

agentcop.reliability.ReliabilityTracer

Context manager that intercepts tool calls during an agent run, computes reliability metrics, and writes a RunRecord to the ReliabilityStore. Can be used standalone or layered on top of a wrap_for_reliability-wrapped agent.

Constructor

ReliabilityTracer(
    agent_id: str,
    store_path: str = "~/.agentcop/reliability.db",
    retry_threshold: int = 3,
    budget_pressure_warn: float = 0.80,
    on_instability: Callable[[InstabilityEvent], None] | None = None,
    labels: dict[str, str] | None = None,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| agent_id | str | required | Identifier for the agent. Used to group runs in the store and CLI reports. |
| store_path | str | ~/.agentcop/reliability.db | Path to the SQLite store. Use an absolute path for multi-process setups. |
| retry_threshold | int | 3 | Number of consecutive retries on the same tool call before retry_explosion is flagged. |
| budget_pressure_warn | float | 0.80 | Token budget fraction (0–1) at which the on_instability callback fires mid-run. |
| on_instability | callable | None | Optional callback invoked immediately when a metric exceeds its threshold during a live run. |
| labels | dict | None | Arbitrary key/value metadata stored with the run record. Useful for environment, version, etc. |

Context manager protocol

with ReliabilityTracer(agent_id="my-agent") as tracer:
    result = agent.run(input)

# After the block:
tracer.run_record    # RunRecord — populated after __exit__
tracer.elapsed_ms    # int — total run duration in milliseconds

Methods

| Method | Description |
| --- | --- |
| record_step(step) | Manually record a tool step. Use with frameworks that expose a step callback instead of supporting context managers directly. |
| record_token_usage(prompt_tokens, completion_tokens, model_limit) | Manually supply token counts when automatic interception is not available. |
| finalize(output, exit_status) | Explicitly close the run record. Called automatically by __exit__; only call manually if not using the context manager. |

ReliabilityStore

agentcop.reliability.ReliabilityStore

SQLite-backed store for run records. Manages persistence, querying, and report generation. The ReliabilityTracer writes to the store automatically; use the store directly only when you need to query historical data or build custom reports.

Constructor

ReliabilityStore(path: str = "~/.agentcop/reliability.db")

record_run

store.record_run(run: RunRecord) -> str
# Returns the run_id assigned to the stored record.

Persists a RunRecord to the store. Called automatically by ReliabilityTracer.__exit__.

get_runs

store.get_runs(
    agent_id: str,
    since: datetime | None = None,
    until: datetime | None = None,
    limit: int = 200,
) -> list[RunRecord]

Returns stored run records for the given agent, ordered by timestamp descending. Pass since and/or until to restrict the results to a date range; limit caps the number of records returned.
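The filtering and ordering described above can be sketched in plain Python over an in-memory list. This is only an illustration of the semantics, not the store's implementation: the Run dataclass and the select_runs function are hypothetical stand-ins, and treating since and until as inclusive bounds, and applying limit after sorting, are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Run:
    """Hypothetical stand-in for a stored record; not the library's RunRecord."""
    agent_id: str
    timestamp: datetime


def select_runs(runs, agent_id, since=None, until=None, limit=200):
    """Filter by agent and optional date range, newest first, capped at limit."""
    matches = [
        r for r in runs
        if r.agent_id == agent_id
        and (since is None or r.timestamp >= since)   # assumption: inclusive bound
        and (until is None or r.timestamp <= until)   # assumption: inclusive bound
    ]
    matches.sort(key=lambda r: r.timestamp, reverse=True)
    return matches[:limit]  # assumption: the cap keeps the newest records


history = [
    Run("my-agent", datetime(2024, 1, 1)),
    Run("my-agent", datetime(2024, 1, 3)),
    Run("other-agent", datetime(2024, 1, 2)),
    Run("my-agent", datetime(2024, 1, 2)),
]
latest_two = select_runs(history, "my-agent", limit=2)
```

With this reading, a small limit returns the most recent runs rather than the oldest ones.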

get_report

store.get_report(
    agent_id: str,
    since: datetime | None = None,
    until: datetime | None = None,
    min_runs: int = 5,
) -> ReliabilityReport

Computes and returns a ReliabilityReport aggregated from all matching runs. Raises InsufficientDataError if fewer than min_runs records are found.

ReliabilityReport

agentcop.reliability.ReliabilityReport

Dataclass representing the aggregated reliability analysis for an agent across a set of runs. Returned by ReliabilityStore.get_report and by the CLI agentcop reliability report command.

Fields

| Field | Type | Description |
| --- | --- | --- |
| agent_id | str | The agent identifier this report covers. |
| run_count | int | Number of runs included in this report. |
| since | datetime | Earliest run timestamp included. |
| until | datetime | Latest run timestamp included. |
| composite_score | float | Weighted composite reliability score, 0–100. |
| tier | str | STABLE, VARIABLE, UNSTABLE, or CRITICAL. |
| path_entropy_avg | float | Mean path entropy across runs, 0–1. |
| tool_variance_avg | float | Mean tool variance across runs, 0–1. |
| retry_explosion_rate | float | Fraction of runs that triggered retry explosion, 0–1. |
| branch_instability_avg | float | Mean branch instability across runs, 0–1. |
| token_budget_avg | float | Mean token budget pressure across runs, 0–1. |
| dominant_penalty | str | Name of the metric contributing the largest penalty to the composite score. |
| recommendations | list[str] | Human-readable recommendations for the top-penalty metrics. |
| score_trend | list[float] | Composite score per run in chronological order. Useful for trend charts. |
| badge_text | str | Formatted badge string, e.g. "🟢 STABLE 87/100". |
| generated_at | datetime | Timestamp when this report object was created. |

Metric Calculators

Each reliability metric is computed by a dedicated calculator class. The ReliabilityTracer instantiates and runs all calculators automatically. You can also instantiate them directly to compute metrics on your own run data.

PathEntropyCalculator

agentcop.reliability.PathEntropyCalculator

from agentcop.reliability import PathEntropyCalculator

calc = PathEntropyCalculator()

# Feed sequences of tool call names
calc.observe(["search_kb", "summarize", "respond"])
calc.observe(["call_api", "parse", "search_kb", "respond"])
calc.observe(["search_kb", "call_api", "respond"])

entropy = calc.compute()   # float 0–1
# entropy = 0.97 in this example (three distinct paths)

Computes Shannon entropy over normalized path frequencies. If every run follows the same path, the entropy is 0; if every run follows a different path, the entropy approaches 1.
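The calculation can be sketched in plain Python. This is an illustrative reimplementation, not the library's code: the function name path_entropy is hypothetical, and normalizing by log2 of the number of observed runs is an assumption, so the values the library reports may differ.

```python
import math
from collections import Counter


def path_entropy(paths: list[list[str]]) -> float:
    """Normalized Shannon entropy of observed tool-call paths, between 0 and 1."""
    total = len(paths)
    if total < 2:
        return 0.0  # zero or one observation shows no variability
    # Count how often each distinct path (as an ordered tuple) was taken
    counts = Counter(tuple(p) for p in paths)
    # Shannon entropy in bits over the relative frequency of each path
    h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    # Assumption: divide by the maximum entropy possible for `total` runs
    return h / math.log2(total)


same = [["search_kb", "respond"]] * 4             # one path repeated
mixed = [["a", "b"], ["a", "b"], ["a", "c"], ["a", "b"]]
unique = [["a"], ["b"], ["c"]]                    # every run differs

entropy_same = path_entropy(same)
entropy_mixed = path_entropy(mixed)
entropy_unique = path_entropy(unique)
```

Under this normalization a single repeated path scores 0, fully distinct paths score 1 (up to floating-point rounding), and partially repeated paths fall in between.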

ToolVarianceCalculator

agentcop.reliability.ToolVarianceCalculator

from agentcop.reliability import ToolVarianceCalculator

calc = ToolVarianceCalculator(expected_tools=["web_search", "summarize"])

calc.observe(["web_search", "summarize"])
calc.observe(["web_search", "summarize", "send_email"])
calc.observe(["web_search", "summarize"])

variance = calc.compute()   # float 0–1
# variance ≈ 0.17 (one run had an unexpected tool)

Measures the fraction of runs that called one or more tools outside the expected set, weighted by how far outside the norm each unexpected call is. If expected_tools is not provided, the calculator infers the expected set from the first 10 observed runs.
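As a rough illustration of the first half of that definition, the sketch below computes only the unweighted fraction of runs that used a tool outside the expected set. The weighting scheme is not specified here, so it is omitted; the function name unexpected_tool_fraction is hypothetical, and the result will generally not match the weighted value the calculator reports.

```python
def unexpected_tool_fraction(runs: list[list[str]], expected_tools: list[str]) -> float:
    """Unweighted fraction of runs that called at least one unexpected tool."""
    if not runs:
        return 0.0
    expected = set(expected_tools)
    # A run is flagged when any of its tools falls outside the expected set
    flagged = sum(1 for tools in runs if set(tools) - expected)
    return flagged / len(runs)


runs = [
    ["web_search", "summarize"],
    ["web_search", "summarize", "send_email"],  # send_email is unexpected
    ["web_search", "summarize"],
]
fraction = unexpected_tool_fraction(runs, ["web_search", "summarize"])
```

Here one run out of three is flagged, so the unweighted fraction is one third; any weighting applied on top would change that figure.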

RetryExplosionDetector

agentcop.reliability.RetryExplosionDetector

from agentcop.reliability import RetryExplosionDetector

detector = RetryExplosionDetector(threshold=3)

# Record individual tool call outcomes during a run
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)  # threshold exceeded

exploded = detector.check()   # bool — True if any tool exceeded threshold
details = detector.details()  # dict[tool_name, retry_count]

check returns a boolean rather than a float. A run that triggers retry explosion receives the full −25 penalty; a run that does not receives no penalty. The threshold is the number of consecutive failures on the same (tool_name, input_hash) pair allowed before the detector flags the run.
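The consecutive-failure rule can be sketched as a small counter keyed on (tool_name, input_hash). This is not the detector's implementation: the class ConsecutiveFailureCounter and its method names are hypothetical, and two behaviours are assumptions, namely that a success resets the streak and that a run is flagged only when a streak is strictly greater than the threshold.

```python
from collections import defaultdict


class ConsecutiveFailureCounter:
    """Tracks failure streaks per (tool_name, input_hash) pair."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._streak = defaultdict(int)  # current run of consecutive failures
        self._worst = defaultdict(int)   # longest streak seen per pair

    def record(self, tool_name: str, input_hash: str, success: bool) -> None:
        key = (tool_name, input_hash)
        if success:
            self._streak[key] = 0  # assumption: a success resets the streak
            return
        self._streak[key] += 1
        self._worst[key] = max(self._worst[key], self._streak[key])

    def exploded(self) -> bool:
        # Assumption: flagged only when a streak exceeds the threshold
        return any(n > self.threshold for n in self._worst.values())


counter = ConsecutiveFailureCounter(threshold=3)
for _ in range(3):
    counter.record("fetch_record", "h1", success=False)
flagged_after_three = counter.exploded()
counter.record("fetch_record", "h1", success=False)
flagged_after_four = counter.exploded()
```

With a threshold of 3, three failures in a row are tolerated and the fourth flags the run, which mirrors the four record_call lines in the example above.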

BranchInstabilityCalculator

agentcop.reliability.BranchInstabilityCalculator

from agentcop.reliability import BranchInstabilityCalculator

calc = BranchInstabilityCalculator()

# Record (input_hash, branch_taken) pairs across runs
calc.observe(input_hash="h1a2b3", branch="routine")
calc.observe(input_hash="h1a2b3", branch="routine")
calc.observe(input_hash="h1a2b3", branch="escalate")   # divergence
calc.observe(input_hash="h4c5d6", branch="standard")   # different input

instability = calc.compute()   # float 0–1

Groups observations by input_hash and measures the fraction of input groups where more than one distinct branch was observed. Input hashing is done automatically by the tracer (semantic similarity bucketing); when using the calculator directly, provide your own stable hash.
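The grouping rule can be written out in a few lines of plain Python. This is a sketch of the definition as stated, not the library's code, and the function name branch_instability is hypothetical.

```python
from collections import defaultdict


def branch_instability(observations: list[tuple[str, str]]) -> float:
    """Fraction of input groups in which more than one distinct branch was seen."""
    branches_by_input = defaultdict(set)
    for input_hash, branch in observations:
        branches_by_input[input_hash].add(branch)
    if not branches_by_input:
        return 0.0
    # A group is unstable when the same input led to different branches
    unstable = sum(1 for seen in branches_by_input.values() if len(seen) > 1)
    return unstable / len(branches_by_input)


observations = [
    ("h1a2b3", "routine"),
    ("h1a2b3", "routine"),
    ("h1a2b3", "escalate"),  # same input, different branch
    ("h4c5d6", "standard"),
]
instability = branch_instability(observations)
```

For the four observations above there are two input groups and one of them diverged, so this definition gives 0.5.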

TokenBudgetCalculator

agentcop.reliability.TokenBudgetCalculator

from agentcop.reliability import TokenBudgetCalculator

calc = TokenBudgetCalculator(model_limit=128_000)

# Record token usage per run
calc.observe(prompt_tokens=24300, completion_tokens=1200)
calc.observe(prompt_tokens=61200, completion_tokens=3400)
calc.observe(prompt_tokens=98400, completion_tokens=5100)

pressure_avg = calc.compute()   # float 0–1 (mean budget fraction)
runs_above_warn = calc.above_threshold(warn=0.80)  # int

Computes the average of (prompt_tokens + completion_tokens) / model_limit across all observed runs. above_threshold returns the count of runs that exceeded the specified fraction. The model_limit defaults to 128,000 if not specified; pass the actual limit for your model for accurate results.
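The arithmetic for the three runs in the example can be reproduced in plain Python. The helper budget_fractions is a hypothetical name, and counting a run as above the threshold only when its fraction is strictly greater than warn is an assumption.

```python
def budget_fractions(runs: list[tuple[int, int]], model_limit: int = 128_000) -> list[float]:
    """Per-run fraction of the token budget used: (prompt + completion) / limit."""
    return [(prompt + completion) / model_limit for prompt, completion in runs]


# (prompt_tokens, completion_tokens) for each run in the example
usage = [(24_300, 1_200), (61_200, 3_400), (98_400, 5_100)]

fractions = budget_fractions(usage)
pressure_avg = sum(fractions) / len(fractions)
# Assumption: "exceeded" means strictly greater than the warn fraction
runs_above_warn = sum(1 for f in fractions if f > 0.80)
```

The per-run fractions come out at roughly 0.20, 0.50, and 0.81, so the mean is about 0.50 and one run exceeds the 0.80 warning level.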