Reliability API

Full reference for the AgentCop Reliability Layer: all classes, methods, fields, and calculators.

ReliabilityTracer

agentcop.reliability.ReliabilityTracer

Context manager that intercepts tool calls during an agent run, computes reliability metrics, and writes a RunRecord to the ReliabilityStore. It can be used standalone or layered on top of an agent that is already wrapped with wrap_for_reliability.

Constructor

ReliabilityTracer(
    agent_id: str,
    store_path: str = "~/.agentcop/reliability.db",
    retry_threshold: int = 3,
    budget_pressure_warn: float = 0.80,
    on_instability: Callable[[InstabilityEvent], None] | None = None,
    labels: dict[str, str] | None = None,
)

Parameter	Type	Default	Description
`agent_id`	str	required	Identifier for the agent. Used to group runs in the store and CLI reports.
`store_path`	str	`~/.agentcop/reliability.db`	Path to the SQLite store. Use an absolute path for multi-process setups.
`retry_threshold`	int	3	Number of consecutive retries on the same tool call before `retry_explosion` is flagged.
`budget_pressure_warn`	float	0.80	Token budget fraction (0–1) at which the `on_instability` callback fires mid-run.
`on_instability`	callable	None	Optional callback invoked immediately when a metric exceeds its threshold during a live run.
`labels`	dict	None	Arbitrary key/value metadata stored with the run record. Useful for environment, version, etc.
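The advice to use an absolute store_path for multi-process setups can be followed with a small helper like the one below. This is a minimal sketch using only the standard library. resolve_store_path is a hypothetical helper name, and whether the tracer expands "~" on its own is not stated here, so the sketch expands it explicitly.

```python
from pathlib import Path


def resolve_store_path(store_path: str = "~/.agentcop/reliability.db") -> str:
    """Expand "~" and return an absolute path string.

    Hypothetical helper, not part of the library: it only normalizes the
    path so every process refers to the same file regardless of its
    working directory.
    """
    return str(Path(store_path).expanduser().resolve())


absolute_path = resolve_store_path()
print(absolute_path)
```

The resulting string can then be passed as store_path to each process that creates a tracer or a store.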

Context manager protocol

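from agentcop.reliability import ReliabilityTracer

# `agent` is your own agent object; `user_input` is whatever you pass to it.
with ReliabilityTracer(agent_id="my-agent") as tracer:
    result = agent.run(user_input)

# After the block:
tracer.run_record    # RunRecord, populated after __exit__
tracer.elapsed_ms    # int, total run duration in milliseconds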

Methods

Method	Description
`record_step(step)`	Manually record a tool step. Use with frameworks that expose a step callback instead of supporting context managers directly.
`record_token_usage(prompt_tokens, completion_tokens, model_limit)`	Manually supply token counts when automatic interception is not available.
`finalize(output, exit_status)`	Explicitly close the run record. Called automatically by `__exit__`; only call manually if not using the context manager.

ReliabilityStore

agentcop.reliability.ReliabilityStore

SQLite-backed store for run records. Manages persistence, querying, and report generation. The ReliabilityTracer writes to the store automatically; use the store directly only when you need to query historical data or build custom reports.

Constructor

ReliabilityStore(path: str = "~/.agentcop/reliability.db")

record_run

store.record_run(run: RunRecord) -> str
# Returns the run_id assigned to the stored record.

Persists a RunRecord to the store. Called automatically by ReliabilityTracer.__exit__.

get_runs

store.get_runs(
    agent_id: str,
    since: datetime | None = None,
    until: datetime | None = None,
    limit: int = 200,
) -> list[RunRecord]

Returns stored run records for the given agent, ordered by timestamp descending and optionally filtered by date range. The limit parameter caps the number of records returned.
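The query semantics described above (filter by agent, optional date range, newest first, capped by limit) can be pictured with a small standalone sqlite3 sketch. The table name, column names, and inclusive date bounds are assumptions made for illustration; the store's actual schema is not documented here.

```python
import sqlite3
from datetime import datetime


def get_runs(conn, agent_id, since=None, until=None, limit=200):
    """Return (run_id, ts) rows for one agent, newest first.

    Assumed schema: runs(run_id TEXT, agent_id TEXT, ts TEXT) with ISO 8601
    timestamps, which sort correctly as plain strings.
    """
    query = "SELECT run_id, ts FROM runs WHERE agent_id = ?"
    params = [agent_id]
    if since is not None:
        query += " AND ts >= ?"          # inclusive lower bound (assumption)
        params.append(since.isoformat())
    if until is not None:
        query += " AND ts <= ?"          # inclusive upper bound (assumption)
        params.append(until.isoformat())
    query += " ORDER BY ts DESC LIMIT ?"
    params.append(limit)
    return conn.execute(query, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id TEXT, agent_id TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("r1", "my-agent", datetime(2024, 1, 1).isoformat()),
        ("r2", "my-agent", datetime(2024, 1, 2).isoformat()),
        ("r3", "other-agent", datetime(2024, 1, 3).isoformat()),
    ],
)

recent = get_runs(conn, "my-agent", since=datetime(2024, 1, 2))
print(recent)
```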

get_report

store.get_report(
    agent_id: str,
    since: datetime | None = None,
    until: datetime | None = None,
    min_runs: int = 5,
) -> ReliabilityReport

Computes and returns a ReliabilityReport aggregated from all matching runs. Raises InsufficientDataError if fewer than min_runs records are found.
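The min_runs guard can be illustrated without the library. In the sketch below, InsufficientDataError is a local stand-in for the library's exception, and mean_composite_score is a hypothetical function that averages per-run scores. The real report aggregates many more fields, and its composite score may not be a plain mean.

```python
from statistics import mean


class InsufficientDataError(Exception):
    """Local stand-in; the library defines its own exception of this name."""


def mean_composite_score(run_scores, min_runs=5):
    """Average per-run scores, refusing to report on too few runs."""
    if len(run_scores) < min_runs:
        raise InsufficientDataError(
            f"need at least {min_runs} runs, found {len(run_scores)}"
        )
    return mean(run_scores)


try:
    mean_composite_score([91.0, 84.0, 88.0])
except InsufficientDataError as exc:
    # Callers can skip the report or lower min_runs for new agents.
    print(f"report skipped: {exc}")
```

Catching the exception explicitly keeps dashboards and CI jobs from treating a newly deployed agent with only a handful of runs as a failure.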

ReliabilityReport

agentcop.reliability.ReliabilityReport

Dataclass representing the aggregated reliability analysis for an agent across a set of runs. Returned by ReliabilityStore.get_report and by the agentcop reliability report CLI command.

Fields

Field	Type	Description
`agent_id`	str	The agent identifier this report covers.
`run_count`	int	Number of runs included in this report.
`since`	datetime	Earliest run timestamp included.
`until`	datetime	Latest run timestamp included.
`composite_score`	float	Weighted composite reliability score, 0–100.
`tier`	str	`STABLE`, `VARIABLE`, `UNSTABLE`, or `CRITICAL`.
`path_entropy_avg`	float	Mean path entropy across runs, 0–1.
`tool_variance_avg`	float	Mean tool variance across runs, 0–1.
`retry_explosion_rate`	float	Fraction of runs that triggered retry explosion, 0–1.
`branch_instability_avg`	float	Mean branch instability across runs, 0–1.
`token_budget_avg`	float	Mean token budget pressure across runs, 0–1.
`dominant_penalty`	str	Name of the metric contributing the largest penalty to the composite score.
`recommendations`	list[str]	Human-readable recommendations for the top-penalty metrics.
`score_trend`	list[float]	Composite score per run in chronological order. Useful for trend charts.
`badge_text`	str	Formatted badge string, e.g. `"🟢 STABLE 87/100"`.
`generated_at`	datetime	Timestamp when this report object was created.

Metric Calculators

Each reliability metric is computed by a dedicated calculator class. The ReliabilityTracer instantiates and runs all calculators automatically. You can also instantiate them directly to compute metrics on your own run data.

PathEntropyCalculator

agentcop.reliability.PathEntropyCalculator

from agentcop.reliability import PathEntropyCalculator

calc = PathEntropyCalculator()

# Feed sequences of tool call names
calc.observe(["search_kb", "summarize", "respond"])
calc.observe(["call_api", "parse", "search_kb", "respond"])
calc.observe(["search_kb", "call_api", "respond"])

entropy = calc.compute()   # float 0–1
# entropy = 0.97 in this example (three distinct paths)

Uses Shannon entropy over normalized path frequencies. If every run follows the same path, entropy is 0; if every run follows a different path, entropy approaches 1.
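To make the idea concrete, here is a minimal standalone sketch of Shannon entropy over path frequencies, scaled to the 0–1 range. The choice to divide by log2 of the number of runs is an assumption; the library's exact normalization may differ, so values from this sketch need not match the library's. normalized_path_entropy is a hypothetical name.

```python
import math
from collections import Counter


def normalized_path_entropy(paths):
    """Shannon entropy of path frequencies, scaled to 0-1.

    Assumption: entropy is divided by log2(number of runs), the maximum
    reached when every run takes a different path.
    """
    n = len(paths)
    if n <= 1:
        return 0.0
    counts = Counter(tuple(p) for p in paths)
    # sum of p * log2(1 / p) over each distinct path
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(n)


same = [["search_kb", "respond"]] * 4
mixed = [["search_kb", "respond"]] * 3 + [["call_api", "respond"]]

print(normalized_path_entropy(same))    # every run identical
print(normalized_path_entropy(mixed))   # one run diverged
```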

ToolVarianceCalculator

agentcop.reliability.ToolVarianceCalculator

from agentcop.reliability import ToolVarianceCalculator

calc = ToolVarianceCalculator(expected_tools=["web_search", "summarize"])

calc.observe(["web_search", "summarize"])
calc.observe(["web_search", "summarize", "send_email"])
calc.observe(["web_search", "summarize"])

variance = calc.compute()   # float 0–1
# variance ≈ 0.17 (one run had an unexpected tool)

Measures the fraction of runs that called one or more tools outside the expected set, weighted by how far outside the norm each unexpected call is. If expected_tools is not provided, the calculator infers the expected set from the first 10 observed runs.
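The unweighted core of this metric, the fraction of runs that called at least one tool outside the expected set, can be sketched as follows. The weighting step is not specified in this reference, so the sketch deliberately omits it and its result will generally differ from the library's value. unexpected_tool_fraction is a hypothetical name.

```python
def unexpected_tool_fraction(runs, expected_tools):
    """Fraction of runs that used any tool outside expected_tools.

    Unweighted simplification: each offending run counts once, no matter
    how many unexpected tools it called.
    """
    if not runs:
        return 0.0
    expected = set(expected_tools)
    flagged = sum(1 for tools in runs if set(tools) - expected)
    return flagged / len(runs)


runs = [
    ["web_search", "summarize"],
    ["web_search", "summarize", "send_email"],   # send_email is unexpected
    ["web_search", "summarize"],
]
fraction = unexpected_tool_fraction(runs, ["web_search", "summarize"])
print(fraction)
```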

RetryExplosionDetector

agentcop.reliability.RetryExplosionDetector

from agentcop.reliability import RetryExplosionDetector

detector = RetryExplosionDetector(threshold=3)

# Record individual tool call outcomes during a run
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)
detector.record_call("fetch_record", success=False)  # threshold exceeded

exploded = detector.check()   # bool — True if any tool exceeded threshold
details = detector.details()  # dict mapping tool name (str) to retry count (int)

Returns a boolean rather than a float: a run that triggers retry explosion receives the full −25 penalty, and a run that does not receives no penalty. The threshold is the number of consecutive failures on the same (tool_name, input_hash) pair before the detector flags the run.
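Consecutive-failure tracking keyed on a (tool_name, input_hash) pair can be sketched in a few lines. ConsecutiveFailureCounter is a hypothetical class, not the library's implementation. It assumes that a success resets the streak and that the flag trips once a streak exceeds the threshold, matching the example above where the fourth failure trips a threshold of 3.

```python
from collections import defaultdict


class ConsecutiveFailureCounter:
    """Illustrative stand-in for a retry-explosion check."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._streak = defaultdict(int)   # current run of failures per key
        self._worst = defaultdict(int)    # longest run of failures per key

    def record_call(self, tool_name, input_hash, success):
        key = (tool_name, input_hash)
        if success:
            self._streak[key] = 0         # a success ends the streak
        else:
            self._streak[key] += 1
            self._worst[key] = max(self._worst[key], self._streak[key])

    def check(self):
        """True if any key's longest failure streak exceeded the threshold."""
        return any(n > self.threshold for n in self._worst.values())


counter = ConsecutiveFailureCounter(threshold=3)
for _ in range(3):
    counter.record_call("fetch_record", "abc123", success=False)
after_three = counter.check()
counter.record_call("fetch_record", "abc123", success=False)
after_four = counter.check()
print(after_three, after_four)
```

Keying on the input hash as well as the tool name keeps repeated failures on different inputs from being counted as one retry loop.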

BranchInstabilityCalculator

agentcop.reliability.BranchInstabilityCalculator

from agentcop.reliability import BranchInstabilityCalculator

calc = BranchInstabilityCalculator()

# Record (input_hash, branch_taken) pairs across runs
calc.observe(input_hash="h1a2b3", branch="routine")
calc.observe(input_hash="h1a2b3", branch="routine")
calc.observe(input_hash="h1a2b3", branch="escalate")   # divergence
calc.observe(input_hash="h4c5d6", branch="standard")   # different input

instability = calc.compute()   # float 0–1

Groups observations by input_hash and measures the fraction of input groups where more than one distinct branch was observed. Input hashing is done automatically by the tracer (semantic similarity bucketing); when using the calculator directly, provide your own stable hash.
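The grouping rule described above is simple enough to sketch directly. branch_instability below is a hypothetical standalone function that takes (input_hash, branch) pairs; it does no semantic bucketing of its own and treats the hashes as given.

```python
from collections import defaultdict


def branch_instability(observations):
    """Fraction of input groups in which more than one branch was seen."""
    branches_by_input = defaultdict(set)
    for input_hash, branch in observations:
        branches_by_input[input_hash].add(branch)
    if not branches_by_input:
        return 0.0
    unstable = sum(1 for seen in branches_by_input.values() if len(seen) > 1)
    return unstable / len(branches_by_input)


observations = [
    ("h1a2b3", "routine"),
    ("h1a2b3", "routine"),
    ("h1a2b3", "escalate"),   # same input, different branch
    ("h4c5d6", "standard"),   # different input, single branch
]
print(branch_instability(observations))   # 0.5: one of two input groups diverged
```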

TokenBudgetCalculator

agentcop.reliability.TokenBudgetCalculator

from agentcop.reliability import TokenBudgetCalculator

calc = TokenBudgetCalculator(model_limit=128_000)

# Record token usage per run
calc.observe(prompt_tokens=24300, completion_tokens=1200)
calc.observe(prompt_tokens=61200, completion_tokens=3400)
calc.observe(prompt_tokens=98400, completion_tokens=5100)

pressure_avg = calc.compute()   # float 0–1 (mean budget fraction)
runs_above_warn = calc.above_threshold(warn=0.80)  # int

Computes the average of (prompt_tokens + completion_tokens) / model_limit across all observed runs. above_threshold returns the count of runs that exceeded the specified fraction. The model_limit defaults to 128,000 if not specified; pass the actual limit for your model for accurate results.
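The arithmetic can be checked by hand with the three runs from the example above. This standalone sketch assumes that "exceeded" means strictly greater than the warn fraction; budget_fractions is a hypothetical helper name.

```python
def budget_fractions(runs, model_limit=128_000):
    """Per-run (prompt_tokens + completion_tokens) / model_limit."""
    return [(prompt + completion) / model_limit for prompt, completion in runs]


runs = [(24300, 1200), (61200, 3400), (98400, 5100)]
fractions = budget_fractions(runs)

pressure_avg = sum(fractions) / len(fractions)
runs_above_warn = sum(1 for f in fractions if f > 0.80)

print(round(pressure_avg, 3))   # 0.504
print(runs_above_warn)          # 1 (the third run uses about 81% of the limit)
```

A mean near 0.5 can hide individual runs close to the limit, which is why the count of runs above the warn fraction is worth tracking alongside the average.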
