CAUM is a structural observation layer for AI agents. It detects loops, stagnation, and inefficient compute use during autonomous agent execution by analyzing execution patterns, without interpreting prompts or payloads, and has been validated on 80,036 real agent sessions.
## Key Features
- Behavioral Monitoring: tracks how an agent operates by evaluating tool-call diversity, trajectory geometry, and regime transitions.
- Observation Only: records agent behavior without intervening or making decisions on the agent's behalf.
## Performance Insights
Validated on 80,036 real agent sessions from the nebius/SWE-agent-trajectories dataset:
| Metric | Value |
|---|---|
| AUC @ step 10 | 0.741 |
| AUC @ full session | 0.814 |
| Cohen's d | +0.977 |
| Probability of failure under LOOP regime | 88.7% |
| LOOP detection F1 Score | 0.742 |
| Average session length (failed vs. successful) | 2x longer |
| Estimated compute savings with 10K runs per day | ~$1.7M/year |
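The headline savings figure can be reproduced with back-of-the-envelope arithmetic. In the sketch below, only `LOOP_FAIL_PROB` comes from the table; the LOOP rate, per-run cost, and early-stop saving fraction are illustrative assumptions chosen to show how a number of this magnitude could arise, not published CAUM figures.

```python
# Back-of-the-envelope estimate of compute savings from halting LOOP sessions
# early. Only LOOP_FAIL_PROB comes from the table above; every other constant
# is an illustrative assumption.

RUNS_PER_DAY = 10_000
LOOP_RATE = 0.20          # assumed fraction of runs that enter a LOOP regime
LOOP_FAIL_PROB = 0.887    # from the table: P(failure | LOOP)
COST_PER_RUN = 3.50       # assumed average compute cost of a full session, USD
EARLY_STOP_SAVING = 0.75  # assumed fraction of a doomed run's cost avoided
                          # (failed sessions run ~2x longer, so stopping early
                          # skips most of the tail)

def annual_savings(runs_per_day=RUNS_PER_DAY):
    doomed_runs = runs_per_day * LOOP_RATE * LOOP_FAIL_PROB
    daily = doomed_runs * COST_PER_RUN * EARLY_STOP_SAVING
    return daily * 365

print(f"~${annual_savings():,.0f}/year")
```

Under these assumptions the estimate lands near the ~$1.7M/year in the table; with your own cost and LOOP-rate figures the result will differ.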
Cross-model validation shows the structural signals transfer across model sizes without retraining; full-session AUC stays consistent even as early-step AUC varies:
| Model | AUC @ step 10 | AUC @ full |
|---|---|---|
| Llama 8B | 0.730 | 0.816 |
| Llama 70B | 0.668 | 0.778 |
| Llama 405B | 0.584 | 0.776 |
## How It Works
CAUM processes agent activity through the following pipeline:

```
Agent Steps → SBERT Embeddings → Trajectory Analysis → Regime Classification → UDS Score + Attestation
```
Five structural signals drive the analysis:
- TCR (Tool Coherence Ratio): assesses tool diversity.
- ESR (Execution Substance Ratio): measures how substantive each step is.
- SCI (Structural Coherence Index): evaluates trajectory progress.
- ZT Similarity: zero-trust cosine similarity between consecutive steps.
- Regime: classifies activity as EXPLORER, GRIND, STAGNATION, or LOOP.
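Two of these signals are simple enough to sketch. The snippet below computes a tool-coherence ratio as unique tools over total calls, and approximates step-to-step similarity with cosine similarity over bag-of-words vectors. CAUM uses SBERT embeddings; the word-count vectorizer here is a stand-in, and both formulas are illustrative guesses at the real definitions, not CAUM's published math.

```python
# Illustrative sketches of two structural signals; the real CAUM definitions
# may differ, and SBERT embeddings replace the bag-of-words stand-in here.
from collections import Counter
import math

def tool_coherence_ratio(tools):
    """Illustrative TCR: unique tools / total calls (1.0 = all distinct)."""
    return len(set(tools)) / len(tools) if tools else 0.0

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def step_similarities(steps):
    """Cosine similarity between consecutive steps (SBERT stand-in)."""
    vecs = [Counter(s.lower().split()) for s in steps]
    return [cosine(a, b) for a, b in zip(vecs, vecs[1:])]

steps = ["find authentication.py in /src",
         "find authentication.py in /src",  # exact repeat: similarity 1.0
         "fixed algorithm parameter in jwt.decode"]
print(tool_coherence_ratio(["file_search", "file_search", "code_editor"]))
print(step_similarities(steps))
```

A run of consecutive similarities pinned near 1.0 is the kind of pattern a LOOP classification would flag.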
## Example Output
```python
{
    "uds": 0.73,   # Unified Dynamic Score [0-1]
    "tier": "T2",  # Tier level, T1 (healthy) → T5 (critical)
    "dominant_regime": "EXPLORER",
    "shields_fired": 0,
    "advisory": {"level": 0, "level_name": "CLEAR"}
}
```
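A consumer of this report might branch on the tier and advisory level. The thresholds below (halt at T4/T5 or advisory level ≥ 2, flag at T3 or a degrading regime) are hypothetical policy choices for illustration, not part of CAUM:

```python
def triage(report):
    """Map a CAUM-style report to an action; thresholds are illustrative."""
    tier = int(report["tier"].lstrip("T"))  # "T2" -> 2
    level = report["advisory"]["level"]
    if tier >= 4 or level >= 2:
        return "halt"      # critical structure: stop the agent run
    if tier == 3 or report["dominant_regime"] in ("GRIND", "STAGNATION"):
        return "flag"      # degrading: alert a human, keep running
    return "continue"      # healthy

report = {"uds": 0.73, "tier": "T2", "dominant_regime": "EXPLORER",
          "shields_fired": 0, "advisory": {"level": 0, "level_name": "CLEAR"}}
print(triage(report))  # continue
```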
## Use Cases
Use CAUM to:
- Analyze step-by-step actions of agents in real-time.
- Score JSONL files directly for batch analysis.
## Quick Examples
To analyze steps in real time:

```python
from caum_monitor_v10 import ZeroTrustAuditor

auditor = ZeroTrustAuditor()
auditor.push("file_search", "find authentication.py in /src")
auditor.push("file_reader", "reading JWT validation at line 42")
auditor.push("code_editor", "fixed algorithm parameter in jwt.decode")
auditor.push("test_runner", "pytest — 12 passed in 1.2s")
auditor.push("submit", "PR #142 submitted")

report = auditor.finalize()
print(f"UDS={report['uds']} tier={report['tier']} regime={report['dominant_regime']}")
```
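For the batch path, a JSONL file can be scored by replaying each recorded step through the same push/finalize API. The schema below (one JSON object per line with `tool` and `content` fields) is an assumption; adjust the keys to match your logs. A stub stands in for `ZeroTrustAuditor` so the sketch is self-contained:

```python
import json
import tempfile

class _StubAuditor:
    """Stand-in for ZeroTrustAuditor so this sketch runs without the SDK."""
    def __init__(self):
        self._steps = []
    def push(self, tool, content):
        self._steps.append((tool, content))
    def finalize(self):
        return {"steps_scored": len(self._steps)}

def score_jsonl(path, auditor_factory=_StubAuditor):
    """Replay recorded steps from a JSONL file through an auditor.

    Assumes one JSON object per line with "tool" and "content" keys; pass
    ZeroTrustAuditor as auditor_factory to get a real report.
    """
    auditor = auditor_factory()
    with open(path) as f:
        for line in f:
            if line.strip():
                step = json.loads(line)
                auditor.push(step["tool"], step["content"])
    return auditor.finalize()

# Demo with a throwaway file in the assumed schema.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"tool": "file_search", "content": "find auth.py"}\n')
    f.write('{"tool": "code_editor", "content": "patch jwt.decode"}\n')
    demo_path = f.name

print(score_jsonl(demo_path))  # {'steps_scored': 2}
```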
## REST API Example
```shell
curl -X POST https://caum-observation-production.up.railway.app/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"trajectory": [{"tool": "search", "content": "..."}, ...]}'
```
## Behavioral Regimes
CAUM classifies agent behavior into regimes, each with a measured resolve rate:
| Regime | Resolve Rate | Description |
|---|---|---|
| EXPLORER | 67.8% | Healthy progress |
| WARMING_UP | 56.6% | Early steps; too little signal to classify confidently |
| GRIND | 36.9% | Warning: repetitive but still advancing |
| LOOP | 11.3% | High probability of failure; critical |
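These resolve rates support a simple expected-value stopping rule: keep running only while the current regime's resolve rate times the value of a success exceeds the cost of finishing the session. The resolve rates below come from the table; the value and cost figures are invented for illustration.

```python
# Expected-value stopping rule. Resolve rates are from the table above;
# value_of_success and cost_to_finish are illustrative assumptions.
RESOLVE_RATE = {"EXPLORER": 0.678, "WARMING_UP": 0.566,
                "GRIND": 0.369, "LOOP": 0.113}

def should_continue(regime, value_of_success=10.0, cost_to_finish=2.0):
    """Continue only if expected payoff beats the remaining compute cost."""
    return RESOLVE_RATE[regime] * value_of_success > cost_to_finish

print(should_continue("EXPLORER"))  # True:  0.678 * 10 = 6.78 > 2
print(should_continue("LOOP"))      # False: 0.113 * 10 = 1.13 < 2
```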
## Deployment Options
Choose the mode that fits your setup:
- Forensic: Batch analysis of past sessions for compliance.
- Live API: Real-time monitoring with alerting.
- Enterprise SDK: Integrated within existing infrastructure with no data transfer.
To try it on your own sessions, visit caum.systems/upload for a free structural observation report.