CAUM is a structural observation layer for AI agents. It detects loops, stagnation, and inefficient compute use during autonomous agent execution by analyzing execution patterns, without interpreting prompts or payloads, and has been validated on 80,036 real agent sessions.
## Key Features
- Behavioral Monitoring: tracks how an agent operates by evaluating tool-call diversity, trajectory geometry, and regime transitions.
- Observation Only: records agent behavior without intervening or making decisions on the agent's behalf.
## Performance Insights
Validated on 80,036 real agent sessions from the nebius/SWE-agent-trajectories dataset:
| Metric | Value |
|---|---|
| AUC @ step 10 | 0.741 |
| AUC @ full session | 0.814 |
| Cohen's d | +0.977 |
| Probability of failure under LOOP regime | 88.7% |
| LOOP detection F1 Score | 0.742 |
| Average session length (failed vs. successful) | 2x longer |
| Estimated compute savings with 10K runs per day | ~$1.7M/year |
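The headline savings figure can be reproduced with back-of-the-envelope arithmetic. In the sketch below, only `LOOP_FAIL_PROB` comes from the table; the LOOP rate, per-run cost, and early-stop saving fraction are illustrative assumptions chosen to show how a number of this magnitude could arise, not published CAUM figures.

```python
# Back-of-the-envelope estimate of compute savings from halting LOOP sessions
# early. Only LOOP_FAIL_PROB comes from the table above; every other constant
# is an illustrative assumption.

RUNS_PER_DAY = 10_000
LOOP_RATE = 0.20          # assumed fraction of runs that enter a LOOP regime
LOOP_FAIL_PROB = 0.887    # from the table: P(failure | LOOP)
COST_PER_RUN = 3.50       # assumed average compute cost of a full session, USD
EARLY_STOP_SAVING = 0.75  # assumed fraction of a doomed run's cost avoided
                          # (failed sessions run ~2x longer, so stopping early
                          # skips most of the tail)

def annual_savings(runs_per_day=RUNS_PER_DAY):
    doomed_runs = runs_per_day * LOOP_RATE * LOOP_FAIL_PROB
    daily = doomed_runs * COST_PER_RUN * EARLY_STOP_SAVING
    return daily * 365

print(f"~${annual_savings():,.0f}/year")
```

Under these assumptions the estimate lands near the ~$1.7M/year in the table; with your own cost and LOOP-rate figures the result will differ.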
Cross-model validation shows the structural signals transfer across model sizes without retraining; full-session AUC stays consistent even as early-step AUC varies:
| Model | AUC @ step 10 | AUC @ full |
|---|---|---|
| Llama 8B | 0.730 | 0.816 |
| Llama 70B | 0.668 | 0.778 |
| Llama 405B | 0.584 | 0.776 |
## How It Works
CAUM processes agent activity through the following pipeline:

```
Agent Steps → SBERT Embeddings → Trajectory Analysis → Regime Classification → UDS Score + Attestation
```
Five structural signals drive the analysis:
- TCR (Tool Coherence Ratio): assesses tool diversity.
- ESR (Execution Substance Ratio): measures how substantive each step is.
- SCI (Structural Coherence Index): evaluates trajectory progress.
- ZT Similarity: zero-trust cosine similarity between consecutive steps.
- Regime: classifies activity as EXPLORER, GRIND, STAGNATION, or LOOP.
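Two of these signals are simple enough to sketch. The snippet below computes a tool-coherence ratio as unique tools over total calls, and approximates step-to-step similarity with cosine similarity over bag-of-words vectors. CAUM uses SBERT embeddings; the word-count vectorizer here is a stand-in, and both formulas are illustrative guesses at the real definitions, not CAUM's published math.

```python
# Illustrative sketches of two structural signals; the real CAUM definitions
# may differ, and SBERT embeddings replace the bag-of-words stand-in here.
from collections import Counter
import math

def tool_coherence_ratio(tools):
    """Illustrative TCR: unique tools / total calls (1.0 = all distinct)."""
    return len(set(tools)) / len(tools) if tools else 0.0

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def step_similarities(steps):
    """Cosine similarity between consecutive steps (SBERT stand-in)."""
    vecs = [Counter(s.lower().split()) for s in steps]
    return [cosine(a, b) for a, b in zip(vecs, vecs[1:])]

steps = ["find authentication.py in /src",
         "find authentication.py in /src",  # exact repeat: similarity 1.0
         "fixed algorithm parameter in jwt.decode"]
print(tool_coherence_ratio(["file_search", "file_search", "code_editor"]))
print(step_similarities(steps))
```

A run of consecutive similarities pinned near 1.0 is the kind of pattern a LOOP classification would flag.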
## Example Output
```python
{
    "uds": 0.73,   # Unified Dynamic Score [0-1]
    "tier": "T2",  # Tier level, T1 (healthy) → T5 (critical)
    "dominant_regime": "EXPLORER",
    "shields_fired": 0,
    "advisory": {"level": 0, "level_name": "CLEAR"}
}
```
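A consumer of this report might branch on the tier and advisory level. The thresholds below (halt at T4/T5 or advisory level ≥ 2, flag at T3 or a degrading regime) are hypothetical policy choices for illustration, not part of CAUM:

```python
def triage(report):
    """Map a CAUM-style report to an action; thresholds are illustrative."""
    tier = int(report["tier"].lstrip("T"))  # "T2" -> 2
    level = report["advisory"]["level"]
    if tier >= 4 or level >= 2:
        return "halt"      # critical structure: stop the agent run
    if tier == 3 or report["dominant_regime"] in ("GRIND", "STAGNATION"):
        return "flag"      # degrading: alert a human, keep running
    return "continue"      # healthy

report = {"uds": 0.73, "tier": "T2", "dominant_regime": "EXPLORER",
          "shields_fired": 0, "advisory": {"level": 0, "level_name": "CLEAR"}}
print(triage(report))  # continue
```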
## Use Cases
Use CAUM to:
- Analyze step-by-step actions of agents in real-time.
- Score JSONL files directly for batch analysis.
## Quick Examples
To analyze steps in real time:

```python
from caum_monitor_v10 import ZeroTrustAuditor

auditor = ZeroTrustAuditor()
auditor.push("file_search", "find authentication.py in /src")
auditor.push("file_reader", "reading JWT validation at line 42")
auditor.push("code_editor", "fixed algorithm parameter in jwt.decode")
auditor.push("test_runner", "pytest — 12 passed in 1.2s")
auditor.push("submit", "PR #142 submitted")

report = auditor.finalize()
print(f"UDS={report['uds']} tier={report['tier']} regime={report['dominant_regime']}")
```
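For the batch path, a JSONL file can be scored by replaying each recorded step through the same push/finalize API. The schema below (one JSON object per line with `tool` and `content` fields) is an assumption; adjust the keys to match your logs. A stub stands in for `ZeroTrustAuditor` so the sketch is self-contained:

```python
import json
import tempfile

class _StubAuditor:
    """Stand-in for ZeroTrustAuditor so this sketch runs without the SDK."""
    def __init__(self):
        self._steps = []
    def push(self, tool, content):
        self._steps.append((tool, content))
    def finalize(self):
        return {"steps_scored": len(self._steps)}

def score_jsonl(path, auditor_factory=_StubAuditor):
    """Replay recorded steps from a JSONL file through an auditor.

    Assumes one JSON object per line with "tool" and "content" keys; pass
    ZeroTrustAuditor as auditor_factory to get a real report.
    """
    auditor = auditor_factory()
    with open(path) as f:
        for line in f:
            if line.strip():
                step = json.loads(line)
                auditor.push(step["tool"], step["content"])
    return auditor.finalize()

# Demo with a throwaway file in the assumed schema.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"tool": "file_search", "content": "find auth.py"}\n')
    f.write('{"tool": "code_editor", "content": "patch jwt.decode"}\n')
    demo_path = f.name

print(score_jsonl(demo_path))  # {'steps_scored': 2}
```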
## REST API Example
```shell
curl -X POST https://caum-observation-production.up.railway.app/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"trajectory": [{"tool": "search", "content": "..."}, ...]}'
```
## Behavioral Regimes
CAUM classifies agent behavior into regimes, each with a measured resolve rate:
| Regime | Resolve Rate | Description |
|---|---|---|
| EXPLORER | 67.8% | Healthy progress |
| WARMING_UP | 56.6% | Early steps; too little signal to classify confidently |
| GRIND | 36.9% | Warning: repetitive but still advancing |
| LOOP | 11.3% | High probability of failure; critical |
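These resolve rates support a simple expected-value stopping rule: keep running only while the current regime's resolve rate times the value of a success exceeds the cost of finishing the session. The resolve rates below come from the table; the value and cost figures are invented for illustration.

```python
# Expected-value stopping rule. Resolve rates are from the table above;
# value_of_success and cost_to_finish are illustrative assumptions.
RESOLVE_RATE = {"EXPLORER": 0.678, "WARMING_UP": 0.566,
                "GRIND": 0.369, "LOOP": 0.113}

def should_continue(regime, value_of_success=10.0, cost_to_finish=2.0):
    """Continue only if expected payoff beats the remaining compute cost."""
    return RESOLVE_RATE[regime] * value_of_success > cost_to_finish

print(should_continue("EXPLORER"))  # True:  0.678 * 10 = 6.78 > 2
print(should_continue("LOOP"))      # False: 0.113 * 10 = 1.13 < 2
```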
## Deployment Options
Choose the mode that fits your setup:
- Forensic: Batch analysis of past sessions for compliance.
- Live API: Real-time monitoring with alerting.
- Enterprise SDK: Integrated within existing infrastructure with no data transfer.
To try it on your own sessions, visit caum.systems/upload for a free structural observation report.