Monitor AI agent performance by detecting loops and wasted resources.
Pitch

CAUM is an observation layer for AI agents that analyzes execution patterns without interpreting prompts or outputs. It identifies loops and stagnation in autonomous agent runs so that wasted compute can be caught early. It has been validated on roughly 80K real agent sessions.

Description

CAUM is a structural observation layer designed for AI agents. It detects loops, stagnation, and inefficient compute usage during autonomous agent execution, without interpreting prompts or payloads.

Key Features

  • Behavioral Monitoring: CAUM tracks how an agent operates, evaluating tool call diversity, trajectory geometry, and regime transitions.
  • Independent Recording: CAUM records what agents do; it does not make decisions on their behalf.

Performance Insights

Validated on 80,036 real agent sessions from the nebius/SWE-agent-trajectories dataset, CAUM reports the following results:

| Metric | Value |
| --- | --- |
| AUC @ step 10 | 0.741 |
| AUC @ full session | 0.814 |
| Cohen's d | +0.977 |
| Probability of failure under LOOP regime | 88.7% |
| LOOP detection F1 score | 0.742 |
| Average session length (failed vs. successful) | 2x longer |
| Estimated compute savings with 10K runs per day | ~$1.7M/year |
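AUC and Cohen's d both summarize how well a per-session score separates two groups of sessions (for example, resolved vs. failed). The following is a minimal sketch of how such metrics are commonly computed, assuming each session has one scalar score. The function names and toy scores are hypothetical; this is not CAUM's evaluation code.

```python
from statistics import mean, stdev

def auc(pos, neg):
    """Probability that a random `pos` score exceeds a random `neg` score.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohens_d(a, b):
    """Difference of means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Toy scores, invented for illustration only
resolved = [0.8, 0.7, 0.9, 0.6]
failed = [0.4, 0.5, 0.7, 0.3]

print(f"AUC={auc(resolved, failed):.3f}  d={cohens_d(resolved, failed):+.3f}")
```

An AUC of 0.5 means the score carries no signal; 1.0 means every session in the first group outscores every session in the second.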

Cross-model validation, run without retraining, shows that full-session AUC stays between 0.776 and 0.816 across model sizes, while AUC at step 10 declines as model size grows:

| Model | AUC @ step 10 | AUC @ full |
| --- | --- | --- |
| Llama 8B | 0.730 | 0.816 |
| Llama 70B | 0.668 | 0.778 |
| Llama 405B | 0.584 | 0.776 |

Operational Mechanism

CAUM processes agent activity through:

Agent Steps → SBERT Embeddings → Trajectory Analysis → Regime Classification → UDS Score + Attestation

It relies on five structural signals:

  • TCR: Tool Coherence Ratio - Assessing tool diversity.
  • ESR: Execution Substance Ratio - Measuring meaningfulness of steps.
  • SCI: Structural Coherence Index - Evaluating trajectory progress.
  • ZT Similarity: Zero-Trust cosine similarity between consecutive steps.
  • Regime: Classifying activity into EXPLORER, GRIND, STAGNATION, or LOOP.
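To make two of these signals concrete, here is a rough sketch of cosine similarity between consecutive step embeddings and a simple tool diversity ratio. It assumes the embeddings are plain lists of floats (stand-ins for SBERT vectors). The exact formulas CAUM uses for TCR and regime classification are not given here, so `tool_diversity`, `looks_like_loop`, and both thresholds are hypothetical illustrations.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def consecutive_similarities(embeddings):
    """Similarity between each step and the step that follows it."""
    return [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]

def tool_diversity(tools):
    """Distinct tools divided by total tool calls (assumed stand-in for a diversity signal)."""
    return len(set(tools)) / len(tools) if tools else 0.0

def looks_like_loop(embeddings, tools, sim_threshold=0.95, diversity_threshold=0.5):
    """Hypothetical heuristic: near-identical consecutive steps and few distinct tools."""
    sims = consecutive_similarities(embeddings)
    if not sims:
        return False
    return min(sims) >= sim_threshold and tool_diversity(tools) <= diversity_threshold
```

Because the inputs are only embeddings and tool names, a check of this kind never needs to read what the prompts or outputs actually say.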

Example Output

{
  "uds": 0.73,           # Unified Dynamic Score [0-1]
  "tier": "T2",          # Tier level, T1 (healthy) → T5 (critical)
  "dominant_regime": "EXPLORER",
  "shields_fired": 0,
  "advisory": {"level": 0, "level_name": "CLEAR"}
}
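A report like this can be consumed by a small triage helper. The sketch below assumes the field names shown in the example output; the escalation policy (`max_tier=3`, any shield fired, any non-zero advisory level) is a hypothetical choice, not a documented CAUM rule.

```python
def tier_number(report):
    """Return the numeric part of a tier label such as "T2"."""
    return int(report["tier"].lstrip("T"))

def needs_attention(report, max_tier=3):
    # Hypothetical policy: escalate on a high tier, a fired shield,
    # or any advisory level above 0 (CLEAR).
    return (
        tier_number(report) > max_tier
        or report["shields_fired"] > 0
        or report["advisory"]["level"] > 0
    )

# The example report from above, as a Python dict
report = {
    "uds": 0.73,
    "tier": "T2",
    "dominant_regime": "EXPLORER",
    "shields_fired": 0,
    "advisory": {"level": 0, "level_name": "CLEAR"},
}

print(needs_attention(report))
```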

Utilization Scenarios

Use CAUM to:

  1. Analyze agents' step-by-step actions in real time.
  2. Score JSONL files directly for batch analysis.
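For the batch case, a minimal sketch of scoring a JSONL file could look like the following. It assumes one step per line with `tool` and `content` keys (the same keys as the REST payload) and an auditor object exposing `push` and `finalize`. The helper `score_jsonl` and the file name are hypothetical, and the actual JSONL schema CAUM expects may differ.

```python
import json

def score_jsonl(lines, auditor):
    """Push each JSONL step into an auditor-like object and return its final report.

    `lines` is any iterable of strings, such as an open file.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        step = json.loads(line)
        auditor.push(step["tool"], step["content"])
    return auditor.finalize()

# Usage (assumes caum_monitor_v10 is installed and session.jsonl exists):
#
#   from caum_monitor_v10 import ZeroTrustAuditor
#   with open("session.jsonl", encoding="utf-8") as f:
#       report = score_jsonl(f, ZeroTrustAuditor())
```

Passing the auditor in as an argument keeps the helper easy to test with a stand-in object.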

Quick Examples

To analyze steps:

from caum_monitor_v10 import ZeroTrustAuditor

auditor = ZeroTrustAuditor()

# Each push records one agent step: the tool name, then the step content
auditor.push("file_search", "find authentication.py in /src")
auditor.push("file_reader", "reading JWT validation at line 42")
auditor.push("code_editor", "fixed algorithm parameter in jwt.decode")
auditor.push("test_runner", "pytest — 12 passed in 1.2s")
auditor.push("submit", "PR #142 submitted")

# finalize() closes the session and returns the report
report = auditor.finalize()
print(f"UDS={report['uds']} tier={report['tier']} regime={report['dominant_regime']}")

REST API Request Example

curl -X POST https://caum-observation-production.up.railway.app/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"trajectory": [{"tool": "search", "content": "..."}, ...]}'
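The same request can be sent from Python using only the standard library. This is a sketch that reuses the endpoint and payload shape from the curl example; the helper names are hypothetical, and treating the response as JSON is an assumption.

```python
import json
import urllib.request

API_URL = "https://caum-observation-production.up.railway.app/v1/analyze"

def build_analyze_request(steps, url=API_URL):
    """Build a POST request whose body is {"trajectory": steps}."""
    body = json.dumps({"trajectory": steps}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze(steps, timeout=30):
    """Send the trajectory and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_analyze_request(steps), timeout=timeout) as resp:
        return json.load(resp)
```

Splitting request construction from sending makes it possible to inspect the payload without touching the network.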

Behavioral Regimes Overview

CAUM categorizes agent behavior into various regimes:

| Regime | Resolve Rate | Description |
| --- | --- | --- |
| EXPLORER | 67.8% | Indicates healthy agent progress |
| WARMING_UP | 56.6% | Early data stages with room for improvement |
| GRIND | 36.9% | Warning of over-repetition while still advancing |
| LOOP | 11.3% | High chance of failure, signaling critical issues |
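The resolve rates translate directly into failure probabilities: a LOOP resolve rate of 11.3% corresponds to the 88.7% failure probability reported under Performance Insights. A short worked example, with the rates copied from the table and a hypothetical helper name:

```python
# Resolve rates from the regime table, as fractions
RESOLVE_RATE = {
    "EXPLORER": 0.678,
    "WARMING_UP": 0.566,
    "GRIND": 0.369,
    "LOOP": 0.113,
}

def failure_probability(regime):
    """Failure probability is the complement of the resolve rate."""
    return round(1.0 - RESOLVE_RATE[regime], 3)

# Regimes ordered from riskiest to safest
by_risk = sorted(RESOLVE_RATE, key=failure_probability, reverse=True)

print(failure_probability("LOOP"))  # 0.887
print(by_risk)
```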

Deployment Flexibility

Three deployment modes are available:

  • Forensic: Batch analysis of past sessions for compliance.
  • Live API: Real-time monitoring with alerting.
  • Enterprise SDK: Integrated within existing infrastructure with no data transfer.

For further exploration or to test capabilities, visit caum.systems/upload for a free structural observation report.
