Evaluators: Quality Assessment and Feedback

Overview

Evaluators are non-blocking quality-assessment mechanisms that score the output of each node execution. Unlike PolicyCheckers (which block execution) and PolicyRouters (which control flow), evaluators run after a node executes and produce scores, feedback, and quality metrics without interrupting the workflow.

Evaluators are inspired by Reinforcement Learning (RL) reward functions, where each node execution receives a quality score that can be used for:

  • Monitoring: Track agent performance over time
  • Self-improvement: Provide feedback for reflection and refinement
  • Training: Generate reward signals for RL-based optimization
  • Debugging: Identify low-quality outputs for analysis
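
Concretely, each evaluator returns an EvaluationResult carrying a normalized score, a pass/fail flag, and human-readable feedback. A minimal illustration of the fields used throughout this page (the values shown are only examples):

from arbiteros_alpha.evaluation import EvaluationResult

result = EvaluationResult(
    score=0.85,    # quality score, typically normalized to 0.0-1.0
    passed=True,   # whether the output met the evaluator's bar
    feedback="confidence=0.85 (threshold=0.7)"
)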

Key Characteristics

Non-Blocking Execution

Evaluators never interrupt or block execution, even when they detect low-quality outputs:

# Even if an evaluator fails or reports a low score, execution continues
result = node_function(state)
evaluations = run_evaluators(history)  # Never raises exceptions
# Execution proceeds regardless of evaluation results

Post-Execution Timing

Evaluators run after a node completes; the full per-node sequence (sketched in pseudocode after this list) is:

1. Pre-execution: PolicyChecker validates preconditions
2. Execution: Node function runs
3. Post-execution: Evaluators assess quality ← HERE
4. Routing: PolicyRouter determines next step
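
In pseudocode, that sequence looks roughly like this (the helper names here are illustrative, not the library's actual internals):

# Conceptual sketch of the per-node lifecycle (illustrative helper names)
run_policy_checkers(history, state)                     # 1. pre-execution checks, may block
output_state = node_function(state)                     # 2. the node itself runs
run_evaluators(history)                                 # 3. quality scores recorded, never blocks
next_node = run_policy_routers(history, output_state)   # 4. may redirect the flow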

Instruction Type Filtering

Evaluators can target specific instruction types using target_instructions:

# Only evaluate GENERATE nodes
evaluator = ThresholdEvaluator(
    name="confidence_check",
    key="confidence",
    threshold=0.7,
    target_instructions=[CognitiveCore.GENERATE]
)

# Evaluate multiple instruction types
evaluator = QualityEvaluator(
    name="quality",
    target_instructions=[CognitiveCore.GENERATE, CognitiveCore.REFLECT]
)

# Evaluate ALL nodes (default)
evaluator = UniversalEvaluator(name="universal")  # target_instructions=None

Comparison with Policy Components

Component       Timing           Blocks Execution?          Purpose              Output
PolicyChecker   Pre-execution    ✓ Yes                      Enforce constraints  Pass/Fail + violation message
PolicyRouter    Post-execution   ✗ No (but controls flow)   Dynamic routing      Target node ID or None
Evaluator       Post-execution   ✗ No                       Quality assessment   Score + feedback

Built-in Evaluators

ThresholdEvaluator

Checks if a numeric value in the output state meets a threshold:

from arbiteros_alpha import ThresholdEvaluator
from arbiteros_alpha.instructions import CognitiveCore

evaluator = ThresholdEvaluator(
    name="confidence_check",
    key="confidence",           # Key in output_state to check
    threshold=0.7,              # Minimum acceptable value
    target_instructions=[CognitiveCore.GENERATE]
)

arbiter_os.add_evaluator(evaluator)

Behavior:

  • If output_state["confidence"] >= 0.7: passed=True, score=confidence
  • If output_state["confidence"] < 0.7: passed=False, score=confidence
  • If the key is missing: score=0.0, passed=False
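
For example, with the evaluator configured above, an output state of {"confidence": 0.85} would produce roughly the following result (the exact feedback wording is illustrative):

EvaluationResult(
    score=0.85,
    passed=True,
    feedback="confidence=0.85 (✓ threshold=0.7)"
)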

Creating Custom Evaluators

Extend NodeEvaluator and implement the evaluate() method:

from arbiteros_alpha.evaluation import NodeEvaluator, EvaluationResult
from arbiteros_alpha.instructions import CognitiveCore

class ResponseLengthEvaluator(NodeEvaluator):
    """Evaluates response quality based on length."""

    def __init__(self, min_length: int = 50):
        super().__init__(
            name="response_length",
            target_instructions=[CognitiveCore.GENERATE]  # Only GENERATE
        )
        self.min_length = min_length

    def evaluate(self, history) -> EvaluationResult:
        # Access the most recent node execution
        current_item = history.entries[-1][-1]

        # Extract output
        response = current_item.output_state.get("response", "")
        length = len(response)

        # Calculate score (0.0 to 1.0)
        score = min(length / 100, 1.0)

        # Determine pass/fail
        passed = length >= self.min_length

        # Provide feedback
        feedback = f"Response length: {length} chars"

        return EvaluationResult(
            score=score,
            passed=passed,
            feedback=feedback
        )
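
Register a custom evaluator the same way as the built-in ones:

arbiter_os.add_evaluator(ResponseLengthEvaluator(min_length=80))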

Advanced: History-Aware Evaluators

Evaluators have access to the full execution history:

class ConsistencyEvaluator(NodeEvaluator):
    """Checks consistency across multiple responses."""

    def __init__(self):
        super().__init__(
            name="consistency_check",
            target_instructions=[CognitiveCore.GENERATE]
        )

    def evaluate(self, history) -> EvaluationResult:
        # Access all previous executions
        all_entries = [
            item for superstep in history.entries
            for item in superstep
        ]

        # Find all previous GENERATE nodes
        previous_generates = [
            item for item in all_entries[:-1]  # Exclude current
            if item.instruction == CognitiveCore.GENERATE
        ]

        if not previous_generates:
            return EvaluationResult(
                score=1.0,
                passed=True,
                feedback="First response, nothing to compare"
            )

        # Compare with previous responses
        current = all_entries[-1].output_state
        previous = previous_generates[-1].output_state

        # Check tone consistency
        current_tone = current.get("tone", "")
        previous_tone = previous.get("tone", "")

        consistent = current_tone == previous_tone
        score = 1.0 if consistent else 0.5

        return EvaluationResult(
            score=score,
            passed=consistent,
            feedback=f"Tone consistency: {'consistent' if consistent else 'inconsistent'}"
        )

Integration with History

Evaluation results are automatically stored in HistoryItem.evaluation_results:

# Execute node with evaluators
arbiter_os.add_evaluator(confidence_evaluator)
arbiter_os.add_evaluator(quality_evaluator)

state = {"query": "test"}
result = arbiter_os.execute(state, generate_fn)

# Access evaluation results from history
last_item = arbiter_os.history.entries[-1][-1]
evaluations = last_item.evaluation_results

for evaluator_name, eval_result in evaluations.items():
    print(f"{evaluator_name}: score={eval_result.score:.2f}, "
          f"passed={eval_result.passed}")

Pretty-Printed History

Use History.pprint() to display evaluations:

arbiter_os.history.pprint()

Output:

╔═══ SuperStep 1 ═══╗
  [1.1] GENERATE
    Evaluations:
      ✓ confidence_check: score=0.90 - confidence=0.90 (✓ threshold=0.7)
      ✓ response_quality: score=0.66 - Quality assessment: length=82 chars
      ✗ consistency_check: score=0.50 - Tone inconsistent with previous

Use Cases

1. Monitoring Agent Performance

Track quality metrics across conversations:

class PerformanceTracker(NodeEvaluator):
    def __init__(self):
        super().__init__(name="performance")
        self.scores = []

    def evaluate(self, history):
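        # calculate_quality is a user-supplied scoring helper, not part of the library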
        score = calculate_quality(history.entries[-1][-1])
        self.scores.append(score)

        avg_score = sum(self.scores) / len(self.scores)

        return EvaluationResult(
            score=score,
            passed=score > 0.6,
            feedback=f"Current: {score:.2f}, Average: {avg_score:.2f}"
        )

2. Self-Reflection Triggers

Combine evaluators with routers for automatic reflection:

# Evaluator assesses quality
quality_evaluator = ThresholdEvaluator(
    name="quality_check",
    key="quality_score",
    threshold=0.7,
    target_instructions=[CognitiveCore.GENERATE]
)

# Router triggers reflection on low quality
class ReflectionRouter(PolicyRouter):
    def route_after(self, history, current_output):
        last_item = history.entries[-1][-1]
        quality_eval = last_item.evaluation_results.get("quality_check")

        if quality_eval and not quality_eval.passed:
            return "reflect_node"  # Trigger reflection
        return None  # Continue normal flow

arbiter_os.add_evaluator(quality_evaluator)
arbiter_os.add_policy_router(ReflectionRouter(name="auto_reflect"))
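Because evaluators run before routing (step 3 precedes step 4 in the execution order above), the router can read the stored evaluation result from the last HistoryItem rather than re-computing the quality score.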

3. RL Training Signal Generation

Generate reward signals for reinforcement learning:

class RewardEvaluator(NodeEvaluator):
    """Generates RL rewards for training."""

    def __init__(self):
        super().__init__(name="rl_reward")
        self.episode_rewards = []

    def evaluate(self, history) -> EvaluationResult:
        current = history.entries[-1][-1]

        # Multi-factor reward calculation
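        # (_check_correctness, _check_efficiency, and _check_safety are
        #  user-implemented helpers, each expected to return a value in 0.0-1.0)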
        factors = {
            "correctness": self._check_correctness(current),
            "efficiency": self._check_efficiency(current),
            "safety": self._check_safety(current)
        }

        # Weighted reward
        reward = (
            0.5 * factors["correctness"] +
            0.3 * factors["efficiency"] +
            0.2 * factors["safety"]
        )

        self.episode_rewards.append(reward)

        return EvaluationResult(
            score=reward,
            passed=reward > 0.5,
            feedback=f"Reward: {reward:.2f} | Factors: {factors}"
        )

    def get_episode_reward(self) -> float:
        """Get cumulative reward for training."""
        return sum(self.episode_rewards)
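
After an episode, the cumulative reward can be read back from the evaluator instance (a sketch; the surrounding training loop is application-specific):

reward_evaluator = RewardEvaluator()
arbiter_os.add_evaluator(reward_evaluator)

# ... execute the workflow for one episode ...

episode_reward = reward_evaluator.get_episode_reward()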

Best Practices

1. Fail Gracefully

Always handle missing keys and exceptions:

def evaluate(self, history) -> EvaluationResult:
    try:
        current = history.entries[-1][-1]
        value = current.output_state.get("key", 0.0)  # Default value

        # ... evaluation logic ...

    except Exception as e:
        # Never crash - return neutral evaluation
        return EvaluationResult(
            score=0.0,
            passed=False,
            feedback=f"Evaluation error: {str(e)}"
        )

2. Use Instruction Filtering

Only evaluate nodes that produce relevant outputs:

# ✗ Bad: Evaluates all nodes, including VERIFY (may not have "response")
class ResponseEvaluator(NodeEvaluator):
    def __init__(self):
        super().__init__(name="response_eval")  # No filtering!

# ✓ Good: Only evaluates GENERATE nodes
class ResponseEvaluator(NodeEvaluator):
    def __init__(self):
        super().__init__(
            name="response_eval",
            target_instructions=[CognitiveCore.GENERATE]  # Filtered!
        )

3. Provide Actionable Feedback

Make feedback useful for debugging and improvement:

# ✗ Bad: Vague feedback
feedback = "Low quality"

# ✓ Good: Specific, actionable feedback
feedback = (
    f"Quality issues: length={len(response)} chars (min=50), "
    f"confidence={conf:.2f} (threshold=0.7), "
    f"tone={tone} (expected='formal')"
)

4. Normalize Scores

Keep scores in the 0.0-1.0 range for consistency:

def evaluate(self, history) -> EvaluationResult:
    current = history.entries[-1][-1]
    raw_value = current.output_state.get("metric", 0)

    # Normalize to 0.0-1.0
    score = max(0.0, min(1.0, raw_value / 100))

    return EvaluationResult(score=score, ...)

API Reference

For detailed API documentation, see: