Evaluators: Quality Assessment and Feedback

Overview

Evaluators are non-blocking quality-assessment mechanisms that score the output of each node execution. Unlike PolicyCheckers (which block execution) and PolicyRouters (which control flow), evaluators run after a node executes and produce scores, feedback, and quality metrics without interrupting the workflow.

Evaluators are inspired by Reinforcement Learning (RL) reward functions, where each node execution receives a quality score that can be used for:

  • Monitoring: Track agent performance over time
  • Self-improvement: Provide feedback for reflection and refinement
  • Training: Generate reward signals for RL-based optimization
  • Debugging: Identify low-quality outputs for analysis
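
Concretely, each evaluator returns an EvaluationResult carrying a normalized score, a pass/fail flag, and human-readable feedback. A minimal illustration of the fields used throughout this page (the values shown are only examples):

from arbiteros_alpha.evaluation import EvaluationResult

result = EvaluationResult(
    score=0.85,    # quality score, typically normalized to 0.0-1.0
    passed=True,   # whether the output met the evaluator's bar
    feedback="confidence=0.85 (threshold=0.7)"
)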

Key Characteristics

Non-Blocking Execution

Evaluators never interrupt or block execution, even when they detect low-quality outputs:

# Even if an evaluator fails or reports a low score, execution continues
result = node_function(state)
evaluations = run_evaluators(history)  # Never raises exceptions
# Execution proceeds regardless of evaluation results

Post-Execution Timing

Evaluators run after a node completes; the full per-node sequence (sketched in pseudocode after this list) is:

1. Pre-execution: PolicyChecker validates preconditions
2. Execution: Node function runs
3. Post-execution: Evaluators assess quality ← HERE
4. Routing: PolicyRouter determines next step
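
In pseudocode, that sequence looks roughly like this (the helper names here are illustrative, not the library's actual internals):

# Conceptual sketch of the per-node lifecycle (illustrative helper names)
run_policy_checkers(history, state)                     # 1. pre-execution checks, may block
output_state = node_function(state)                     # 2. the node itself runs
run_evaluators(history)                                 # 3. quality scores recorded, never blocks
next_node = run_policy_routers(history, output_state)   # 4. may redirect the flow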

Instruction Type Filtering

Evaluators can target specific instruction types using target_instructions:

# Only evaluate GENERATE nodes
evaluator = ThresholdEvaluator(
    name="confidence_check",
    key="confidence",
    threshold=0.7,
    target_instructions=[CognitiveCore.GENERATE]
)

# Evaluate multiple instruction types
evaluator = QualityEvaluator(
    name="quality",
    target_instructions=[CognitiveCore.GENERATE, CognitiveCore.REFLECT]
)

# Evaluate ALL nodes (default)
evaluator = UniversalEvaluator(name="universal")  # target_instructions=None

Comparison with Policy Components

Component       Timing           Blocks Execution?          Purpose              Output
PolicyChecker   Pre-execution    ✓ Yes                      Enforce constraints  Pass/Fail + violation message
PolicyRouter    Post-execution   ✗ No (but controls flow)   Dynamic routing      Target node ID or None
Evaluator       Post-execution   ✗ No                       Quality assessment   Score + feedback

Built-in Evaluators

ThresholdEvaluator

Checks if a numeric value in the output state meets a threshold:

from arbiteros_alpha import ThresholdEvaluator
from arbiteros_alpha.instructions import CognitiveCore

evaluator = ThresholdEvaluator(
    name="confidence_check",
    key="confidence",           # Key in output_state to check
    threshold=0.7,              # Minimum acceptable value
    target_instructions=[CognitiveCore.GENERATE]
)

arbiter_os.add_evaluator(evaluator)

Behavior:

  • If output_state["confidence"] >= 0.7: passed=True, score=confidence
  • If output_state["confidence"] < 0.7: passed=False, score=confidence
  • If the key is missing: score=0.0, passed=False
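
For example, with the evaluator configured above, an output state of {"confidence": 0.85} would produce roughly the following result (the exact feedback wording is illustrative):

EvaluationResult(
    score=0.85,
    passed=True,
    feedback="confidence=0.85 (✓ threshold=0.7)"
)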

Creating Custom Evaluators

Extend NodeEvaluator and implement the evaluate() method:

from arbiteros_alpha.evaluation import NodeEvaluator, EvaluationResult
from arbiteros_alpha.instructions import CognitiveCore

class ResponseLengthEvaluator(NodeEvaluator):
    """Evaluates response quality based on length."""

    def __init__(self, min_length: int = 50):
        super().__init__(
            name="response_length",
            target_instructions=[CognitiveCore.GENERATE]  # Only GENERATE
        )
        self.min_length = min_length

    def evaluate(self, history) -> EvaluationResult:
        # Access the most recent node execution
        current_item = history.entries[-1][-1]

        # Extract output
        response = current_item.output_state.get("response", "")
        length = len(response)

        # Calculate score (0.0 to 1.0)
        score = min(length / 100, 1.0)

        # Determine pass/fail
        passed = length >= self.min_length

        # Provide feedback
        feedback = f"Response length: {length} chars"

        return EvaluationResult(
            score=score,
            passed=passed,
            feedback=feedback
        )
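
Register a custom evaluator the same way as the built-in ones:

arbiter_os.add_evaluator(ResponseLengthEvaluator(min_length=80))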

Advanced: History-Aware Evaluators

Evaluators have access to the full execution history:

class ConsistencyEvaluator(NodeEvaluator):
    """Checks consistency across multiple responses."""

    def __init__(self):
        super().__init__(
            name="consistency_check",
            target_instructions=[CognitiveCore.GENERATE]
        )

    def evaluate(self, history) -> EvaluationResult:
        # Access all previous executions
        all_entries = [
            item for superstep in history.entries
            for item in superstep
        ]

        # Find all previous GENERATE nodes
        previous_generates = [
            item for item in all_entries[:-1]  # Exclude current
            if item.instruction == CognitiveCore.GENERATE
        ]

        if not previous_generates:
            return EvaluationResult(
                score=1.0,
                passed=True,
                feedback="First response, nothing to compare"
            )

        # Compare with previous responses
        current = all_entries[-1].output_state
        previous = previous_generates[-1].output_state

        # Check tone consistency
        current_tone = current.get("tone", "")
        previous_tone = previous.get("tone", "")

        consistent = current_tone == previous_tone
        score = 1.0 if consistent else 0.5

        return EvaluationResult(
            score=score,
            passed=consistent,
            feedback=f"Tone consistency: {'consistent' if consistent else 'inconsistent'}"
        )

Integration with History

Evaluation results are automatically stored in HistoryItem.evaluation_results:

# Execute node with evaluators
arbiter_os.add_evaluator(confidence_evaluator)
arbiter_os.add_evaluator(quality_evaluator)

state = {"query": "test"}
result = arbiter_os.execute(state, generate_fn)

# Access evaluation results from history
last_item = arbiter_os.history.entries[-1][-1]
evaluations = last_item.evaluation_results

for evaluator_name, eval_result in evaluations.items():
    print(f"{evaluator_name}: score={eval_result.score:.2f}, "
          f"passed={eval_result.passed}")

Pretty-Printed History

Use History.pprint() to display evaluations:

arbiter_os.history.pprint()

Output:

╔═══ SuperStep 1 ═══╗
  [1.1] GENERATE
    Evaluations:
      ✓ confidence_check: score=0.90 - confidence=0.90 (✓ threshold=0.7)
      ✓ response_quality: score=0.66 - Quality assessment: length=82 chars
      ✗ consistency_check: score=0.50 - Tone inconsistent with previous

Use Cases

1. Monitoring Agent Performance

Track quality metrics across conversations:

class PerformanceTracker(NodeEvaluator):
    def __init__(self):
        super().__init__(name="performance")
        self.scores = []

    def evaluate(self, history):
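        # calculate_quality is a user-supplied scoring helper, not part of the library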
        score = calculate_quality(history.entries[-1][-1])
        self.scores.append(score)

        avg_score = sum(self.scores) / len(self.scores)

        return EvaluationResult(
            score=score,
            passed=score > 0.6,
            feedback=f"Current: {score:.2f}, Average: {avg_score:.2f}"
        )

2. Self-Reflection Triggers

Combine evaluators with routers for automatic reflection:

# Evaluator assesses quality
quality_evaluator = ThresholdEvaluator(
    name="quality_check",
    key="quality_score",
    threshold=0.7,
    target_instructions=[CognitiveCore.GENERATE]
)

# Router triggers reflection on low quality
class ReflectionRouter(PolicyRouter):
    def route_after(self, history, current_output):
        last_item = history.entries[-1][-1]
        quality_eval = last_item.evaluation_results.get("quality_check")

        if quality_eval and not quality_eval.passed:
            return "reflect_node"  # Trigger reflection
        return None  # Continue normal flow

arbiter_os.add_evaluator(quality_evaluator)
arbiter_os.add_policy_router(ReflectionRouter(name="auto_reflect"))
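Because evaluators run before routing (step 3 precedes step 4 in the execution order above), the router can read the stored evaluation result from the last HistoryItem rather than re-computing the quality score.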

3. RL Training Signal Generation

Generate reward signals for reinforcement learning:

class RewardEvaluator(NodeEvaluator):
    """Generates RL rewards for training."""

    def __init__(self):
        super().__init__(name="rl_reward")
        self.episode_rewards = []

    def evaluate(self, history) -> EvaluationResult:
        current = history.entries[-1][-1]

        # Multi-factor reward calculation
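        # (_check_correctness, _check_efficiency, and _check_safety are
        #  user-implemented helpers, each expected to return a value in 0.0-1.0)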
        factors = {
            "correctness": self._check_correctness(current),
            "efficiency": self._check_efficiency(current),
            "safety": self._check_safety(current)
        }

        # Weighted reward
        reward = (
            0.5 * factors["correctness"] +
            0.3 * factors["efficiency"] +
            0.2 * factors["safety"]
        )

        self.episode_rewards.append(reward)

        return EvaluationResult(
            score=reward,
            passed=reward > 0.5,
            feedback=f"Reward: {reward:.2f} | Factors: {factors}"
        )

    def get_episode_reward(self) -> float:
        """Get cumulative reward for training."""
        return sum(self.episode_rewards)
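
After an episode, the cumulative reward can be read back from the evaluator instance (a sketch; the surrounding training loop is application-specific):

reward_evaluator = RewardEvaluator()
arbiter_os.add_evaluator(reward_evaluator)

# ... execute the workflow for one episode ...

episode_reward = reward_evaluator.get_episode_reward()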

Best Practices

1. Fail Gracefully

Always handle missing keys and exceptions:

def evaluate(self, history) -> EvaluationResult:
    try:
        current = history.entries[-1][-1]
        value = current.output_state.get("key", 0.0)  # Default value

        # ... evaluation logic ...

    except Exception as e:
        # Never crash - return neutral evaluation
        return EvaluationResult(
            score=0.0,
            passed=False,
            feedback=f"Evaluation error: {str(e)}"
        )

2. Use Instruction Filtering

Only evaluate nodes that produce relevant outputs:

# ✗ Bad: Evaluates all nodes, including VERIFY (may not have "response")
class ResponseEvaluator(NodeEvaluator):
    def __init__(self):
        super().__init__(name="response_eval")  # No filtering!

# ✓ Good: Only evaluates GENERATE nodes
class ResponseEvaluator(NodeEvaluator):
    def __init__(self):
        super().__init__(
            name="response_eval",
            target_instructions=[CognitiveCore.GENERATE]  # Filtered!
        )

3. Provide Actionable Feedback

Make feedback useful for debugging and improvement:

# ✗ Bad: Vague feedback
feedback = "Low quality"

# ✓ Good: Specific, actionable feedback
feedback = (
    f"Quality issues: length={len(response)} chars (min=50), "
    f"confidence={conf:.2f} (threshold=0.7), "
    f"tone={tone} (expected='formal')"
)

4. Normalize Scores

Keep scores in the 0.0-1.0 range for consistency:

def evaluate(self, history) -> EvaluationResult:
    current = history.entries[-1][-1]
    raw_value = current.output_state.get("metric", 0)

    # Normalize to 0.0-1.0
    score = max(0.0, min(1.0, raw_value / 100))

    return EvaluationResult(score=score, ...)

API Reference

For detailed API documentation, see: