
Evaluation API Reference

This page contains the auto-generated API documentation for the evaluation module.

The evaluation components are implemented in arbiteros_alpha.evaluation and re-exported from the package root for convenience.

EvaluationResult

EvaluationResult(score, passed, feedback, metadata=dict()) dataclass

Result of evaluating a node execution.

Attributes:

score (float): Quality score in range [0.0, 1.0], where higher is better.
passed (bool): Whether the execution passed the evaluation criteria.
feedback (str): Human-readable explanation of the evaluation.
metadata (dict[str, Any]): Additional data for logging, RL training, etc.
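
For illustration, a result can be constructed directly from these fields (the values below are made up for this sketch):

from arbiteros_alpha.evaluation import EvaluationResult

result = EvaluationResult(
    score=0.9,
    passed=True,
    feedback="Output contained all required fields",
    metadata={"source": "docs-example"},  # arbitrary extra data for logging
)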

NodeEvaluator

NodeEvaluator(name, target_instructions=None)

Bases: ABC

Base class for node execution evaluators.

Evaluators assess the quality of node executions after they complete. Unlike PolicyCheckers, evaluators do not block execution; they provide feedback and scores that can be used for monitoring, RL training, or self-improvement.

The evaluate() method receives the complete execution history, with the current node as the last entry (history.entries[-1][-1]).

Attributes:

name: Unique identifier for this evaluator.
target_instructions: Optional list of instruction types to evaluate. If None, evaluates all nodes. If specified, only evaluates nodes matching the listed instruction types.

Example
from arbiteros_alpha.evaluation import EvaluationResult, NodeEvaluator
from arbiteros_alpha.instructions import CognitiveCore

class ResponseLengthEvaluator(NodeEvaluator):
    def __init__(self):
        super().__init__(
            name="response_length",
            target_instructions=[CognitiveCore.GENERATE]
        )

    def evaluate(self, history):
        current = history.entries[-1][-1]  # Get most recent node
        response = current.output_state.get("response", "")
        score = min(len(response) / 100, 1.0)
        return EvaluationResult(
            score=score,
            passed=score > 0.5,
            feedback=f"Response length: {len(response)} chars"
        )

Initialize the evaluator.

Parameters:

name (str, required): Unique identifier for this evaluator.
target_instructions (list, default None): Optional list of InstructionType enums to evaluate. If None (default), evaluates all node types. If provided, only evaluates nodes with matching instruction types.
Source code in arbiteros_alpha/evaluation.py
def __init__(self, name: str, target_instructions: list = None):
    """Initialize the evaluator.

    Args:
        name: Unique identifier for this evaluator.
        target_instructions: Optional list of InstructionType enums to evaluate.
            If None (default), evaluates all node types.
            If provided, only evaluates nodes with matching instruction types.
    """
    self.name = name
    self.target_instructions = target_instructions

evaluate(history) abstractmethod

Evaluate the most recent node execution.

This method is called after a node completes execution. The node's HistoryItem (including output_state) has already been added to history and can be accessed via history.entries[-1][-1].

Parameters:

history (History, required): Complete execution history. The current node is the last entry.

Returns:

EvaluationResult containing score, pass/fail status, and feedback.

Note
  • Simple evaluators can just examine the current node's input/output
  • Complex evaluators can traverse the full history for context
  • Evaluators should not raise exceptions; return low scores instead
Source code in arbiteros_alpha/evaluation.py
@abstractmethod
def evaluate(self, history: "History") -> EvaluationResult:
    """Evaluate the most recent node execution.

    This method is called after a node completes execution. The node's
    HistoryItem (including output_state) has already been added to history
    and can be accessed via `history.entries[-1][-1]`.

    Args:
        history: Complete execution history. The current node is the last entry.

    Returns:
        EvaluationResult containing score, pass/fail status, and feedback.

    Note:
        - Simple evaluators can just examine the current node's input/output
        - Complex evaluators can traverse the full history for context
        - Evaluators should not raise exceptions; return low scores instead
    """
    pass
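
As a sketch of the "return low scores instead of raising" guidance above, a subclass might wrap its scoring logic in a try/except and convert failures into failing results. The "response" key and the JSON-parsing criterion below are illustrative assumptions, not part of the library:

import json

from arbiteros_alpha.evaluation import EvaluationResult, NodeEvaluator

class JsonOutputEvaluator(NodeEvaluator):
    """Hypothetical evaluator that reports errors instead of raising."""

    def evaluate(self, history):
        current = history.entries[-1][-1]  # current node is the last entry
        try:
            payload = json.loads(current.output_state.get("response", ""))
            score = 1.0 if isinstance(payload, dict) else 0.5
            return EvaluationResult(
                score=score,
                passed=score >= 0.5,
                feedback="Response parsed as JSON",
            )
        except Exception as exc:
            # Per the note above: never raise; report a low score instead.
            return EvaluationResult(
                score=0.0,
                passed=False,
                feedback=f"Evaluation failed: {exc}",
                metadata={"error": str(exc)},
            )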

ThresholdEvaluator

ThresholdEvaluator(name, key, threshold, target_instructions=None)

Bases: NodeEvaluator

Simple evaluator that checks if a metric exceeds a threshold.

This evaluator examines a specific key in the output state and compares its value against a threshold. Useful for basic quality checks.

Attributes:

name: Inherited from NodeEvaluator.
key: The state key to evaluate.
threshold: Minimum value for the evaluation to pass.

Example
from arbiteros_alpha.evaluation import ThresholdEvaluator

evaluator = ThresholdEvaluator(
    name="confidence_check",
    key="confidence",
    threshold=0.7
)
# Will check if output_state["confidence"] >= 0.7

Initialize the threshold evaluator.

Parameters:

name (str, required): Unique identifier for this evaluator.
key (str, required): Key in output_state to evaluate.
threshold (float, required): Minimum value for passing evaluation.
target_instructions (list, default None): Optional list of instruction types to evaluate.
Source code in arbiteros_alpha/evaluation.py
def __init__(
    self, name: str, key: str, threshold: float, target_instructions: list = None
):
    """Initialize the threshold evaluator.

    Args:
        name: Unique identifier for this evaluator.
        key: Key in output_state to evaluate.
        threshold: Minimum value for passing evaluation.
        target_instructions: Optional list of instruction types to evaluate.
    """
    super().__init__(name, target_instructions)
    self.key = key
    self.threshold = threshold

evaluate(history)

Evaluate if the metric exceeds the threshold.

Parameters:

history (History, required): Complete execution history.

Returns:

EvaluationResult with score equal to the metric value, passing if value >= threshold.

Source code in arbiteros_alpha/evaluation.py
def evaluate(self, history: "History") -> EvaluationResult:
    """Evaluate if the metric exceeds the threshold.

    Args:
        history: Complete execution history.

    Returns:
        EvaluationResult with score equal to the metric value,
        passing if value >= threshold.
    """
    current_item = history.entries[-1][-1]
    value = current_item.output_state.get(self.key, 0.0)
    passed = value >= self.threshold

    return EvaluationResult(
        score=value,
        passed=passed,
        feedback=f"{self.key}={value:.2f} ({'✓' if passed else '✗'} threshold={self.threshold})",
        metadata={"key": self.key, "threshold": self.threshold, "value": value},
    )
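
To make the behavior concrete: if the most recent node's output_state were {"confidence": 0.85}, the confidence_check evaluator from the example above would return a result along these lines (sketched here without constructing a real History):

EvaluationResult(
    score=0.85,
    passed=True,
    feedback="confidence=0.85 (✓ threshold=0.7)",
    metadata={"key": "confidence", "threshold": 0.7, "value": 0.85},
)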