# Evaluation API Reference

This page contains the auto-generated API documentation for the evaluation module.

The evaluation components are implemented in `arbiteros_alpha.evaluation` and re-exported from the package root for convenience.
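As a quick orientation, a minimal import sketch based on the note above; both forms should resolve to the same classes if the package-root re-export works as described:

```python
# Import from the implementation module...
from arbiteros_alpha.evaluation import EvaluationResult, NodeEvaluator, ThresholdEvaluator

# ...or, equivalently, from the package-root re-exports.
from arbiteros_alpha import EvaluationResult, NodeEvaluator, ThresholdEvaluator
```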
## EvaluationResult

`EvaluationResult(score, passed, feedback, metadata=dict())`

*dataclass*

Result of evaluating a node execution.
Attributes:

| Name | Type | Description |
|---|---|---|
| `score` | `float` | Quality score in the range [0.0, 1.0], where higher is better. |
| `passed` | `bool` | Whether the execution passed the evaluation criteria. |
| `feedback` | `str` | Human-readable explanation of the evaluation. |
| `metadata` | `dict[str, Any]` | Additional data for logging, RL training, etc. |
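A minimal construction sketch, assuming the package-root re-export noted above; the feedback text and metadata key are illustrative:

```python
from arbiteros_alpha import EvaluationResult

result = EvaluationResult(
    score=0.85,                                # quality score in [0.0, 1.0]
    passed=True,                               # met the evaluation criteria
    feedback="Response covered all required points.",
    metadata={"evaluator": "coverage_check"},  # extra data for logging / RL training
)
```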
## NodeEvaluator

`NodeEvaluator(name, target_instructions=None)`

Bases: `ABC`

Base class for node execution evaluators.

Evaluators assess the quality of node executions after they complete. Unlike PolicyCheckers, evaluators do not block execution; they provide feedback and scores that can be used for monitoring, RL training, or self-improvement.
The `evaluate()` method receives the complete execution history, with the current node as the last entry (`history.entries[-1][-1]`).
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | | Unique identifier for this evaluator. |
| `target_instructions` | | Optional list of instruction types to evaluate. If None, evaluates all nodes. If specified, only evaluates nodes matching the listed instruction types. |
Example

```python
from arbiteros_alpha import EvaluationResult, NodeEvaluator
from arbiteros_alpha.instructions import CognitiveCore


class ResponseLengthEvaluator(NodeEvaluator):
    def __init__(self):
        super().__init__(
            name="response_length",
            target_instructions=[CognitiveCore.GENERATE],
        )

    def evaluate(self, history):
        current = history.entries[-1][-1]  # Get most recent node
        response = current.output_state.get("response", "")
        score = min(len(response) / 100, 1.0)
        return EvaluationResult(
            score=score,
            passed=score > 0.5,
            feedback=f"Response length: {len(response)} chars",
        )
```
Initialize the evaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique identifier for this evaluator. | *required* |
| `target_instructions` | `list` | Optional list of InstructionType enums to evaluate. If None (default), evaluates all node types. If provided, only evaluates nodes with matching instruction types. | `None` |
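For contrast with the targeted example above, a sketch of an evaluator that leaves `target_instructions` at its `None` default and therefore runs on every node; the class name and feedback strings are illustrative:

```python
from arbiteros_alpha import EvaluationResult, NodeEvaluator


class NonEmptyOutputEvaluator(NodeEvaluator):
    def __init__(self):
        # Leaving target_instructions at its None default: evaluate all node types.
        super().__init__(name="non_empty_output")

    def evaluate(self, history):
        current = history.entries[-1][-1]      # current node is the last entry
        has_output = bool(current.output_state)
        return EvaluationResult(
            score=1.0 if has_output else 0.0,
            passed=has_output,
            feedback="output_state is non-empty" if has_output else "output_state is empty",
        )
```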
### evaluate(history)

*abstractmethod*
Evaluate the most recent node execution.

This method is called after a node completes execution. The node's HistoryItem (including `output_state`) has already been added to history and can be accessed via `history.entries[-1][-1]`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `history` | `History` | Complete execution history. The current node is the last entry. | *required* |
Returns:

| Type | Description |
|---|---|
| `EvaluationResult` | EvaluationResult containing score, pass/fail status, and feedback. |
Note
- Simple evaluators can just examine the current node's input/output
- Complex evaluators can traverse the full history for context
- Evaluators should not raise exceptions; return low scores instead (see the sketch below)
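A hedged sketch of that last point: the evaluation is wrapped in a try/except so failures become a low-scoring result rather than a raised exception. The `confidence` state key and the 0.5 pass threshold are illustrative assumptions, not part of the framework.

```python
from arbiteros_alpha import EvaluationResult, NodeEvaluator


class SafeConfidenceEvaluator(NodeEvaluator):
    def __init__(self):
        super().__init__(name="safe_confidence")

    def evaluate(self, history):
        try:
            current = history.entries[-1][-1]
            value = float(current.output_state.get("confidence", 0.0))
        except Exception as exc:
            # Never let an evaluation error propagate; report it via the result.
            return EvaluationResult(score=0.0, passed=False,
                                    feedback=f"Evaluation error: {exc}")
        score = min(max(value, 0.0), 1.0)
        return EvaluationResult(score=score, passed=score >= 0.5,
                                feedback=f"confidence={value:.2f}")
```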
## ThresholdEvaluator

`ThresholdEvaluator(name, key, threshold, target_instructions=None)`

Bases: `NodeEvaluator`

Simple evaluator that checks if a metric exceeds a threshold.

This evaluator examines a specific key in the output state and compares its value against a threshold. Useful for basic quality checks.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | | Inherited from NodeEvaluator. |
| `key` | | The state key to evaluate. |
| `threshold` | | Minimum value for the evaluation to pass. |
Example
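A minimal usage sketch; the `confidence` key and the 0.7 threshold are illustrative choices, not framework requirements:

```python
from arbiteros_alpha.evaluation import ThresholdEvaluator

# Passes whenever output_state["confidence"] >= 0.7 on the evaluated nodes.
confidence_check = ThresholdEvaluator(
    name="confidence_check",
    key="confidence",   # illustrative state key
    threshold=0.7,
)
```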
Initialize the threshold evaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique identifier for this evaluator. | *required* |
| `key` | `str` | Key in output_state to evaluate. | *required* |
| `threshold` | `float` | Minimum value for passing evaluation. | *required* |
| `target_instructions` | `list` | Optional list of instruction types to evaluate. | `None` |
### evaluate(history)
Evaluate if the metric exceeds the threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `history` | `History` | Complete execution history. | *required* |
Returns:

| Type | Description |
|---|---|
| `EvaluationResult` | EvaluationResult with score equal to the metric value, passing if value >= threshold. |
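A short, hedged illustration of that contract; the numbers and feedback text are made up, and only the score and passed semantics follow from the description above:

```python
from arbiteros_alpha import EvaluationResult

value = 0.82        # metric read from output_state[key]
threshold = 0.7

# Documented behaviour: score mirrors the metric value; passed is (value >= threshold).
result = EvaluationResult(
    score=value,
    passed=value >= threshold,
    feedback=f"value={value} vs threshold={threshold}",
)
assert result.passed
```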