Complete Tutorial: AI Assistant with Policy Governance

This tutorial walks through a complete working example that demonstrates all key features of ArbiterOS-alpha. We'll build an AI assistant with quality control through policy-driven governance.

Overview

The example implements an AI assistant workflow with:

  • Policy Checking: Prevents forbidden instruction sequences
  • Dynamic Routing: Automatically retries when quality is low
  • Execution Tracking: Full history with timestamps and I/O
  • LangGraph Integration: Standard LangGraph patterns with governance

Complete Code

Here's the full examples/main.py:

import logging
from typing import TypedDict

from langgraph.graph import END, START, StateGraph
from rich.logging import RichHandler

from arbiteros_alpha import ArbiterOSAlpha
from arbiteros_alpha.policy import HistoryPolicyChecker, MetricThresholdPolicyRouter
from arbiteros_alpha.instructions import GENERATE, TOOL_CALL, EVALUATE

logger = logging.getLogger(__name__)

logging.basicConfig(
    level=logging.DEBUG,
    handlers=[RichHandler()],
)

# 1. Setup OS
os = ArbiterOSAlpha()

# Policy: Prevent direct generate->toolcall without proper flow
history_checker = HistoryPolicyChecker(
    name="no_direct_toolcall", bad_sequence=[GENERATE, TOOL_CALL]
)

# Policy: If confidence is low, regenerate the response
confidence_router = MetricThresholdPolicyRouter(
    name="regenerate_on_low_confidence",
    key="confidence",
    threshold=0.6,
    target="generate",
)

os.add_policy_checker(history_checker)
os.add_policy_router(confidence_router)

# 2. Define State
class State(TypedDict):
    """State for a simple AI assistant with tool usage and self-evaluation."""
    query: str
    response: str
    tool_result: str
    confidence: float

# 3. Define Instructions
@os.instruction(GENERATE)
def generate(state: State) -> dict:
    """Generate a response to the user query."""
    is_retry = bool(state.get("response"))

    if is_retry:
        response = "Here is my comprehensive and detailed response with much more content and explanation."
    else:
        response = "Short reply."

    return {"response": response}

@os.instruction("toolcall")
def tool_call(state: State) -> dict:
    """Call external tools to enhance the response."""
    return {"tool_result": "ok"}

@os.instruction("evaluate")
def evaluate(state: State) -> dict:
    """Evaluate confidence in the response quality."""
    response_length = len(state["response"])
    confidence = min(response_length / 100.0, 1.0)
    return {"confidence": confidence}

# 4. Build LangGraph
builder = StateGraph(State)
builder.add_node(GENERATE, generate)
builder.add_node(TOOL_CALL, tool_call)
builder.add_node(EVALUATE, evaluate)

builder.add_edge(START, GENERATE)
builder.add_edge(GENERATE, TOOL_CALL)
builder.add_edge(TOOL_CALL, EVALUATE)
builder.add_edge(EVALUATE, END)

graph = builder.compile()

# 5. Run
initial_state: State = {
    "query": "What is AI?",
    "response": "",
    "tool_result": "",
    "confidence": 0.0,
}

for chunk in graph.stream(initial_state, stream_mode="values", debug=False):
    logger.info(f"Current state: {chunk}\n")

# 6. View History
from arbiteros_alpha import print_history
print_history(os.history)

Step-by-Step Explanation

Step 1: Setup ArbiterOS and Policies

os = ArbiterOSAlpha()

# Policy 1: Sequence Validation
history_checker = HistoryPolicyChecker(
    name="no_direct_toolcall",
    bad_sequence=[GENERATE, TOOL_CALL]
)

# Policy 2: Quality-Based Routing
confidence_router = MetricThresholdPolicyRouter(
    name="regenerate_on_low_confidence",
    key="confidence",
    threshold=0.6,
    target="generate",
)

os.add_policy_checker(history_checker)
os.add_policy_router(confidence_router)

What's happening:

  • ArbiterOSAlpha(): Creates the governance coordinator
  • HistoryPolicyChecker: Monitors the executed instruction sequence and flags when the generate→toolcall pattern occurs; the violation is logged, but execution continues (see the conceptual sketch below)
  • MetricThresholdPolicyRouter: Watches the confidence metric; if it's below 0.6, routes back to generate for a retry
  • These policies are registered with the OS instance
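
To make the sequence check concrete, here is a minimal standalone sketch of the idea behind a bad-sequence check. It is not the library's implementation (the internals of HistoryPolicyChecker are not shown in this tutorial); it only illustrates how a forbidden pair can be found in an ordered history of instruction names.

# Conceptual sketch only -- not the actual HistoryPolicyChecker internals.
def violates_bad_sequence(history: list[str], bad_sequence: list[str]) -> bool:
    """Return True if bad_sequence appears as a contiguous run in history."""
    n = len(bad_sequence)
    return any(history[i:i + n] == bad_sequence for i in range(len(history) - n + 1))

# The pattern is flagged only when the two instructions ran back to back.
assert violates_bad_sequence(["generate", "toolcall"], ["generate", "toolcall"])
assert not violates_bad_sequence(["generate", "evaluate", "toolcall"], ["generate", "toolcall"])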

Step 2: Define State Schema

class State(TypedDict):
    """State for a simple AI assistant with tool usage and self-evaluation."""
    query: str
    response: str
    tool_result: str
    confidence: float

What's happening:

Standard LangGraph state definition. The state flows through all nodes and accumulates information:

  • query: User's question
  • response: Generated answer
  • tool_result: Result from tool execution
  • confidence: Quality score (0.0 to 1.0)

Step 3: Define Instruction Functions

Generate Instruction

@os.instruction("generate")
def generate(state: State) -> dict:
    """Generate a response to the user query."""
    is_retry = bool(state.get("response"))

    if is_retry:
        response = "Here is my comprehensive and detailed response with much more content and explanation."
    else:
        response = "Short reply."

    return {"response": response}

What's happening:

  • The @os.instruction("generate") decorator wraps the function with governance
  • First call: Returns a short response (simulating low-quality output)
  • Retry call: Detects existing response and generates a longer, better one
  • Returns only the fields that changed (partial state update)
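
The partial-update behaviour in the last bullet is standard LangGraph: the dict a node returns is merged into the existing state rather than replacing it. A rough illustration with plain dicts (LangGraph's real channel/reducer machinery is more involved):

# Illustration only: LangGraph merges the keys a node returns into the current state.
state = {"query": "What is AI?", "response": "", "tool_result": "", "confidence": 0.0}
update = {"response": "Short reply."}   # what the generate node returns on the first pass
state = {**state, **update}             # query, tool_result and confidence are preserved
assert state["query"] == "What is AI?" and state["response"] == "Short reply."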

Tool Call Instruction

@os.instruction(TOOL_CALL)
def tool_call(state: State) -> dict:
    """Call external tools to enhance the response."""
    return {"tool_result": "ok"}

What's happening:

  • Simulates calling external tools (APIs, databases, etc.)
  • The decorator tracks this execution and checks policies
  • Returns the tool result
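
In a real application the node body would call an actual service. Here is a hedged sketch of what that could look like, replacing the stub above; lookup_facts() is a hypothetical stand-in for your own API, database, or vector-store client and is not part of this example.

# Hypothetical sketch of a real tool node; lookup_facts() stands in for your own client.
def lookup_facts(query: str) -> str:
    """Placeholder for a search API call, database query, or vector-store lookup."""
    return f"3 documents found for {query!r}"

@os.instruction(TOOL_CALL)
def tool_call(state: State) -> dict:
    """Call an external service and store its result in the state."""
    try:
        result = lookup_facts(state["query"])
    except Exception as exc:            # surface tool failures instead of crashing the graph
        result = f"tool error: {exc}"
    return {"tool_result": result}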

Evaluate Instruction

@os.instruction(EVALUATE)
def evaluate(state: State) -> dict:
    """Evaluate confidence in the response quality."""
    response_length = len(state["response"])
    confidence = min(response_length / 100.0, 1.0)
    return {"confidence": confidence}

What's happening:

  • Calculates quality metric based on response length
  • Short responses (< 60 chars) get confidence < 0.6 (triggers retry)
  • Longer responses (>= 60 chars) get confidence >= 0.6 (passes)
  • This is where the routing decision is made
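
Worked through with the two responses this example actually produces, the formula behaves as follows:

# len("Short reply.") == 12 characters
min(12 / 100.0, 1.0)    # -> 0.12: below the 0.6 threshold, so the router retries
# the retry response is 86 characters long
min(86 / 100.0, 1.0)    # -> 0.86: above the threshold, so execution continues to END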

Step 4: Build LangGraph

builder = StateGraph(State)
builder.add_node(GENERATE, generate)
builder.add_node(TOOL_CALL, tool_call)
builder.add_node(EVALUATE, evaluate)

builder.add_edge(START, GENERATE)
builder.add_edge(GENERATE, TOOL_CALL)
builder.add_edge(TOOL_CALL, EVALUATE)
builder.add_edge(EVALUATE, END)

graph = builder.compile()

What's happening:

Standard LangGraph construction:

START → generate → toolcall → evaluate → END
           ↑________________________|
           (routes back if confidence < 0.6)

The routing from evaluate back to generate happens dynamically through the MetricThresholdPolicyRouter, not through static edges.
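
For comparison, the same loop could be expressed with an ordinary LangGraph conditional edge instead of a policy router. The sketch below assumes you drop the governance layer and the static evaluate→END edge, and wire the decision by hand:

# Hand-written alternative to the policy router (no ArbiterOS involved).
def route_on_confidence(state: State) -> str:
    return GENERATE if state["confidence"] < 0.6 else END

builder.add_conditional_edges(EVALUATE, route_on_confidence)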

Step 5: Execute

initial_state: State = {
    "query": "What is AI?",
    "response": "",
    "tool_result": "",
    "confidence": 0.0,
}

for chunk in graph.stream(initial_state, stream_mode="values", debug=False):
    logger.info(f"Current state: {chunk}\n")

What's happening:

  • Starts with empty response and zero confidence
  • Streams through the graph, printing state at each step
  • ArbiterOS governance runs automatically at each instruction
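
Streaming is convenient for watching intermediate states. If you only care about the final result, LangGraph's invoke() returns the last state directly:

# Alternative to streaming: run the whole graph and keep only the final state.
final_state = graph.invoke(initial_state)
print(final_state["response"], final_state["confidence"])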

Step 6: View History

from arbiteros_alpha import print_history
print_history(os.history)

Displays formatted execution history with all decisions and state changes.

Execution Flow

The actual execution follows this path:

First Iteration (Low Quality)

  1. GENERATE (attempt #1)
       • Input: {query: "What is AI?", response: "", ...}
       • Output: {response: "Short reply."}
       • Policy Check: ✓ No violations (first call)

  2. TOOL_CALL
       • Input: {..., response: "Short reply.", ...}
       • Output: {tool_result: "ok"}
       • Policy Check: ✗ Detects GENERATE→TOOL_CALL sequence (flagged but continues)

  3. EVALUATE
       • Input: {..., response: "Short reply.", tool_result: "ok", ...}
       • Output: {confidence: 0.12} (12 chars / 100 = 0.12)
       • Policy Check: ✗ Still has GENERATE→TOOL_CALL in history
       • Policy Route: ⚡ confidence < 0.6 → Routes to generate

Second Iteration (High Quality)

  1. GENERATE (attempt #2 - retry)
       • Input: {..., response: "Short reply.", ...} (response exists)
       • Output: {response: "Here is my comprehensive and detailed response with much more content and explanation."}
       • Policy Check: ✗ Still has old GENERATE→TOOL_CALL in history

  2. TOOL_CALL
       • Input: {..., response: "Here is my comprehensive...", ...}
       • Output: {tool_result: "ok"}
       • Policy Check: ✗ Multiple GENERATE→TOOL_CALL sequences now

  3. EVALUATE
       • Input: {..., response: "Here is my comprehensive...", ...}
       • Output: {confidence: 0.86} (86 chars / 100 = 0.86)
       • Policy Check: ✗ Multiple violations in history
       • Policy Route: ✓ confidence >= 0.6 → No routing, continues to END

Final State

{
    "query": "What is AI?",
    "response": "Here is my comprehensive and detailed response with much more content and explanation.",
    "tool_result": "ok",
    "confidence": 0.86
}

Example Output

When you run uv run -m examples.main, you'll see:

Console Logs

[DEBUG] Adding policy checker: HistoryPolicyChecker(name='no_direct_toolcall', bad_sequence='GENERATE->TOOL_CALL')
[DEBUG] Adding policy router: MetricThresholdPolicyRouter(name='regenerate_on_low_confidence', key='confidence', threshold=0.6, target='generate')

[DEBUG] Executing instruction: GENERATE
[DEBUG] Running 1 policy checkers (before)
[DEBUG] Instruction GENERATE returned: {'response': 'Short reply.'}
[DEBUG] Checking 1 policy routers

[DEBUG] Executing instruction: TOOL_CALL
[DEBUG] Running 1 policy checkers (before)
[ERROR] Blacklisted sequence detected: no_direct_toolcall:[GENERATE->TOOL_CALL] in [GENERATE->TOOL_CALL]
[ERROR] Policy checker HistoryPolicyChecker(...) failed validation.
[DEBUG] Instruction TOOL_CALL returned: {'tool_result': 'ok'}

[DEBUG] Executing instruction: EVALUATE
[DEBUG] Instruction EVALUATE returned: {'confidence': 0.12}
[WARNING] Routing decision made to: generate
[INFO] Routing from evaluate to generate

[DEBUG] Executing instruction: GENERATE
[DEBUG] Instruction GENERATE returned: {'response': 'Here is my comprehensive...'}
[DEBUG] Instruction EVALUATE returned: {'confidence': 0.86}

Execution History

📋 Arbiter OS Execution History
================================================================================

[1] GENERATE
  Timestamp: 2025-11-05 10:12:24.659058
  Input:
    query: What is AI?
    response: ''
    tool_result: ''
    confidence: 0.0
  Output:
    response: Short reply.
  Policy Checks:
    (none)
  Policy Routes:
    (none)

[2] TOOL_CALL
  Timestamp: 2025-11-05 10:12:24.662379
  Input:
    query: What is AI?
    response: Short reply.
    tool_result: ''
    confidence: 0.0
  Output:
    tool_result: ok
  Policy Checks:
    ✗ no_direct_toolcall
  Policy Routes:
    (none)

[3] EVALUATE
  Timestamp: 2025-11-05 10:12:24.666841
  Input:
    query: What is AI?
    response: Short reply.
    tool_result: ok
    confidence: 0.0
  Output:
    confidence: 0.12
  Policy Checks:
    ✗ no_direct_toolcall
  Policy Routes:
    → regenerate_on_low_confidence ⇒ generate

[4] GENERATE
  Timestamp: 2025-11-05 10:12:24.673606
  Input:
    query: What is AI?
    response: Short reply.
    tool_result: ok
    confidence: 0.12
  Output:
    response: Here is my comprehensive and detailed response with much more content and
      explanation.
  Policy Checks:
    ✗ no_direct_toolcall
  Policy Routes:
    (none)

[5] TOOL_CALL
  Timestamp: 2025-11-05 10:12:24.679333
  Input:
    query: What is AI?
    response: Here is my comprehensive and detailed response with much more content and
      explanation.
    tool_result: ok
    confidence: 0.12
  Output:
    tool_result: ok
  Policy Checks:
    ✗ no_direct_toolcall
  Policy Routes:
    (none)

[6] EVALUATE
  Timestamp: 2025-11-05 10:12:24.683659
  Input:
    query: What is AI?
    response: Here is my comprehensive and detailed response with much more content and
      explanation.
    tool_result: ok
    confidence: 0.12
  Output:
    confidence: 0.86
  Policy Checks:
    ✗ no_direct_toolcall
  Policy Routes:
    (none)

================================================================================

Key Insights

1. Policy Checking vs Routing

  • Policy Checkers (HistoryPolicyChecker): Detect and flag violations but don't block execution
  • Policy Routers (MetricThresholdPolicyRouter): Actively redirect execution flow

2. Automatic Retry Pattern

The router implements an automatic quality control loop:

Low Quality (confidence < 0.6) → Retry
High Quality (confidence >= 0.6) → Continue

No manual retry logic needed in your code!
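
For contrast, here is a sketch of the loop you would otherwise write yourself, assuming the decorated node functions can be called like plain Python functions:

# What the policy router saves you from writing: a manual retry loop in application code.
state = dict(initial_state)
for attempt in range(3):                # you now have to pick a retry cap yourself
    state.update(generate(state))
    state.update(tool_call(state))
    state.update(evaluate(state))
    if state["confidence"] >= 0.6:      # the quality gate is duplicated in user code
        break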

3. Full Observability

Every execution is tracked:

  • ✅ What ran, when, and with what inputs/outputs
  • ✅ Which policies triggered
  • ✅ Why routing decisions were made
  • ✅ Complete audit trail

4. LangGraph Compatibility

Notice how the LangGraph code is completely standard:

builder = StateGraph(State)
builder.add_node(GENERATE, generate)  # Decorated function works seamlessly
builder.add_edge(START, GENERATE)
graph = builder.compile()

Governance is added through decorators, not by changing the graph structure.

Running the Example

# From project root
uv run -m examples.main

Customization Ideas

Try modifying the example:

  1. Change the threshold: Set threshold=0.8 for stricter quality control
  2. Add more policies: Create a MaxRetriesPolicyRouter to prevent infinite loops
  3. Different metrics: Monitor response length, keyword presence, or API scores
  4. Multiple routers: Add routing for different failure modes
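
For idea 2, the sketch below shows only the counting logic such a router would need. The actual ArbiterOS base class and registration interface for custom routers are not covered in this tutorial, so the class and its decide() hook are hypothetical.

# Sketch only -- the real policy-router interface is not shown in this tutorial.
class MaxRetriesPolicyRouter:
    """Route back to `target` on low quality, but only up to `max_retries` times."""

    def __init__(self, name: str, key: str, threshold: float, target: str, max_retries: int = 2):
        self.name = name
        self.key = key
        self.threshold = threshold
        self.target = target
        self.max_retries = max_retries

    def decide(self, state: dict, history: list[str]) -> str | None:
        """Hypothetical hook: return a node name to route to, or None to continue."""
        attempts = history.count(self.target)       # how often the target already ran
        if state.get(self.key, 0.0) < self.threshold and attempts <= self.max_retries:
            return self.target
        return None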

Next Steps