You build an agent. It answers a few questions correctly. You ship it. Then someone changes the system prompt, swaps the model from GPT-4o to Gemini Flash to save costs, and suddenly the agent stops calling the right tools. Nobody notices for two weeks because there's no test suite, no regression check, nothing.
This is the default state of most agent deployments. We obsess over prompt engineering and model selection but have zero visibility into whether the agent is actually doing what it's supposed to.
I ran into this building Pori, an open-source agent framework. The agent worked great in my terminal. Then I deployed it as an API, changed a few things, and it silently degraded. I had no way to catch it. So I built an evaluation system directly into the framework.
The Four Things You Need to Measure
After digging into how production AI systems actually break, I landed on four evaluation types. Each one catches a different class of failure:
Reliability → Did the agent use the right tools?
Accuracy → Is the answer correct?
Performance → Is it fast enough? How much does it cost?
Safety → Did it say something it shouldn't have?
Most teams only check accuracy — "does it give the right answer?" — and even that is usually manual. But agents fail in ways that accuracy alone can't catch.
Reliability: The Tools Test
This is the most underrated eval. You don't need an LLM judge for it. It's purely deterministic.
The question is simple: given a task, did the agent call the tools you expected?
```python
from pori.eval import ReliabilityEval

eval = ReliabilityEval(
    agent=my_agent,
    expected_tool_calls=["web_search", "answer"],
)
result = await eval.run()
result.assert_passed()
# result.failed_tool_calls → ["web_search"] (if it didn't search)
```

Why this matters: I changed a system prompt once and the agent stopped using web_search entirely. It started hallucinating answers instead of looking them up. The answers sounded correct — an accuracy eval might even pass them — but the underlying behavior was broken.
Reliability evals catch this instantly. No LLM judge needed, no cost, runs in milliseconds. You should run these in CI on every prompt change.
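These checks are cheap enough to live in an ordinary test file. Here is a minimal, framework-free sketch of the same idea; the function name and the call-history lists are illustrative stand-ins, not Pori's API:

```python
# Illustrative, framework-free version of a reliability check;
# the names here are stand-ins, not Pori's API.
def check_tool_reliability(actual_calls: list[str], expected_calls: list[str]) -> list[str]:
    """Return the expected tools the agent never called (empty list means pass)."""
    return [tool for tool in expected_calls if tool not in actual_calls]

# An agent that answered without searching fails the check:
missed = check_tool_reliability(
    actual_calls=["answer"],
    expected_calls=["web_search", "answer"],
)
assert missed == ["web_search"]
```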
The implementation is straightforward — run the agent, inspect agent.memory.tool_call_history, compare against expected tools:
```python
actual_tools = [tc.tool_name for tc in agent.memory.tool_call_history]
failed = [t for t in expected_tool_calls if t not in actual_tools]
```

Accuracy: The LLM Judge
This one needs a separate LLM to score the agent's output. You give it the task, the expected answer, and the agent's actual answer, and it scores from 1-10.
```python
from pori.eval import AccuracyEval

eval = AccuracyEval(
    agent=my_agent,
    expected_output="Paris is the capital of France",
    evaluator_llm=judge_model,
    num_iterations=3,  # Run 3 times, average the scores
    threshold=7.0,     # Fail if average below 7
)
result = await eval.run()
print(f"Average score: {result.avg_score}/10")
print(f"Reason: {result.reason}")
```

The key design decision: use structured output for the judge. Don't ask it to "rate on a scale of 1-10" in free text — you'll get inconsistent formatting. Instead, force it into a schema:
```python
from pydantic import BaseModel, Field

class AccuracyScore(BaseModel):
    score: int = Field(..., ge=1, le=10)
    reason: str = Field(..., description="Reasoning for the score")
```

With with_structured_output(), the judge always returns a parseable score and explanation. No regex parsing, no "the score is approximately 7 out of 10" nonsense.
The num_iterations parameter matters. LLM judges are noisy — the same input can get a 7 one run and a 9 the next. Running multiple iterations and averaging smooths this out. Three is usually enough to get a stable signal.
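The averaging itself is trivial. Here is a sketch of the iterate-and-average pattern with a simulated judge standing in for a real LLM call; averaged_score and the cycling scores are illustrative, not Pori's implementation:

```python
from itertools import cycle

# Illustrative iterate-and-average pattern for noisy LLM judges;
# averaged_score is a stand-in, not Pori's implementation.
def averaged_score(judge_once, num_iterations=3, threshold=7.0):
    scores = [judge_once() for _ in range(num_iterations)]
    avg = sum(scores) / len(scores)
    return avg, avg >= threshold

# A simulated judge that wobbles between runs, like a real one does:
noisy_judge = cycle([7, 9, 8])
avg, passed = averaged_score(lambda: next(noisy_judge))
assert avg == 8.0 and passed
```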
Performance: The Benchmark
This one is obvious but almost nobody does it systematically. How long does your agent take? How much memory does it use? What's the p95 latency?
```python
from pori.eval import PerformanceEval

eval = PerformanceEval(
    func=lambda: agent.run(),
    num_iterations=10,
    warmup_runs=2,
    measure_memory=True,
)
result = await eval.run()
print(f"Average: {result.avg_run_time:.2f}s")
print(f"p95: {result.p95_run_time:.2f}s")
print(f"Memory: {result.avg_memory:.1f} MiB")
```

I use this when switching models. Moving from Claude Sonnet to Gemini Flash cut my average latency by 40% but increased step count by 2x (Flash needed more tool calls to reach the same answer). Without benchmarking both, I would have assumed Flash was just "faster" without realizing the agent was compensating with extra steps.
The warmup runs matter — first invocations load models, warm caches, and establish connections. You don't want that noise in your numbers.
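For intuition, here is a rough sketch of a warmup-aware micro-benchmark; PerformanceEval's actual internals may differ, and the p95 here is the simple nearest-rank variant:

```python
import time

# Rough sketch of a warmup-aware benchmark (nearest-rank p95);
# PerformanceEval's real internals may differ.
def benchmark(func, num_iterations=10, warmup_runs=2):
    for _ in range(warmup_runs):
        func()  # executed but discarded: warms caches and connections
    times = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    times.sort()
    p95 = times[min(len(times) - 1, int(0.95 * len(times)))]
    return {"avg": sum(times) / len(times), "p95": p95}

stats = benchmark(lambda: sum(range(1000)))
print(f"avg={stats['avg']:.6f}s p95={stats['p95']:.6f}s")
```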
Agent-as-Judge: Custom Criteria
The most flexible eval. You define the criteria in plain English, and an LLM judges whether the output passes.
```python
from pori.eval import AgentJudgeEval

eval = AgentJudgeEval(
    criteria="The response must cite at least one source and be under 200 words",
    judge_llm=judge_model,
    scoring="binary",  # PASS or FAIL
)
result = await eval.run(
    input="What causes inflation?",
    output=agent_response,
)
```

You can also use numeric scoring with a threshold:
```python
eval = AgentJudgeEval(
    criteria="Response should be technically accurate and well-structured",
    judge_llm=judge_model,
    scoring="numeric",  # 1-10
    threshold=7,
)
```

This is where evaluation becomes product-specific. "Is it professional?" "Does it follow our style guide?" "Does it avoid financial advice?" These aren't things a generic accuracy eval can check.
From Eval to Guardrail
Here's the insight that changed how I think about this: an eval and a guardrail are the same interface. An eval runs after the fact to check quality. A guardrail runs at request time to block bad inputs or outputs.
Same pre_check / post_check methods, same pass/fail logic. The only difference is when you call them.
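To make the dual use concrete, here is a toy checker with that interface. KeywordPolicyCheck is a deliberately dumb stand-in (a real ContentPolicyGuardrail would call an LLM judge), but the two call sites show the eval-versus-guardrail split:

```python
# Toy checker with the shared interface; names are illustrative,
# not Pori's actual classes. A real guardrail would use an LLM judge.
class KeywordPolicyCheck:
    def __init__(self, banned):
        self.banned = banned

    def pre_check(self, user_input: str) -> bool:
        return not any(w in user_input.lower() for w in self.banned)

    def post_check(self, agent_output: str) -> bool:
        return not any(w in agent_output.lower() for w in self.banned)

check = KeywordPolicyCheck(banned=["password"])

# As an eval: run after the fact over logged outputs.
logged_outputs = ["Here is the summary.", "The password is hunter2."]
pass_rate = sum(check.post_check(o) for o in logged_outputs) / len(logged_outputs)
assert pass_rate == 0.5

# As a guardrail: run at request time, before the agent ever executes.
assert check.pre_check("Summarize this article") is True
```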
```python
from pori import Agent
from pori.eval import ContentPolicyGuardrail, TopicGuardrail

agent = Agent(
    task="...",
    llm=llm,
    tools_registry=registry,
    guardrails=[
        ContentPolicyGuardrail(judge_llm=llm),
        TopicGuardrail(
            allowed_topics=["technology", "science"],
            judge_llm=llm,
        ),
    ],
)
result = await agent.run()
```

If the input fails pre_check, the agent never runs:
```json
{
  "completed": false,
  "blocked_by": "input_guardrail",
  "reason": "Input contains a request for illegal activities"
}
```

If the output fails post_check, the response is blocked before reaching the user:
```json
{
  "completed": false,
  "blocked_by": "output_guardrail",
  "reason": "Output contains personally identifiable information"
}
```

The agent ran, consumed tokens, did the work — but the unsafe output never leaves the system. This is table stakes for any production deployment.
Building Your Own Guardrail
Every guardrail is just an AgentJudgeEval with specific criteria:
```python
from pori.eval import AgentJudgeEval

class ComplianceGuardrail(AgentJudgeEval):
    def __init__(self, judge_llm):
        super().__init__(
            criteria=(
                "The output must not provide specific financial advice, "
                "medical diagnoses, or legal recommendations. "
                "General information is acceptable."
            ),
            judge_llm=judge_llm,
            scoring="binary",
            name="compliance",
        )
```

Attach it to the agent, and it runs automatically on every response. No separate infrastructure, no external moderation service — the check is just one more call to the judge LLM, which can be the same model the agent already uses.
What I'd Recommend
If you're deploying agents and don't have evals, start here:
- Reliability evals in CI — define expected tool calls for your 10 most common tasks. Run on every prompt/model change. Zero cost, instant feedback.
- One accuracy eval per critical path — pick the task that matters most. Write an expected output. Run AccuracyEval with num_iterations=3. Set a threshold. Fail the deploy if it drops.
- Content policy guardrail in production — takes 10 lines of code. Catches the stuff that gets you on the news.
- Performance benchmarks before model swaps — never assume a model is "faster" or "cheaper." Benchmark it with your actual agent, your actual tools, your actual prompts.
The eval system in Pori is open source. The code is at github.com/aloysathekge/pori under pori/eval/. Every eval inherits from BaseEval, implements run(), and optionally pre_check()/post_check() for guardrail use.
The best agent isn't the one with the best prompt. It's the one you can actually measure.