Building an Agent Evaluation System:
A Practical Guide with Experiments

Nai5 · Independent Researcher · January 2026

Abstract

This paper presents a practical approach to building evaluation systems for AI agents. We demonstrate two key findings through controlled experiments:

  1. Prompt structure significantly impacts quality: Few-shot prompts achieve 100% accuracy vs 93.3% for basic prompts, while using 45% fewer tokens than structured prompts.
  2. Memory is critical for multi-turn tasks: Agents without memory score 2.3/10 on consistency vs 9.3/10 with full history.

All code is reproducible and provided.

1. Introduction

Building reliable AI agents requires answering a simple question: how do you know if your agent is good?

This paper addresses two measurable aspects:

  1. Prompt structure: how template design affects accuracy and token cost (Section 2).
  2. Memory: how conversation state affects multi-turn consistency (Section 3).

2. Experiment 1: Prompt Optimization

2.1 Method

Task: Code generation (5 functions)

Prompt Templates:

BASIC: "{task}. Output only code."

STRUCTURED: "Task: {task}
Requirements: Handle edge cases
Output:"

FEW_SHOT: "Example: def double(n): return n*2
Now: {task}"

COT: "Think step by step, then write {task}"

2.2 Results

Prompt Style   Pass Rate   Avg Tokens   Verdict
basic          93.3%       11          ⚠️ Misses edge cases
structured     100%        32          ✓ Complete but verbose
few_shot       100%        22          🏆 Best efficiency
cot            100%        61          Overkill for simple tasks

Key Finding: The basic prompt failed count_vowels because it did not handle uppercase input. Few-shot reached 100% accuracy with 45% fewer tokens than the structured prompt.
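
A note on measurement: the token counts above are assumed to be output tokens as reported in the API's usage metadata, e.g.:

import anthropic

client = anthropic.Anthropic()

# Assumption: "Avg Tokens" = output tokens from the response's usage
# metadata; the experiment's exact counting method is not shown here.
resp = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    messages=[{"role": "user",
               "content": "Example: def double(n): return n*2\nNow: Write count_vowels"}],
)
print(resp.usage.output_tokens)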

2.3 Why Basic Failed

# Basic prompt produced:
def count_vowels(s):
    return sum(1 for c in s if c in 'aeiou')

# Failed test:
assert count_vowels("HELLO") == 2  # Returns 0!
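
One possible fix (not shown in the experiment output) is to normalize case before counting:

def count_vowels(s):
    return sum(1 for c in s.lower() if c in 'aeiou')

assert count_vowels("HELLO") == 2  # now passes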

3. Experiment 2: Memory Impact

3.1 Method

Scenarios: 3 multi-turn dialogues (restaurant booking, tech support, product inquiry)

Memory Configs: No memory, Full history, Summary-based
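
A minimal sketch of the three configurations as message-list builders. The summarize helper is hypothetical (e.g., an LLM call that condenses prior turns); it is not part of the experiment code.

def no_memory(history, user_msg):
    # Agent sees only the latest user turn.
    return [{"role": "user", "content": user_msg}]

def full_history(history, user_msg):
    # Agent sees every prior turn verbatim.
    return history + [{"role": "user", "content": user_msg}]

def summary_memory(history, user_msg, summarize):
    # Agent sees a condensed summary of prior turns plus the latest turn.
    # summarize(history) is a hypothetical helper, e.g. an LLM call.
    return [{"role": "user",
             "content": f"Conversation so far: {summarize(history)}\n\n{user_msg}"}]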

3.2 Results

Memory Config    Consistency Score   Observation
No Memory        2.33/10             Catastrophic - asks for given info repeatedly
Full History     9.33/10             Near-perfect retention
Summary Memory   7.33/10             78% of full-history effectiveness

Key Finding: Without memory, agents lose context completely; even a 3-turn conversation breaks down, with the agent repeatedly asking "how many people?" after being told twice.
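
Consistency was scored on a 0-10 scale. A minimal judge sketch follows; the exact rubric and judge prompt used in the experiment are not reproduced here, so treat the wording as an assumption.

import anthropic

client = anthropic.Anthropic()

def consistency_score(transcript: str) -> float:
    # Assumption: an LLM judge rates retention 0-10; the experiment's
    # actual judge prompt may differ.
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{"role": "user", "content":
            "Rate 0-10 how consistently the agent below remembers "
            "information the user already gave. Reply with only a number.\n\n"
            + transcript}],
    )
    return float(resp.content[0].text.strip())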

3.3 No-Memory Failure Example

Turn 1: "Book for 4 people"
Agent: "When would you like to dine?"

Turn 2: "Saturday 7pm"  
Agent: "For Saturday 7pm - how many people?" โŒ FORGOT

Turn 3: "Change to 6"
Agent: "6 people - when would you like to come?" โŒ FORGOT AGAIN

4. Building the System

4.1 Architecture

EVALUATION PIPELINE
├── L1: Rules (0.01ms, $0) - format, syntax
├── L2: Tests (40ms, $0) - execution, assertions
└── L3: LLM Judge (500ms, $0.01) - quality, nuance
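
One natural implementation runs the cheap layers first and escalates to the paid LLM judge only when they pass. The sketch below follows that pattern; function names are illustrative, not the system's actual API.

import ast

def l1_rules(code: str) -> bool:
    # L1: free, sub-millisecond syntax/format check.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def l2_tests(code: str, test: str) -> bool:
    # L2: run the code plus its assertion in an isolated namespace.
    try:
        exec(f"{code}\n{test}", {})
        return True
    except Exception:
        return False

def evaluate(code: str, test: str) -> dict:
    # Short-circuit: escalate only when the cheaper layer passes.
    if not l1_rules(code):
        return {"layer": "L1", "passed": False}
    if not l2_tests(code, test):
        return {"layer": "L2", "passed": False}
    # L3 would call an LLM judge here (see the API call in Section 4.2).
    return {"layer": "L3", "passed": True}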

4.2 Minimal Working Example

import anthropic

client = anthropic.Anthropic()

# Each case pairs a generation prompt with an assertion the
# generated code must pass.
TEST_CASES = [
    {"prompt": "Write sum_list function",
     "test": "assert sum_list([1,2,3]) == 6"}
]

# Prompt templates under comparison.
PROMPTS = {
    "basic": lambda t: f"{t}. Output only code.",
    "few_shot": lambda t: f"Example: def double(n): return n*2\n\nNow: {t}"
}

for ptype, pfn in PROMPTS.items():
    passed = 0
    for case in TEST_CASES:
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{"role": "user", "content": pfn(case["prompt"])}]
        )
        code = resp.content[0].text
        try:
            # Execute the generated code and its test in a fresh namespace;
            # any error (syntax, runtime, assertion) counts as a failure.
            exec(f"{code}\n{case['test']}", {})
            passed += 1
        except Exception:
            pass
    print(f"{ptype}: {passed}/{len(TEST_CASES)}")

5. Recommendations

Task Type                  Recommended Prompt
Simple (single function)   Few-shot
Medium (multi-step)        Structured
Complex (reasoning)        Chain-of-thought

Conversation Length        Recommended Memory
< 5 turns                  Full history
5-20 turns                 Summary-based
> 20 turns                 Retrieval-augmented
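
These recommendations are simple enough to encode as a lookup; the category names and thresholds below mirror the tables above.

def pick_prompt(task_type: str) -> str:
    # Map task complexity to the prompt style recommended above.
    return {"simple": "few_shot",
            "medium": "structured",
            "complex": "cot"}[task_type]

def pick_memory(turns: int) -> str:
    # Map expected conversation length to a memory strategy.
    if turns < 5:
        return "full_history"
    if turns <= 20:
        return "summary"
    return "retrieval_augmented"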

6. Conclusion

Two controlled experiments support two practical rules. First, prompt structure matters: few-shot prompting matched structured prompting at 100% accuracy while using 45% fewer tokens, and a basic prompt silently missed edge cases. Second, memory is essential for multi-turn agents: consistency dropped from 9.3/10 with full history to 2.3/10 with no memory, with summary-based memory a workable middle ground. A layered pipeline of rules, tests, and an LLM judge keeps such evaluations fast and cheap.

Full code available at the project repository.