Building an Agent Evaluation System:
A Practical Guide with Experiments

Nai5 · Independent Researcher · January 2026

Abstract

This paper presents a practical approach to building evaluation systems for AI agents. We demonstrate two key findings through controlled experiments:

  1. Prompt structure significantly impacts quality: Few-shot prompts achieve 100% accuracy vs 93.3% for basic prompts, while using 45% fewer tokens than structured prompts.
  2. Memory is critical for multi-turn tasks: Agents without memory score 2.3/10 on consistency vs 9.3/10 with full history.

All code is reproducible and provided.

1. Introduction

Building reliable AI agents requires answering a simple question: how do you know if your agent is good?

This paper addresses two measurable aspects:

  1. Prompt structure: how template design affects accuracy and token cost (Section 2).
  2. Memory: how conversation state affects multi-turn consistency (Section 3).

2. Experiment 1: Prompt Optimization

2.1 Method

Task: Code generation (5 functions)

Prompt Templates:

BASIC: "{task}. Output only code."

STRUCTURED: "Task: {task}
Requirements: Handle edge cases
Output:"

FEW_SHOT: "Example: def double(n): return n*2
Now: {task}"

COT: "Think step by step, then write {task}"

2.2 Results

Prompt Style   Pass Rate   Avg Tokens   Verdict
basic          93.3%       11          ⚠️ Misses edge cases
structured     100%        32          ✓ Complete but verbose
few_shot       100%        22          🏆 Best efficiency
cot            100%        61          Overkill for simple tasks

Key Finding: The basic prompt failed count_vowels because it did not handle uppercase input. Few-shot reached 100% accuracy with 45% fewer tokens than the structured prompt.
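
A note on measurement: the token counts above are assumed to be output tokens as reported in the API's usage metadata, e.g.:

import anthropic

client = anthropic.Anthropic()

# Assumption: "Avg Tokens" = output tokens from the response's usage
# metadata; the experiment's exact counting method is not shown here.
resp = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    messages=[{"role": "user",
               "content": "Example: def double(n): return n*2\nNow: Write count_vowels"}],
)
print(resp.usage.output_tokens)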

2.3 Why Basic Failed

# Basic prompt produced:
def count_vowels(s):
    return sum(1 for c in s if c in 'aeiou')

# Failed test:
assert count_vowels("HELLO") == 2  # Returns 0!
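
One possible fix (not shown in the experiment output) is to normalize case before counting:

def count_vowels(s):
    return sum(1 for c in s.lower() if c in 'aeiou')

assert count_vowels("HELLO") == 2  # now passes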

3. Experiment 2: Memory Impact

3.1 Method

Scenarios: 3 multi-turn dialogues (restaurant booking, tech support, product inquiry)

Memory Configs: No memory, Full history, Summary-based
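
A minimal sketch of the three configurations as message-list builders. The summarize helper is hypothetical (e.g., an LLM call that condenses prior turns); it is not part of the experiment code.

def no_memory(history, user_msg):
    # Agent sees only the latest user turn.
    return [{"role": "user", "content": user_msg}]

def full_history(history, user_msg):
    # Agent sees every prior turn verbatim.
    return history + [{"role": "user", "content": user_msg}]

def summary_memory(history, user_msg, summarize):
    # Agent sees a condensed summary of prior turns plus the latest turn.
    # summarize(history) is a hypothetical helper, e.g. an LLM call.
    return [{"role": "user",
             "content": f"Conversation so far: {summarize(history)}\n\n{user_msg}"}]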

3.2 Results

Memory Config    Consistency Score   Observation
No Memory        2.33/10             Catastrophic - asks for given info repeatedly
Full History     9.33/10             Near-perfect retention
Summary Memory   7.33/10             78% of full-history effectiveness

Key Finding: Without memory, agents lose context completely; even a 3-turn conversation breaks down, with the agent repeatedly asking "how many people?" after being told twice.
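
Consistency was scored on a 0-10 scale. A minimal judge sketch follows; the exact rubric and judge prompt used in the experiment are not reproduced here, so treat the wording as an assumption.

import anthropic

client = anthropic.Anthropic()

def consistency_score(transcript: str) -> float:
    # Assumption: an LLM judge rates retention 0-10; the experiment's
    # actual judge prompt may differ.
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{"role": "user", "content":
            "Rate 0-10 how consistently the agent below remembers "
            "information the user already gave. Reply with only a number.\n\n"
            + transcript}],
    )
    return float(resp.content[0].text.strip())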

3.3 No-Memory Failure Example

Turn 1: "Book for 4 people"
Agent: "When would you like to dine?"

Turn 2: "Saturday 7pm"  
Agent: "For Saturday 7pm - how many people?" โŒ FORGOT

Turn 3: "Change to 6"
Agent: "6 people - when would you like to come?" โŒ FORGOT AGAIN

4. Building the System

4.1 Architecture

EVALUATION PIPELINE
├── L1: Rules (0.01ms, $0) - format, syntax
├── L2: Tests (40ms, $0) - execution, assertions
└── L3: LLM Judge (500ms, $0.01) - quality, nuance
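
One natural implementation runs the cheap layers first and escalates to the paid LLM judge only when they pass. The sketch below follows that pattern; function names are illustrative, not the system's actual API.

import ast

def l1_rules(code: str) -> bool:
    # L1: free, sub-millisecond syntax/format check.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def l2_tests(code: str, test: str) -> bool:
    # L2: run the code plus its assertion in an isolated namespace.
    try:
        exec(f"{code}\n{test}", {})
        return True
    except Exception:
        return False

def evaluate(code: str, test: str) -> dict:
    # Short-circuit: escalate only when the cheaper layer passes.
    if not l1_rules(code):
        return {"layer": "L1", "passed": False}
    if not l2_tests(code, test):
        return {"layer": "L2", "passed": False}
    # L3 would call an LLM judge here (see the API call in Section 4.2).
    return {"layer": "L3", "passed": True}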

4.2 Minimal Working Example

import anthropic

client = anthropic.Anthropic()

# Each case pairs a generation prompt with an assertion the
# generated code must pass.
TEST_CASES = [
    {"prompt": "Write sum_list function",
     "test": "assert sum_list([1,2,3]) == 6"}
]

# Prompt templates under comparison.
PROMPTS = {
    "basic": lambda t: f"{t}. Output only code.",
    "few_shot": lambda t: f"Example: def double(n): return n*2\n\nNow: {t}"
}

for ptype, pfn in PROMPTS.items():
    passed = 0
    for case in TEST_CASES:
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{"role": "user", "content": pfn(case["prompt"])}]
        )
        code = resp.content[0].text
        try:
            # Execute the generated code and its test in a fresh namespace;
            # any error (syntax, runtime, assertion) counts as a failure.
            exec(f"{code}\n{case['test']}", {})
            passed += 1
        except Exception:
            pass
    print(f"{ptype}: {passed}/{len(TEST_CASES)}")

5. Recommendations

Task Type                  Recommended Prompt
Simple (single function)   Few-shot
Medium (multi-step)        Structured
Complex (reasoning)        Chain-of-thought

Conversation Length        Recommended Memory
< 5 turns                  Full history
5-20 turns                 Summary-based
> 20 turns                 Retrieval-augmented
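
These recommendations are simple enough to encode as a lookup; the category names and thresholds below mirror the tables above.

def pick_prompt(task_type: str) -> str:
    # Map task complexity to the prompt style recommended above.
    return {"simple": "few_shot",
            "medium": "structured",
            "complex": "cot"}[task_type]

def pick_memory(turns: int) -> str:
    # Map expected conversation length to a memory strategy.
    if turns < 5:
        return "full_history"
    if turns <= 20:
        return "summary"
    return "retrieval_augmented"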

6. Conclusion

Two controlled experiments support two practical rules. First, prompt structure matters: few-shot prompting matched structured prompting at 100% accuracy while using 45% fewer tokens, and a basic prompt silently missed edge cases. Second, memory is essential for multi-turn agents: consistency dropped from 9.3/10 with full history to 2.3/10 with no memory, with summary-based memory a workable middle ground. A layered pipeline of rules, tests, and an LLM judge keeps such evaluations fast and cheap.

Full code available at the project repository.