Building an Agent Evaluation System:
A Practical Guide with Experiments
Nai5 · Independent Researcher · January 2026
Abstract
This paper presents a practical approach to building evaluation systems for AI agents. We demonstrate two key findings through controlled experiments:
- Prompt structure significantly impacts quality: Few-shot prompts achieve 100% accuracy vs 93.3% for basic prompts, while using 45% fewer tokens than structured prompts.
- Memory is critical for multi-turn tasks: Agents without memory score 2.3/10 on consistency vs 9.3/10 with full history.
All code is reproducible and provided.
1. Introduction
Building reliable AI agents requires answering: How do you know if your agent is good?
This paper addresses two measurable aspects:
- Prompt Engineering: Which prompt structures produce better results?
- Context Engineering: How does memory affect multi-turn performance?
2. Experiment 1: Prompt Optimization
2.1 Method
Task: Code generation (5 functions)
Prompt Templates:
```text
BASIC:      "{task}. Output only code."

STRUCTURED: "Task: {task}
             Requirements: Handle edge cases
             Output:"

FEW_SHOT:   "Example: def double(n): return n*2
             Now: {task}"

COT:        "Think step by step, then write {task}"
```
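For reproduction, the four templates are just formatting functions. A sketch in code: "basic" and "few_shot" match the PROMPTS dict in Section 4.2, while the other two entries are our rendering of the templates above, so treat the exact strings as illustrative.

```python
# The four prompt templates as plain formatting functions.
PROMPT_TEMPLATES = {
    "basic": lambda task: f"{task}. Output only code.",
    "structured": lambda task: (
        f"Task: {task}\n"
        "Requirements: Handle edge cases\n"
        "Output:"
    ),
    "few_shot": lambda task: (
        f"Example: def double(n): return n*2\n\nNow: {task}"
    ),
    "cot": lambda task: f"Think step by step, then write {task}",
}
```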
2.2 Results
| Prompt Style | Pass Rate | Avg Tokens | Verdict |
|---|---|---|---|
| basic | 93.3% | 11 | ⚠️ Misses edge cases |
| structured | 100% | 32 | ✅ Complete but verbose |
| few_shot | 100% | 22 | 🏆 Best efficiency |
| cot | 100% | 61 | Overkill for simple tasks |
Key Finding: The basic prompt failed on count_vowels because it did not handle uppercase input. Few-shot reached 100% with 45% fewer tokens than structured.
2.3 Why Basic Failed
```python
# Basic prompt produced:
def count_vowels(s):
    return sum(1 for c in s if c in 'aeiou')

# Failed test:
assert count_vowels("HELLO") == 2  # Returns 0!
```
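The fix is to normalize case before testing membership. A minimal corrected version (the passing prompts avoided the bug, though not necessarily with this exact code):

```python
def count_vowels(s):
    # Lowercase first so uppercase vowels are counted too.
    return sum(1 for c in s.lower() if c in 'aeiou')

assert count_vowels("HELLO") == 2  # E and O are now counted
```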
3. Experiment 2: Memory Impact
3.1 Method
Scenarios: 3 multi-turn dialogues (restaurant booking, tech support, product inquiry)
Memory Configs: No memory, Full history, Summary-based
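Concretely, the three configs differ only in what is replayed to the model each turn. A sketch of the idea (function and config names are illustrative, not the harness code; summarize() stands in for a cheap summarization call):

```python
# Assemble the message list for the next turn under each memory config.
def build_messages(config, history, user_turn, summarize=None):
    if config == "no_memory":
        # Discard everything earlier; send only the current utterance.
        return [{"role": "user", "content": user_turn}]
    if config == "full_history":
        # Replay every prior turn verbatim.
        return history + [{"role": "user", "content": user_turn}]
    if config == "summary":
        # One synthetic turn carrying a condensed record, then the new message.
        return [
            {"role": "user", "content": f"Conversation so far: {summarize(history)}"},
            {"role": "user", "content": user_turn},
        ]
    raise ValueError(f"unknown config: {config}")
```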
3.2 Results
| Memory Config | Consistency Score | Observation |
|---|---|---|
| No Memory | 2.33/10 | Catastrophic: repeatedly asks for information already given |
| Full History | 9.33/10 | Near-perfect retention |
| Summary Memory | 7.33/10 | 78% of full history effectiveness |
Key Finding: Without memory, agents lose context completely. Even a 3-turn conversation breaks down (repeatedly asking "how many people?" after being told twice).
3.3 No-Memory Failure Example
```text
Turn 1: "Book for 4 people"
Agent:  "When would you like to dine?"

Turn 2: "Saturday 7pm"
Agent:  "For Saturday 7pm - how many people?"        ← FORGOT

Turn 3: "Change to 6"
Agent:  "6 people - when would you like to come?"    ← FORGOT AGAIN
```
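Consistency was scored per dialogue on a 0-10 scale. A minimal LLM-judge scorer along these lines reproduces the setup (the rubric wording below is illustrative, not the exact judge prompt):

```python
import anthropic

client = anthropic.Anthropic()

def score_consistency(transcript: str) -> float:
    """Judge 0-10 how well the agent retains facts the user already
    stated. The rubric wording here is illustrative."""
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": ("Rate 0-10 how consistently the agent in this "
                        "dialogue remembers details the user already "
                        "provided. Reply with only a number.\n\n" + transcript),
        }],
    )
    return float(resp.content[0].text.strip())
```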
4. Building the System
4.1 Architecture
```text
EVALUATION PIPELINE
├── L1: Rules     (0.01ms, $0)    - format, syntax
├── L2: Tests     (40ms,   $0)    - execution, assertions
└── L3: LLM Judge (500ms,  $0.01) - quality, nuance
```
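Ordering the layers this way means most failures never reach the paid judge. A compact dispatcher showing the control flow (the layer functions and the pass threshold are placeholders you supply):

```python
# Run cheap checks first; only survivors reach the LLM judge.
def evaluate(output, rule_checks, unit_tests, llm_judge):
    for check in rule_checks:        # L1: ~free, format and syntax
        if not check(output):
            return {"layer": "L1", "passed": False}
    for test in unit_tests:          # L2: execution and assertions
        if not test(output):
            return {"layer": "L2", "passed": False}
    score = llm_judge(output)        # L3: the one paid call
    return {"layer": "L3", "passed": score >= 7, "score": score}  # threshold illustrative
```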
4.2 Minimal Working Example
```python
import anthropic

client = anthropic.Anthropic()

TEST_CASES = [
    {"prompt": "Write sum_list function",
     "test": "assert sum_list([1,2,3]) == 6"},
]

PROMPTS = {
    "basic": lambda t: f"{t}. Output only code.",
    "few_shot": lambda t: f"Example: def double(n): return n*2\n\nNow: {t}",
}

for ptype, pfn in PROMPTS.items():
    passed = 0
    for case in TEST_CASES:
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{"role": "user", "content": pfn(case["prompt"])}],
        )
        code = resp.content[0].text
        try:
            # Run the candidate followed by its test; any exception,
            # including a failed assertion, counts as a miss.
            exec(f"{code}\n{case['test']}")
            passed += 1
        except Exception:
            pass
    print(f"{ptype}: {passed}/{len(TEST_CASES)}")
```
5. Recommendations
| Task Type | Recommended Prompt |
|---|---|
| Simple (single function) | Few-shot |
| Medium (multi-step) | Structured |
| Complex (reasoning) | Chain-of-thought |

| Conversation Length | Recommended Memory |
|---|---|
| < 5 turns | Full history |
| 5-20 turns | Summary-based |
| > 20 turns | Retrieval-augmented |
6. Conclusion
- Prompt structure matters: a 6.7-point pass-rate gap between basic and few-shot
- Memory is critical: 7-point consistency difference (2.3 vs 9.3)
- Evaluation can be automated: Layer fast rules before expensive LLM judges
Full code available at the project repository.