Award-winning whimsical navigation system demonstrating Mastra Scores for AI evaluation, comparing Wikipedia-based directions against Google Maps ground truth
Olde Maps is a standout entry from the Mastra Templates Hackathon, demonstrating sophisticated AI evaluation techniques through a whimsical navigation system that measures agent performance against established ground truth.
Best Use of Eval 🏆 - Judged by Confident AI, recognizing exceptional demonstration of AI evaluation and scoring methodologies.
Evaluation as Core Feature: Judges praised how it showcases “Mastra Scores/evals clearly and accessibly, using them as a central part of the demo rather than an afterthought”
Clever Eval Design: Uses deliberately imperfect Wikipedia-only generation to make evaluation metrics “visible and actionable”
Educational Excellence: Recognized as a “great educational template for Mastra Scores/evals” and “good learning tool” for custom scorer implementation
Meaningful Metrics: Demonstrates practical evaluation by integrating “Google Maps purely for scoring/ground truth, not for generation, which makes the evals meaningful and educational”
The system demonstrates specialized agent coordination:
Old Fella Agent: Provides whimsical persona-based directions with folksy charm and storytelling
Directions Agent: Wikipedia-driven routing that searches pages and chains references to propose routes
Intentional Design Choice: Deliberately avoids Google Maps for generation, using Wikipedia exclusively to create “charming but often inaccurate” guidance
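By way of illustration, a Wikipedia-only directions agent could be declared with Mastra's Agent API roughly as in the sketch below; the agent name, instructions, model choice, and tool module path are assumptions made for the example, not the template's actual source.

```typescript
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
// Hypothetical Wikipedia tools; the real template defines its own search/extract tools.
import { wikipediaSearchTool, wikipediaPageTool } from "./tools/wikipedia";

// Wikipedia-only directions agent: no mapping API is exposed to it, so every
// route it proposes must be assembled by chaining article references.
export const directionsAgent = new Agent({
  name: "directions-agent",
  instructions:
    "Propose driving directions using only the Wikipedia tools provided. " +
    "Chain references between road and place articles; do not invent sources.",
  model: openai("gpt-4o-mini"),
  tools: { wikipediaSearchTool, wikipediaPageTool },
});
```

Keeping Google Maps entirely out of this agent's tool set is what makes the later scoring meaningful: any accuracy it achieves has to come from Wikipedia alone.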
Tool Hallucination Scorer: Measures whether generated directions use only information available from tools (Wikipedia), tracking hallucination rates
Maps Performance Scorer: Calls Google Maps to compute reference routes and compares quality/performance against the Wikipedia-generated directions
Separation of Concerns: Google Maps used purely for evaluation ground truth, never for generation, making the scoring both meaningful and educational
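A plain-TypeScript sketch of the core calculation a Tool Hallucination Scorer needs is shown below; the data shapes and the substring-matching heuristic are assumptions, and the wiring into Mastra's scorer interface is deliberately omitted.

```typescript
// Core of a tool-hallucination check: the fraction of route claims that cannot
// be traced back to text returned by the Wikipedia tools during the run.
// The claim/evidence shapes are illustrative, not the project's actual types.

interface ScoredRun {
  claims: string[];       // road/place claims extracted from the generated directions
  toolEvidence: string[]; // text snippets returned by the Wikipedia tools
}

export function hallucinationRate({ claims, toolEvidence }: ScoredRun): number {
  if (claims.length === 0) return 0;
  const evidence = toolEvidence.join("\n").toLowerCase();
  const unsupported = claims.filter(
    (claim) => !evidence.includes(claim.toLowerCase()),
  );
  // 0 means every claim is grounded in tool output; 1 means everything is hallucinated.
  return unsupported.length / claims.length;
}
```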
The judges observed live demo results showing:
Wikipedia Integration: Custom search and page extraction finding roads/places, chaining references to propose complete routes
Measurable Hallucination: Tool Hallucination Scorer showed a ~71% hallucination rate in the demo run - “not great,” which was exactly the educational point
Performance Comparison: Maps Performance Scorer measured the Wikipedia route at roughly 85% of Google Maps’ performance for the same itinerary
Educational Impact: Judges noted it’s “kind of funny, but also potentially useful and something to learn from” because it creates a compelling evaluation story
Clear Eval Workflow: Concrete example of writing and wiring scorers into agent workflows
Mastra Framework: Workflow orchestration for agent coordination and evaluation pipeline
Wikipedia Integration: Exclusive use of Wikipedia search for routing information
OpenAI Models: AI agents for direction generation and natural language processing
Google Maps API: Ground truth comparison for route accuracy measurement
TypeScript: Type-safe development for reliable evaluation system operation
Route Generation: Wikipedia-based direction creation using folksy narrative style
Performance Comparison: Systematic comparison against Google Maps routing
Score Calculation: Quantified metrics for route accuracy and practical usefulness
Hallucination Detection: Measurement of AI-generated content versus factual geography
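To make the comparison step concrete, here is a small sketch of how a ground-truth score could be computed from a Google Maps reference route; the field names and the averaging formula are assumptions, not the template's exact scoring logic.

```typescript
// Sketch of the ground-truth comparison behind a Maps Performance Scorer:
// the Wikipedia-derived route is scored against a reference route from the
// Google Maps Directions API. A route matching the reference scores 1.0.

export interface RouteEstimate {
  distanceMeters: number;
  durationSeconds: number;
}

export function mapsPerformanceScore(
  generated: RouteEstimate,
  reference: RouteEstimate, // computed from the Google Maps Directions API
): number {
  // Routes longer or slower than the reference are penalized proportionally.
  const distanceRatio =
    reference.distanceMeters / Math.max(generated.distanceMeters, 1);
  const durationRatio =
    reference.durationSeconds / Math.max(generated.durationSeconds, 1);
  return Math.min(1, (distanceRatio + durationRatio) / 2);
}
```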
The judges specifically praised the project’s evaluation-centric design and educational value:
Core Evaluation Design: Recognized for making Mastra Scores “a central part of the demo rather than an afterthought”
Educational Template: Judges called it a “great educational template for Mastra Scores/evals” and “good learning tool”
Clever Methodology: Praised the “deliberately imperfect generation to make metrics visible and actionable”
Not Overly Complex: Noted as “not overly comprehensive, but a good starting spot” for learning evaluation techniques
Concrete Implementation: Valued as a “concrete example of writing and wiring scorers into an agent workflow”
This project demonstrates several important evaluation concepts:
Ground Truth Establishment: Using established services (Google Maps) as baseline for performance measurement
Custom Scorer Development: Writing domain-specific evaluation logic for unique use cases (see the wiring sketch after this list)
Quantified Assessment: Moving beyond subjective evaluation to data-driven performance metrics
Hallucination Measurement: Systematic detection of AI-generated content that deviates from reality
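Building on the sketches above, custom scorers might be wired around a single agent run roughly as follows; the module paths and helper callbacks are illustrative stand-ins, and the example assumes Mastra's `agent.generate()` returns the generated text on a `text` field, as in its documented usage.

```typescript
// Sketch of applying both custom scorers to one directions run, reusing the
// hallucinationRate and mapsPerformanceScore sketches above. Module paths and
// helper callbacks are hypothetical, not the template's real layout.

import { directionsAgent } from "./agents/directions";
import { hallucinationRate } from "./scorers/hallucination";
import { mapsPerformanceScore, type RouteEstimate } from "./scorers/maps-performance";

export async function evaluateDirections(
  query: string,
  helpers: {
    // Pull road/place claims out of the generated directions text.
    extractClaims: (text: string) => string[];
    // In practice evidence would come from the run's tool-call records;
    // the generated text is passed here only to keep the sketch self-contained.
    collectToolEvidence: (text: string) => string[];
    // Estimate distance/duration of the generated route.
    estimateRoute: (text: string) => RouteEstimate;
    // Fetch the ground-truth route from the Google Maps Directions API.
    fetchReferenceRoute: (query: string) => Promise<RouteEstimate>;
  },
) {
  const run = await directionsAgent.generate(query);

  return {
    hallucination: hallucinationRate({
      claims: helpers.extractClaims(run.text),
      toolEvidence: helpers.collectToolEvidence(run.text),
    }),
    mapsPerformance: mapsPerformanceScore(
      helpers.estimateRoute(run.text),
      await helpers.fetchReferenceRoute(query),
    ),
  };
}
```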
Problem Framing: Using whimsical, intentionally flawed navigation to make evaluation concepts engaging
User Experience: Balancing entertainment value with serious evaluation methodology
Educational Template: Providing clear patterns for custom evaluation implementation
Performance Visualization: Making evaluation results accessible and understandable
AI System Validation: Template for evaluating any AI system against established benchmarks
Custom Scoring Development: Patterns for writing domain-specific evaluation logic
Performance Monitoring: Continuous assessment of AI agent accuracy and reliability
Hallucination Detection: Systematic identification of AI-generated inaccuracies
Evaluation Methodology: Practical demonstration of AI scoring and validation techniques
Custom Scorer Patterns: Clear examples of implementing domain-specific evaluation logic
Ground Truth Comparison: Best practices for establishing baseline performance measurements
Mastra Scores Usage: Comprehensive example of Mastra’s evaluation framework capabilities
Evaluation Advocacy: Demonstrates the critical importance of systematic AI evaluation in production systems
Template Excellence: Provides a comprehensive foundation for building custom evaluation systems
Methodology Innovation: Shows how creative problem framing can make complex evaluation concepts accessible
Production Readiness: Offers patterns applicable to real-world AI system validation needs
This project showcases how thoughtful evaluation design can transform AI assessment from subjective judgment to quantified, systematic measurement. The recognition by Confident AI highlights its value as both an educational template and a practical demonstration of sophisticated evaluation techniques that are essential for reliable AI system deployment.