Award-winning whimsical navigation system demonstrating Mastra Scores for AI evaluation, comparing Wikipedia-based directions against Google Maps ground truth
Olde Maps is a standout entry from the Mastra Templates Hackathon, demonstrating sophisticated AI evaluation techniques through a whimsical navigation system that measures agent performance against established ground truth.
Best Use of Eval 🏆 - Judged by Confident AI, recognizing exceptional demonstration of AI evaluation and scoring methodologies.
Evaluation as Core Feature: Judges praised how it showcases “Mastra Scores/evals clearly and accessibly, using them as a central part of the demo rather than an afterthought”
Clever Eval Design: Uses deliberately imperfect Wikipedia-only generation to make evaluation metrics “visible and actionable”
Educational Excellence: Recognized as a “great educational template for Mastra Scores/evals” and “good learning tool” for custom scorer implementation
Meaningful Metrics: Demonstrates practical evaluation by integrating “Google Maps purely for scoring/ground truth, not for generation, which makes the evals meaningful and educational”
The system demonstrates specialized agent coordination:
Old Fella Agent: Provides whimsical persona-based directions with folksy charm and storytelling
Directions Agent: Wikipedia-driven routing that searches pages and chains references to propose routes
Intentional Design Choice: Deliberately avoids Google Maps for generation, using Wikipedia exclusively to create “charming but often inaccurate” guidance
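By way of illustration, a Wikipedia-only directions agent could be declared with Mastra's Agent API roughly as in the sketch below; the agent name, instructions, model choice, and tool module path are assumptions made for the example, not the template's actual source.

```typescript
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
// Hypothetical Wikipedia tools; the real template defines its own search/extract tools.
import { wikipediaSearchTool, wikipediaPageTool } from "./tools/wikipedia";

// Wikipedia-only directions agent: no mapping API is exposed to it, so every
// route it proposes must be assembled by chaining article references.
export const directionsAgent = new Agent({
  name: "directions-agent",
  instructions:
    "Propose driving directions using only the Wikipedia tools provided. " +
    "Chain references between road and place articles; do not invent sources.",
  model: openai("gpt-4o-mini"),
  tools: { wikipediaSearchTool, wikipediaPageTool },
});
```

Keeping Google Maps entirely out of this agent's tool set is what makes the later scoring meaningful: any accuracy it achieves has to come from Wikipedia alone.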
Tool Hallucination Scorer: Measures whether generated directions use only information available from tools (Wikipedia), tracking hallucination rates
Maps Performance Scorer: Calls Google Maps to compute reference routes and compares quality/performance against the Wikipedia-generated directions
Separation of Concerns: Google Maps used purely for evaluation ground truth, never for generation, making the scoring both meaningful and educational
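A plain-TypeScript sketch of the core calculation a Tool Hallucination Scorer needs is shown below; the data shapes and the substring-matching heuristic are assumptions, and the wiring into Mastra's scorer interface is deliberately omitted.

```typescript
// Core of a tool-hallucination check: the fraction of route claims that cannot
// be traced back to text returned by the Wikipedia tools during the run.
// The claim/evidence shapes are illustrative, not the project's actual types.

interface ScoredRun {
  claims: string[];       // road/place claims extracted from the generated directions
  toolEvidence: string[]; // text snippets returned by the Wikipedia tools
}

export function hallucinationRate({ claims, toolEvidence }: ScoredRun): number {
  if (claims.length === 0) return 0;
  const evidence = toolEvidence.join("\n").toLowerCase();
  const unsupported = claims.filter(
    (claim) => !evidence.includes(claim.toLowerCase()),
  );
  // 0 means every claim is grounded in tool output; 1 means everything is hallucinated.
  return unsupported.length / claims.length;
}
```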
The judges observed live demo results showing:
Wikipedia Integration: Custom search and page extraction finding roads/places, chaining references to propose complete routes
Measurable Hallucination: Tool Hallucination Scorer showed a ~71% hallucination rate in the demo run - “not great,” which was exactly the educational point
Performance Comparison: Maps Performance Scorer measured the Wikipedia route at roughly 85% of Google Maps’ performance for the same itinerary
Educational Impact: Judges noted it’s “kind of funny, but also potentially useful and something to learn from” because it creates a compelling evaluation story
Clear Eval Workflow: Concrete example of writing and wiring scorers into agent workflows
Mastra Framework: Workflow orchestration for agent coordination and evaluation pipeline
Wikipedia Integration: Exclusive use of Wikipedia search for routing information
OpenAI Models: AI agents for direction generation and natural language processing
Google Maps API: Ground truth comparison for route accuracy measurement
TypeScript: Type-safe development for reliable evaluation system operation
Route Generation: Wikipedia-based direction creation using folksy narrative style
Performance Comparison: Systematic comparison against Google Maps routing
Score Calculation: Quantified metrics for route accuracy and practical usefulness
Hallucination Detection: Measurement of AI-generated content versus factual geography
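To make the comparison step concrete, here is a small sketch of how a ground-truth score could be computed from a Google Maps reference route; the field names and the averaging formula are assumptions, not the template's exact scoring logic.

```typescript
// Sketch of the ground-truth comparison behind a Maps Performance Scorer:
// the Wikipedia-derived route is scored against a reference route from the
// Google Maps Directions API. A route matching the reference scores 1.0.

export interface RouteEstimate {
  distanceMeters: number;
  durationSeconds: number;
}

export function mapsPerformanceScore(
  generated: RouteEstimate,
  reference: RouteEstimate, // computed from the Google Maps Directions API
): number {
  // Routes longer or slower than the reference are penalized proportionally.
  const distanceRatio =
    reference.distanceMeters / Math.max(generated.distanceMeters, 1);
  const durationRatio =
    reference.durationSeconds / Math.max(generated.durationSeconds, 1);
  return Math.min(1, (distanceRatio + durationRatio) / 2);
}
```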
The judges specifically praised the project’s evaluation-centric design and educational value:
Core Evaluation Design: Recognized for making Mastra Scores “a central part of the demo rather than an afterthought”
Educational Template: Judges called it a “great educational template for Mastra Scores/evals” and “good learning tool”
Clever Methodology: Praised the “deliberately imperfect generation to make metrics visible and actionable”
Not Overly Complex: Noted as “not overly comprehensive, but a good starting spot” for learning evaluation techniques
Concrete Implementation: Valued as a “concrete example of writing and wiring scorers into an agent workflow”
This project demonstrates several important evaluation concepts:
Ground Truth Establishment: Using established services (Google Maps) as baseline for performance measurement
Custom Scorer Development: Writing domain-specific evaluation logic for unique use cases (see the wiring sketch after this list)
Quantified Assessment: Moving beyond subjective evaluation to data-driven performance metrics
Hallucination Measurement: Systematic detection of AI-generated content that deviates from reality
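Building on the sketches above, custom scorers might be wired around a single agent run roughly as follows; the module paths and helper callbacks are illustrative stand-ins, and the example assumes Mastra's `agent.generate()` returns the generated text on a `text` field, as in its documented usage.

```typescript
// Sketch of applying both custom scorers to one directions run, reusing the
// hallucinationRate and mapsPerformanceScore sketches above. Module paths and
// helper callbacks are hypothetical, not the template's real layout.

import { directionsAgent } from "./agents/directions";
import { hallucinationRate } from "./scorers/hallucination";
import { mapsPerformanceScore, type RouteEstimate } from "./scorers/maps-performance";

export async function evaluateDirections(
  query: string,
  helpers: {
    // Pull road/place claims out of the generated directions text.
    extractClaims: (text: string) => string[];
    // In practice evidence would come from the run's tool-call records;
    // the generated text is passed here only to keep the sketch self-contained.
    collectToolEvidence: (text: string) => string[];
    // Estimate distance/duration of the generated route.
    estimateRoute: (text: string) => RouteEstimate;
    // Fetch the ground-truth route from the Google Maps Directions API.
    fetchReferenceRoute: (query: string) => Promise<RouteEstimate>;
  },
) {
  const run = await directionsAgent.generate(query);

  return {
    hallucination: hallucinationRate({
      claims: helpers.extractClaims(run.text),
      toolEvidence: helpers.collectToolEvidence(run.text),
    }),
    mapsPerformance: mapsPerformanceScore(
      helpers.estimateRoute(run.text),
      await helpers.fetchReferenceRoute(query),
    ),
  };
}
```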
Problem Framing: Using whimsical, intentionally flawed navigation to make evaluation concepts engaging
User Experience: Balancing entertainment value with serious evaluation methodology
Educational Template: Providing clear patterns for custom evaluation implementation
Performance Visualization: Making evaluation results accessible and understandable
AI System Validation: Template for evaluating any AI system against established benchmarks
Custom Scoring Development: Patterns for writing domain-specific evaluation logic
Performance Monitoring: Continuous assessment of AI agent accuracy and reliability
Hallucination Detection: Systematic identification of AI-generated inaccuracies
Evaluation Methodology: Practical demonstration of AI scoring and validation techniques
Custom Scorer Patterns: Clear examples of implementing domain-specific evaluation logic
Ground Truth Comparison: Best practices for establishing baseline performance measurements
Mastra Scores Usage: Comprehensive example of Mastra’s evaluation framework capabilities
Evaluation Advocacy: Demonstrates the critical importance of systematic AI evaluation in production systems
Template Excellence: Provides a comprehensive foundation for building custom evaluation systems
Methodology Innovation: Shows how creative problem framing can make complex evaluation concepts accessible
Production Readiness: Offers patterns applicable to real-world AI system validation needs
This project showcases how thoughtful evaluation design can transform AI assessment from subjective judgment to quantified, systematic measurement. The recognition by Confident AI highlights its value as both an educational template and a practical demonstration of sophisticated evaluation techniques that are essential for reliable AI system deployment.