AI/ML · Data Strategy · Product Leadership · Synthetic Data

Building an AI Evaluation Dashboard: Monitoring LLM Performance at Scale

Year: 2025
Role: Product Manager
Company: BrandComms.ai @ Forethought

Introduction

As LLM applications scale, monitoring accuracy and detecting drift becomes critical. This post covers building an evaluation and monitoring system for a large-scale synthetic data generation pipeline, including a Tableau dashboard for real-time visibility.

The Challenge

We needed to:

  • Generate structured survey responses for demographic segments
  • Ensure consistent, high-quality outputs across batches
  • Detect performance drift early
  • Provide actionable insights to stakeholders

Scale: Processing 40+ demographic segments across 30+ survey questions, generating thousands of scored responses per batch.

System Architecture

Three-Phase Assessment Pipeline

The pipeline runs each assessment through three structured phases:

Phase 1: Brand Priming & Context Establishment

  • Ensures the LLM understands the brand context
  • Validates understanding before scoring

Phase 2: Category Assessment with Probing

  • Evaluates responses on four criteria (Relevance, Depth, Clarity, Perspective)
  • Iterative probing when scores are below threshold
  • Pass threshold: 79/100 across all criteria

Phase 3: Semantic-to-Numeric Translation

  • Generates semantic responses first
  • Translates to numeric scores (0-10 scale) with reasoning
  • Maintains consistency across questions
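
For illustration, here is a minimal sketch of how the three phases chain together. The helper names, the probing cap, and the stubbed LLM calls are assumptions made for readability, not the production implementation.

```python
# Illustrative control flow for the three-phase assessment.
# The helpers below are stand-ins (assumed names), not the production API.
import random

PASS_THRESHOLD = 79       # per-criterion pass mark (out of 100)
MAX_ITERATIONS = 3        # assumed cap on probing rounds
CRITERIA = ["Relevance", "Depth", "Clarity", "Perspective"]

def call_llm(prompt):
    """Stand-in for the real GPT-4 call."""
    return f"<response to: {prompt[:40]}...>"

def assess_category(question):
    """Stand-in scorer; the real version parses criterion scores
    from the model's assessment dialogue."""
    return {c: random.randint(60, 100) for c in CRITERIA}

def run_assessment(brand_context, question):
    # Phase 1: brand priming - establish context and confirm understanding
    call_llm(f"Brand context:\n{brand_context}\nSummarize the positioning before scoring.")

    # Phase 2: category assessment with iterative probing until every
    # criterion clears the threshold (or the iteration cap is reached)
    scores = assess_category(question)
    iterations = 1
    while min(scores.values()) < PASS_THRESHOLD and iterations < MAX_ITERATIONS:
        call_llm("Probe deeper on the weakest criterion and reassess.")
        scores = assess_category(question)
        iterations += 1

    # Phase 3: semantic response first, then a 0-10 numeric translation with reasoning
    semantic = call_llm(f"Answer as the target segment: {question}")
    translation = call_llm(f"Translate this answer to a 0-10 score with reasoning: {semantic}")

    return {"criteria_scores": scores, "iterations": iterations,
            "semantic_response": semantic, "numeric_translation": translation}
```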

Key Technical Features

Resume Capability

  • Checkpoint system saves progress after each subsegment
  • Automatic resume from last completed point
  • Preserves all assessment dialogues and ratings
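
A minimal sketch of the checkpoint/resume logic, assuming one JSON checkpoint file per batch run; the file layout and field names are illustrative.

```python
# Checkpoint/resume sketch: save after every subsegment, skip completed work on restart.
import json
from pathlib import Path

CHECKPOINT = Path("batch_checkpoint.json")  # illustrative location

def load_checkpoint():
    """Return saved progress, or a fresh state if no checkpoint exists."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed_subsegments": [], "results": {}}

def save_checkpoint(state):
    """Persist progress after each subsegment so a crash loses at most one unit of work."""
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def run_batch(subsegments):
    state = load_checkpoint()
    for subsegment in subsegments:
        if subsegment in state["completed_subsegments"]:
            continue  # automatic resume: skip anything already finished
        state["results"][subsegment] = f"scores for {subsegment}"  # placeholder for real scoring
        state["completed_subsegments"].append(subsegment)
        save_checkpoint(state)
    return state
```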

Versioned Prompt System

  • Jinja2 templates for all prompts
  • Version registry for A/B testing
  • Easy rollback and iteration
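
Conceptually, the registry maps each prompt name to its available versions and an active default. A sketch along those lines, with an illustrative directory layout and version keys:

```python
# Versioned prompt rendering on top of Jinja2 (registry structure is illustrative).
from jinja2 import Environment, FileSystemLoader

PROMPT_REGISTRY = {
    # prompt name -> active version + version-to-template mapping
    "category_assessment": {
        "active": "v3.2",
        "versions": {
            "v3.1": "category_assessment_v3_1.j2",
            "v3.2": "category_assessment_v3_2.j2",
        },
    },
}

env = Environment(loader=FileSystemLoader("prompts/"))

def render_prompt(name, version=None, **variables):
    """Render a registered prompt. Pass an explicit version for A/B tests;
    omit it to use the active version (rollback = flip 'active')."""
    entry = PROMPT_REGISTRY[name]
    chosen = version or entry["active"]
    template = env.get_template(entry["versions"][chosen])
    return template.render(**variables)

# Example (assumes the template files exist under prompts/):
# prompt = render_prompt("category_assessment", question="CVAPA3", segment="Urban Professionals")
```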

Comprehensive Logging

  • Full assessment dialogues preserved
  • Semantic responses and translation notes
  • Detailed reasoning chains for auditability
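
The shape of one logged assessment record looks roughly like this; the field names are assumptions based on the items above, not the exact production schema.

```python
# Illustrative structure of a single logged assessment record.
import json
from datetime import datetime, timezone

record = {
    "cohort_id": 4,
    "subsegment": "Urban Professionals",
    "question_code": "CVAPA3",
    "criteria_scores": {"Relevance": 84, "Depth": 81, "Clarity": 90, "Perspective": 83},
    "iterations": 2,
    "assessment_dialogue": ["<prompt>", "<model reply>", "<probe>", "<model reply>"],
    "semantic_response": "<free-text answer from the segment persona>",
    "numeric_score": 7,
    "translation_notes": "<reasoning for mapping the semantic answer to 7/10>",
    "prompt_version": "v3.2",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(record, indent=2))
```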

The Monitoring Solution

Data Pipeline

A Python ETL pipeline turns the raw batch outputs into dashboard-ready data:

  • Data Extraction: Reads scores, assessments, and detailed logs from multiple batch runs
  • Transformation: Calculates accuracy metrics, drift indicators, and performance statistics
  • Loading: Exports structured CSV files optimized for Tableau consumption
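
A condensed sketch of those three steps, assuming each batch directory contains a scores.csv; the paths and column names are illustrative.

```python
# Extract -> transform -> load sketch for the Tableau-facing CSVs.
from pathlib import Path
import pandas as pd

def extract(batch_dirs):
    """Read per-batch score files into one frame tagged with the batch id."""
    frames = []
    for batch_dir in batch_dirs:
        df = pd.read_csv(Path(batch_dir) / "scores.csv")
        df["batch"] = Path(batch_dir).name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

def transform(df):
    """Aggregate per question and batch: mean, variance, pass rate, and count."""
    return (df.groupby(["batch", "question_code"])
              .agg(mean_score=("score", "mean"),
                   score_var=("score", "var"),
                   pass_rate=("passed", "mean"),
                   n=("score", "size"))
              .reset_index())

def load(summary, out_dir="tableau_extracts"):
    """Write flat CSVs shaped for Tableau to consume as a data source."""
    Path(out_dir).mkdir(exist_ok=True)
    summary.to_csv(Path(out_dir) / "question_batch_summary.csv", index=False)

# load(transform(extract(["output/batch_01", "output/batch_02"])))
```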

Key Metrics Calculated:

  • Assessment pass rates by category
  • Average iterations per assessment
  • Criteria score trends
  • Mean score deltas between batches
  • Variance changes over time
  • Outlier detection and missing data rates
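
For reference, a rough sketch of how the accuracy-side metrics fall out of a per-assessment DataFrame; the column names are assumed for illustration.

```python
# Accuracy metrics from one row per assessment (columns are illustrative).
import pandas as pd

def accuracy_metrics(assessments: pd.DataFrame) -> dict:
    return {
        # pass rate by category (mean of a boolean 'passed' column)
        "pass_rate_by_category": assessments.groupby("category")["passed"].mean(),
        # average probing iterations per assessment
        "avg_iterations": assessments["iterations"].mean(),
        # per-batch trend of the four criteria scores
        "criteria_trend": assessments.groupby("batch")[
            ["relevance", "depth", "clarity", "perspective"]].mean(),
        # share of assessments with no final score recorded
        "missing_rate": assessments["score"].isna().mean(),
    }
```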

Dashboard Architecture

Four interconnected dashboards provide visibility:

Dashboard 1: LLM Accuracy Monitor

Purpose: Track assessment quality and LLM performance

Key Visualizations:

Top Row - Summary KPIs:

  • Overall Assessment Pass Rate (with trend indicator)
  • Average Iterations Required (lower is better)
  • Composite Criteria Score Average

Middle Section - Category Analysis:

  • Horizontal bar chart: Pass Rate by Category
  • Histogram: Distribution of Iteration Counts

Bottom Section - Trend Analysis:

  • Line chart: Pass Rate Trend Over Time
  • Line chart: Criteria Scores Over Time

Insights Provided: Which categories require more probing, whether LLM performance is improving or degrading, and early warning when pass rates drop below thresholds.

Dashboard 2: Data Drift Monitor

Purpose: Detect shifts in score distributions and identify anomalies

Key Visualizations:

Distribution Comparison:

  • Overlapping histograms: Baseline vs Current batch
  • Visual comparison of shape, mean, and variance shifts

Drift Heatmap:

  • Rows: Question codes (33 questions)
  • Columns: Batches over time
  • Color intensity: Magnitude of mean score delta

Alert System:

  • Automatic flagging when the mean delta exceeds ±1.0
  • Highlighting of questions with significant variance changes
  • Outlier detection visualization
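
Under the hood, the drift checks reduce to comparing per-question summaries against the baseline batch. A sketch with illustrative column names:

```python
# Drift flagging: compare current-batch question summaries to the baseline batch.
import pandas as pd

MEAN_DELTA_THRESHOLD = 1.0  # flag when |current mean - baseline mean| exceeds 1.0

def drift_flags(baseline: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    merged = baseline.merge(current, on="question_code", suffixes=("_base", "_cur"))
    merged["mean_delta"] = merged["mean_score_cur"] - merged["mean_score_base"]
    merged["variance_ratio"] = merged["score_var_cur"] / merged["score_var_base"]
    merged["flagged"] = merged["mean_delta"].abs() > MEAN_DELTA_THRESHOLD
    return merged[["question_code", "mean_delta", "variance_ratio", "flagged"]]
```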

Use Case Example: "We noticed a subtle shift in Product category scores. The drift dashboard highlighted that CVAPA3 (quality ingredients) showed a -1.2 point delta. Investigation revealed a prompt version change that affected this specific question."

Dashboard 3: Performance Overview

Purpose: Track batch execution progress and system health

Key Visualizations:

Progress Tracking:

  • Gauge chart: Batch Completion Status
  • Color zones: Green (80-100%), Yellow (50-80%), Red (<50%)
  • Real-time updates from checkpoint files

Detailed Status Table:

  • Subsegment-level completion status
  • Last updated timestamps
  • Color-coded status (Complete, In Progress, Failed)

Real-World Impact: "During a batch run, we noticed completion stalled at 35/42 subsegments. The dashboard immediately flagged this, and we discovered an API rate limit issue. Quick intervention prevented data loss."

Dashboard 4: Question-Level Analysis

Purpose: Deep dive into individual question performance and patterns

Key Visualizations:

Score Distributions:

  • Box plots: Score distribution by question code
  • Outlier identification (scores outside 1.5×IQR; a small sketch of this rule follows at the end of this section)

Cohort Heatmap:

  • Rows: Demographic cohorts (42 cohorts)
  • Columns: Question codes
  • Pattern identification across demographics

Advanced Filtering:

  • Filter by category, cohort, and subsegment
  • Multi-select capabilities for comparative analysis
  • Drill-down from summary to individual responses
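
The outlier rule referenced above is the standard 1.5×IQR test. A small sketch with illustrative names:

```python
# 1.5*IQR outlier rule used for the box-plot outlier markers.
import pandas as pd

def iqr_outliers(scores: pd.Series) -> pd.Series:
    q1, q3 = scores.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return scores[(scores < lower) | (scores > upper)]

# Example: outliers for one question across all cohorts (column names assumed)
# outliers = iqr_outliers(df.loc[df["question_code"] == "CVAPA3", "score"])
```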

Interactive Dashboard Demonstrations

Explore live visualizations showcasing the monitoring capabilities of the AI Evaluation Dashboard. All data represents realistic production metrics.

1. LLM Accuracy Monitor

Track assessment quality, iteration patterns, and LLM performance across categories and batches.

  • Overall Pass Rate: 92.4% (↑ 2.1% from last batch)
  • Avg Iterations: 1.42 (↓ 0.08 from last batch)
  • Avg Criteria Score: 87.8 (↑ 1.3 from last batch)

Charts shown: Pass Rate by Category · Iteration Distribution · Pass Rate Trend Over Time · Criteria Scores Over Time

2. Data Drift Monitor

Detect shifts in score distributions, identify anomalies, and receive alerts when drift exceeds thresholds.

  • Questions with High Drift: 3 (alert threshold: ±1.0)
  • Mean Score Delta: -0.18 (vs Baseline, Batch 1)
  • Outlier Count: 12 (2.1% of total scores)

Charts shown: Score Distribution (Baseline vs Current) · Mean Score Delta by Question (red bars indicate negative drift, green bars positive drift)

High Drift Alert: CVAPA3

Question Category: Product (Quality Ingredients)
Mean Score Delta: -1.2 points
Baseline Mean: 7.8
Current Mean: 6.6
Investigation Note: Score shift detected after prompt version 3.2 deployment. Consider reviewing prompt changes for this specific question category.

3. Performance Overview

Monitor batch execution progress, track system health, and view real-time completion status.

Batch Completion Status

78% Complete
Completed Subsegments: 33 / 42
Scores Generated: 1,089 / 1,386
Estimated Time Remaining: ~45 minutes

Processing Time Trend

Average processing time: 2.1 hours per batch

Subsegment Completion Status

Cohort ID | Subsegment | Status | Last Updated
1 | Tech-Savvy Millennials | Complete | 2 min ago
2 | Budget-Conscious Families | Complete | 5 min ago
3 | Health-Focused Seniors | Complete | 8 min ago
4 | Urban Professionals | In Progress | Just now
5 | Rural Traditionalists | Pending | -
6 | Eco-Conscious Shoppers | Pending | -

4. Question-Level Analysis

Deep dive into individual question performance, identify patterns across demographics, and compare score distributions.

  • Total Questions: 33 (across 5 categories)
  • Average Score: 7.6 (all questions, 0-10 scale)
  • Score Std Dev: 1.4 (consistency metric)

Top 5 & Bottom 5 Questions by Average Score

Score Distribution by Category (Box Plot)

  • Product: median 7.8
  • Service: median 7.9
  • Reputation: median 7.2
  • Price: median 7.7
  • Brand: median 8.3

Reading the visualization:

  • Line extends from minimum to maximum score
  • Box represents Q1 (25th percentile) to Q3 (75th percentile)
  • Vertical line inside the box shows the median

Key Insights

  • Brand Priming questions consistently score highest (median 8.3), indicating strong LLM understanding of brand context.
  • Reputation questions show more variability (wider distribution), suggesting this category may benefit from prompt refinement.
  • CVAPA3 and CVAR1 are consistent low performers across batches, warranting deeper investigation into question design or prompt handling.

Key Achievements

1. Early Detection

The drift dashboard identified subtle shifts before they impacted downstream analysis, enabling prompt adjustments.

2. Quality Assurance

The accuracy monitor provided visibility into LLM performance, helping optimize prompts and reduce iterations.

3. Operational Efficiency

The performance dashboard enabled proactive monitoring, reducing manual checks and improving batch reliability.

4. Data-Driven Decisions

Question-level analysis revealed patterns that informed prompt engineering and question design.

Technical Stack

Backend

Python 3.9+, pandas, numpy

LLM Integration

OpenAI GPT-4

Data Pipeline

Custom ETL with checkpoint/resume

Visualization

Tableau Desktop/Server

Data Storage

CSV files with JSON metadata

Version Control

Git for prompt templates

Conclusion

This system demonstrates:

  • Scalability: Handling thousands of evaluations across multiple batches
  • Reliability: Resume capability and comprehensive error handling
  • Observability: Dashboards providing actionable insights
  • Quality Assurance: Multi-phase assessment ensuring consistent outputs

The combination of a robust pipeline and monitoring dashboards enables confident deployment of LLM-based synthetic data generation at scale.

Interested in similar solutions?

Let's discuss how AI evaluation and monitoring systems can enhance your product development.

Get in touch