harryliu.space / projects / synthetic-data-engine
Project · Case Study

Synthetic Data Engine

Replaced panel recruitment for brand research — faster, cheaper, and still statistically defensible.

2024–2025 · Workplace build · Brand research team · Anchored an enterprise commercial engagement
§ 01 — Context

Brand research lives or dies on panel recruitment.

Brand research runs on panel recruitment — recruiting real humans to answer survey questions. It's slow (weeks per study), expensive (dollars per completed response), and hard to scale into new markets and demographic slices. For ongoing brand tracking studies — the bread-and-butter of the industry — the recruitment bottleneck is often the entire reason a study costs what it costs and takes as long as it takes.

The hypothesis I inherited: synthetic survey respondents, generated by LLMs, could replicate enough of the statistical signal of real respondents to replace panels for certain research use cases. If the hypothesis worked, it changed the unit economics of brand research — and with it, what the company could do commercially.

The commercial case was obvious. The technical problem was harder.

§ 02 — The product decision

The validation layer is the product.

An LLM generating "synthetic survey answers" is trivial. A pipeline that generates synthetic data that matches the distributions of real respondents on the axes that matter is not. The failure mode is dangerous: outputs that look plausible but fail to match the real population's distributions on key variables (age × region × brand preference, say) are worse than nothing — they are wrong data that looks right.

The biggest call I made on this project was a product call, not a technical one:

Any competitor can build a pipeline that generates synthetic survey answers. The moat is the methodology for proving the synthetic data matches real data on the axes the research depends on.

That reframing — from "build the generator" to "build the validator, with a generator attached" — shaped every architecture decision downstream. It also shaped how I thought about the product's commercial story: we weren't selling "AI that writes fake survey answers cheaply." We were selling statistically validated synthetic data for specific research questions where we could demonstrate fidelity to real panels. That's a narrower claim, but a defensible one, and defensibility is what closes enterprise deals.

The scoping decision that followed: don't try to replace panels for every kind of research. Some questions — new-concept testing, latent-attitude research, cultural tracking — still need real humans. The synthetic pipeline was designed for the research questions where the signal-to-noise gap between methods is small. I led an 8-supplier competitive analysis to map where competitors overclaimed and where the honest use cases were — the scoping came out of that work.

§ 03 — How it works

Three layers. The third is where the moat lives.

Layer 1 — Persona generation
Upstream
Given a target population defined by demographic priors (age × region × demographic slice × relevant category behaviours), agents generate respondent personas with distributional fidelity to the reference data. Not one monolithic persona-generating prompt — a pipeline that samples from priors, generates, and validates that the persona distribution matches expected marginals before any respondent answers a question.
Technical: multi-agent LLM orchestration · demographic-prior sampling · marginal-distribution validation against reference panel data
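The sample-then-validate loop for Layer 1 can be sketched as follows. The attribute names, prior weights, and tolerance are illustrative, not the project's real reference data; the real pipeline derives its marginals from owned panel data and samples joint cells rather than independent attributes.

```python
import random
from collections import Counter

# Hypothetical demographic priors for a target population (illustrative
# numbers; a real pipeline reads these from reference panel data).
PRIORS = {
    "age_band": {"18-34": 0.35, "35-54": 0.40, "55+": 0.25},
    "region":   {"north": 0.30, "south": 0.45, "west": 0.25},
}

def sample_persona(rng: random.Random) -> dict:
    """Draw one persona by sampling each attribute from its prior."""
    return {attr: rng.choices(list(dist), weights=list(dist.values()))[0]
            for attr, dist in PRIORS.items()}

def marginals_match(personas: list[dict], tolerance: float = 0.05) -> bool:
    """Gate: each attribute's observed marginal must sit within
    `tolerance` of its prior before any respondent answers a question."""
    n = len(personas)
    for attr, dist in PRIORS.items():
        counts = Counter(p[attr] for p in personas)
        if any(abs(counts[level] / n - share) > tolerance
               for level, share in dist.items()):
            return False
    return True

rng = random.Random(42)
batch = [sample_persona(rng) for _ in range(5000)]
```

This shows only the validate-before-answering gate; the production system also validates joint distributions (age × region), not just the marginals.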
Layer 2 — Response generation
Middle
Given a persona and a questionnaire, a response-generation agent produces answers consistent with the persona's demographics, stated preferences, and the survey's internal logic (scale items correlate appropriately, attention-check questions pass, skip logic is respected). The prompts include explicit consistency rules the output must satisfy, and include follow-up validation against those rules.
Technical: structured output generation · consistency constraints · attention-check compliance · skip-logic adherence
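A minimal sketch of the post-generation consistency check: the agent's structured output is re-validated against the questionnaire's explicit rules before the respondent is accepted. The field names (`attention_check`, `uses_brand`, the `trust_item_*` scale items) are hypothetical.

```python
def validate_response(resp: dict) -> list[str]:
    """Return a list of rule violations; an empty list means accepted."""
    violations = []

    # Attention check: the questionnaire instructs "select 'agree' here".
    if resp.get("attention_check") != "agree":
        violations.append("failed attention check")

    # Skip logic: brand-satisfaction items only exist for brand users.
    if resp.get("uses_brand") == "no" and "brand_satisfaction" in resp:
        violations.append("skip logic: non-user answered brand_satisfaction")

    # Scale consistency: two positively-keyed items on the same construct
    # should not sit at opposite extremes of a 1-5 scale.
    a, b = resp.get("trust_item_1"), resp.get("trust_item_2")
    if a is not None and b is not None and abs(a - b) >= 4:
        violations.append("inconsistent scale items")

    return violations

good = {"attention_check": "agree", "uses_brand": "yes",
        "brand_satisfaction": 4, "trust_item_1": 4, "trust_item_2": 5}
bad = {"attention_check": "agree", "uses_brand": "no",
       "brand_satisfaction": 4, "trust_item_1": 1, "trust_item_2": 5}
```

In the real pipeline the same rules also appear inside the generation prompt, so validation is a backstop rather than the only line of defence.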
Layer 3 — Statistical validation
The moat
Every synthetic batch is compared against reference real-panel distributions using distribution-comparison tests (Kolmogorov-Smirnov and chi-square on marginal distributions), joint-distribution checks on the pairs of variables the research depends on, and aggregate quality scoring that rolls the statistical fit into a single decision signal. The validation layer is what lets us say, quantitatively, "for this research question, synthetic data is within X% of real on the relevant axes." Without it, the pipeline is a clever chatbot.
Technical: K-S test · chi-square · joint-distribution analysis · quality scoring · regression against reference panels
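The marginal-distribution tests can be sketched in pure Python. A production pipeline would use `scipy.stats` (`ks_2samp`, `chisquare`) for exact p-values; the answer data below is invented for illustration.

```python
def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    ecdf = lambda s, x: sum(v <= x for v in s) / len(s)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

def chi_square_statistic(observed: dict, expected_share: dict,
                         n: int) -> float:
    """Chi-square of observed category counts against expected shares."""
    return sum((observed.get(k, 0) - p * n) ** 2 / (p * n)
               for k, p in expected_share.items())

# One 1-5 scale item: reference panel vs a synthetic batch (illustrative).
reference = [1] * 5 + [2] * 15 + [3] * 30 + [4] * 35 + [5] * 15
synthetic = [1] * 6 + [2] * 14 + [3] * 29 + [4] * 36 + [5] * 15

d = ks_statistic(reference, synthetic)  # small D: distributions agree
```

The aggregate quality score rolls statistics like `d` across every validated axis into the single decision signal described above.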

A few decisions worth naming beyond the architecture:

  • Reference data is not optional. Every batch is validated against a real reference panel the company owns. The pipeline is not "synthetic data out of nothing" — it's "synthetic data calibrated against the real distributions we already have, and flagged when it drifts from them." The reference data is the anchor that lets us make quantitative fidelity claims.
  • Quality scoring is analyst-facing. The fidelity score is surfaced to research analysts as a confidence signal on every synthetic batch. A batch with a low score doesn't ship. That's how statistical validation became the product experience, not a back-office check.
  • Production deployment on AWS. This isn't "run once on a laptop" — it's an ongoing production system that generates data for live brand tracking studies. The engineering work to get from notebook-prototype to production-pipeline was real, and it's where cloud infrastructure experience paid off.
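The ship/no-ship gate the bullets describe might look like this; the axis names, weights, and 0.9 threshold are invented for illustration.

```python
# Hypothetical gate rolling per-axis fidelity into one decision signal.

def quality_score(axis_fit: dict) -> float:
    """axis_fit maps each validated axis to a 0-1 fit (1.0 = perfect
    match to the reference panel). The worst axis dominates: one
    badly-drifted axis should sink the whole batch."""
    return min(axis_fit.values())

def ship(axis_fit: dict, threshold: float = 0.9) -> bool:
    """A batch with a low score doesn't ship."""
    return quality_score(axis_fit) >= threshold

clean   = {"age": 0.98, "region": 0.95, "brand_pref": 0.93}
drifted = {"age": 0.98, "region": 0.95, "brand_pref": 0.62}
```

Taking the minimum rather than a weighted average is a deliberately conservative choice: a batch that is perfect on two axes and badly drifted on a third is still wrong data that looks right.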

The honest scoping. The pipeline does not replace panels for all research. It replaces panels for research questions where (a) the reference distributions are known, (b) the statistical tests we run can detect meaningful drift, and (c) the cost-benefit of synthetic-over-real is favourable at the fidelity we can demonstrate. That narrow, honest scope is what makes it commercially defensible.

§ 04 — Outcome

Faster, cheaper — and provably defensible.

100%
of strong-significance drivers (p<0.01) reproduced from real panel data

The headline number: 100% of drivers significant at p<0.01 in human panel data were reproduced in the synthetic output. In multivariate analysis at the standard p<0.05 threshold, the pipeline reproduced 65–75% of significant drivers.

In plain decision terms: any research conclusion that would have been confident on human panel data stayed confident on synthetic. Borderline multivariate calls were recovered most of the time, but not always — enough fidelity for the research questions we scoped around, not enough for the ones we didn't. The product never promised more than that.

For those questions, that level of fidelity was enough to replace panel recruitment on cost, speed, and defensibility — without losing the signal the research was paid to find.

That quantitative case is what made the commercial story real. Anchored an enterprise commercial engagement as the technical lead — translated the methodology into cost, turnaround, and pricing implications for the buyer. Positioned the approach against an 8-supplier competitive landscape in a client-facing brief I led.

The product never needed to claim it replaced real humans for every research question. It needed to credibly replace panels for the specific research questions where we could demonstrate fidelity. Narrow claim, measured evidence, commercial traction.

§ 05 — What I learned

For AI in high-stakes domains, evaluation is the product.

The deepest product lesson of my career came from this project — and it ties to the same design principle that shows up everywhere else in my work: simple on the surface, rigorous underneath. To the research analyst using it, the synthetic data looks like a normal panel export. Underneath, the whole pipeline has to pass statistical tests before anything ships.

For AI products in high-stakes domains, the evaluation layer is the product. Anyone can generate outputs. Very few can prove their outputs match reality on the axes that matter — and for buyers who make real decisions on the outputs, proof of fidelity is what they're actually paying for. If you can't show the fidelity, you can't charge for it. If you can, you have a moat that generation alone will never give you.

The LLM is the commodity. Knowing whether it's bluffing is the moat.

Generation is commoditised. Defensible measurement of fidelity is not.

The second lesson was about translation. Commercial work on a technical product lives or dies on the translation to buyer language. Enterprise buyers do not want to know how the pipeline works — they want to know what it costs, what it replaces, what the confidence interval looks like, and what breaks when it breaks. Building the pipeline was the easier half of the work. Making it commercially legible was the harder half, and the one I'd want to spend more time on in the next product I build.