Project · Case Study

SurveyForge

An AI tool that replaced a specialist-gated 3-week workflow with a 3-day one.

2025 · Workplace build · Brand research team · Shipped internally · Presented firm-wide
§ 01 — Context

A specialist-gated bottleneck, not a capacity problem.

At the AI-first brand research team where I was seconded, survey design was a three-week, specialist-gated process. Every client delivery waited for the survey team to translate a brief into a methodologically sound questionnaire. The problem wasn't capacity — it was expertise. Designing a good survey requires methodology knowledge most people don't have: question order effects, Likert scale construction, loaded-question avoidance, branching logic, attention-check placement.

The obvious AI opportunity was to pave that path — let non-experts produce specialist-quality surveys. The obvious wrong answer was equally visible: an LLM chat UI that "helps you write survey questions" would produce the wrong outputs at scale, because an LLM won't consistently apply the rules a good survey follows when the user doesn't know what to ask for.

§ 02 — The product decision

Constrain the input, not just the output.

I chose to constrain the intake. SurveyForge guides a user through structured questions — what are you measuring, who are you asking, what decision does the answer need to support — and uses those answers to produce a properly specified survey within methodology guardrails. The UI is a wizard, not a chat.

Open-chat LLM tools are more flexible but fail silently when the user doesn't know what to ask for. A wizard is less sexy but produces a reliable output every time. For a tool handed to non-specialists, reliability beats flexibility.

That's the tradeoff worth naming explicitly. It's also the tradeoff I'd make again. Consumer AI products can absorb the cost of open chat because the user can iterate; an internal tool that produces a deliverable the team will then work with for weeks cannot.

§ 03 — How it works

Three layers: intake, methodology, generation.

The tool runs in three internal stages:

  1. Structured intake. Eight to ten questions about the research objective, audience, decision context, and target analysis plan. Required fields are required — you cannot proceed without declaring what you're measuring. This is where non-specialists get scaffolding a specialist would provide in person.
  2. Methodology guardrails. A constraint layer that encodes survey design rules — question order effects, Likert construction, balanced response scales, avoiding double-barrelled items, matrix-vs-single-select decisions, attention-check placement. These rules live outside the LLM prompt and get enforced on the output.
  3. LLM generation. Only at this point does a language model produce the actual questionnaire, given the intake and constrained by the methodology layer. The prompt includes the research context and the specific rules the output must satisfy.
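A minimal sketch of how the three stages could compose, assuming Python; every name here (`ResearchBrief`, `generate_survey`, `check_rules`) is hypothetical and illustrative, not the tool's actual code:

```python
from dataclasses import dataclass

@dataclass
class ResearchBrief:
    # Stage 1: structured intake. Every field is required; there is
    # no free-form chat box to fall back on.
    objective: str          # what are you measuring
    audience: str           # who are you asking
    decision_context: str   # what decision the answer must support
    analysis_plan: str      # how the results will be analysed

def generate_survey(brief: ResearchBrief) -> list[str]:
    # Stage 3 stand-in: in the real tool an LLM drafts the questionnaire,
    # prompted with the intake answers and the rules it must satisfy.
    return [f"How important is {brief.objective} to you?"]

def check_rules(questions: list[str]) -> list[str]:
    # Stage 2 stand-in: methodology rules enforced as code on the output.
    return []  # a list of flagged rule violations

def build_survey(brief: ResearchBrief) -> dict:
    # Intake gate: you cannot proceed without declaring what you measure.
    for name, value in vars(brief).items():
        if not value.strip():
            raise ValueError(f"intake field '{name}' is required")
    draft = generate_survey(brief)
    return {"draft": draft, "flags": check_rules(draft)}
```

The point of the shape is the ordering: generation happens last, and only on a complete, validated brief.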

After generation, the output is scored against the methodology rules before the user sees it. Failures get flagged. A user can override scores — but the tool makes them do so deliberately, with a note recording why. That's the honesty loop that makes the tool trustworthy to the specialist team: anything non-specialist-produced is auditable.
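The override mechanism above amounts to a small audit record: a waiver is accepted only with a reason attached. A sketch under assumed names (`Override` and `override_flag` are illustrative, not the tool's API):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Override:
    # Audit record: who waived which methodology flag, and why.
    rule_id: str
    user: str
    reason: str
    timestamp: str

def override_flag(rule_id: str, user: str, reason: str) -> Override:
    # Non-specialists may override a failing score, but only deliberately:
    # an empty reason is rejected, so every waiver stays auditable.
    if not reason.strip():
        raise ValueError("an override must record why the rule was waived")
    return Override(rule_id, user, reason,
                    datetime.now(timezone.utc).isoformat())
```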

The key technical decision: keeping the methodology layer outside the LLM prompt, not encoded as prose in the system message. Rules as code (a validator) are enforceable; rules as prose instructions (a prompt) drift when the model is creative.
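To illustrate the rules-as-code point, two of the rules named above, Likert balance and double-barrelled items, could be encoded as a deterministic validator. The heuristics here are deliberate simplifications, not the tool's actual checks:

```python
import re

def check_likert_balance(scale: list[str]) -> list[str]:
    # Balanced-scale rule: an odd number of points gives a true midpoint.
    # (Real balance checks would also compare anchor wording symmetry.)
    if len(scale) % 2 == 0:
        return ["even number of scale points: no neutral midpoint"]
    return []

def check_double_barrelled(question: str) -> list[str]:
    # Double-barrelled rule: one item should probe one construct.
    # Crude heuristic: a bare conjunction inside the question text.
    if re.search(r"\b(and|or)\b", question, re.IGNORECASE):
        return ["possible double-barrelled item"]
    return []

def validate(questions: list[str], scale: list[str]) -> dict:
    # A deterministic validator is enforceable on every output; the same
    # rules written as prose in a system prompt drift when the model
    # gets creative.
    report = {"scale": check_likert_balance(scale), "items": {}}
    for q in questions:
        flags = check_double_barrelled(q)
        if flags:
            report["items"][q] = flags
    return report
```

Because the checks run on the generated output rather than inside the prompt, a rule violation is a hard flag, not a suggestion the model may or may not have followed.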

§ 04 — Outcome

From three weeks to three days — and an adopted tool.

86% · Reduction in specialist time per survey

Piloted on three live client projects. Adopted by the internal survey team after the pilot. Cycle time fell from three weeks to three days — an 86% reduction in specialist time per survey, multiplicative across the team's workload. Presented firm-wide as the company's first AI transformation initiative.

3 wks → 3 d · Cycle time
3 projects · Pilot scope
Adopted · Survey team

§ 05 — What I learned

The hardest part isn't the AI.

The product-level lesson worth keeping: the hardest part of an AI product isn't the AI. It's deciding which rules the AI isn't allowed to break.

If I'd started with "build the LLM pipeline," I'd have shipped an unreliable tool that produced plausible-looking but methodologically wrong surveys — the worst failure mode, because the user can't tell. Starting with "what rules does a good survey follow" made the build tractable, the output defensible, and the adoption possible. The domain expertise was the product; the LLM was the execution layer.

The related lesson, which I'd now apply to any AI-product decision: if you can't write the quality rules down as code, you shouldn't be building the tool yet. If they're only in your head, you don't understand the problem well enough to constrain it.