How we cut exam creation from days to 30 minutes using AI
The approach, what broke, and what I would do differently when building a multi-provider AI pipeline in production.
The problem was not “use AI.” The problem was that manual exam creation took multiple days per test and the content team could not scale output without burning time on repetitive formatting, validation, and rewriting work.
Table of contents
- Where the process actually broke
- The system we built
- Reliability mattered more than novelty
- What broke
- What I would do differently
Where the process actually broke
The raw pain point was obvious once I looked at the workflow end to end:
- choose the exam structure
- draft questions manually
- normalize difficulty and topic coverage
- review formatting inconsistencies
- rebuild the final paper into something the platform could ingest
Every handoff introduced delay. Even when the content quality was strong, the pipeline around it was slow.
The system we built
I built a multi-provider AI pipeline that treated model calls as one step in a larger production workflow, not the workflow itself. The system coordinated:
- prompt templates scoped to section and difficulty
- provider fallback across OpenAI, Anthropic, and Groq
- validation layers before content was accepted
- structured outputs that mapped cleanly to the platform schema
- human review where quality still needed a tighter loop
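The post does not show the actual template code, but the "prompt templates scoped to section and difficulty" idea can be sketched roughly like this (names and wording are illustrative, not the production prompts):

```typescript
// Hypothetical sketch: prompts are built per section and difficulty level,
// rather than asking one mega-prompt to produce the whole exam at once.
type Difficulty = "easy" | "medium" | "hard";

function buildSectionPrompt(
  section: string,
  difficulty: Difficulty,
  count: number
): string {
  return [
    `Generate ${count} ${difficulty} multiple-choice questions for the "${section}" section.`,
    "Return strictly valid JSON matching the provided schema.",
    "Do not include any text outside the JSON.",
  ].join("\n");
}
```

Scoping prompts this way keeps each request small and comparable, which makes difficulty and topic coverage much easier to normalize across runs.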
The most important design decision was keeping provider-specific logic at the edges. That let me switch models without rewriting the entire generation pipeline.
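One way to picture "provider-specific logic at the edges" (a sketch under my own naming, not the post's actual code) is a thin adapter interface that the core pipeline depends on, with each vendor SDK wrapped behind it:

```typescript
// Hypothetical adapter shape: the core pipeline only ever sees this
// interface; OpenAI, Anthropic, and Groq specifics live inside each adapter.
interface ProviderAdapter {
  name: string;
  generate(prompt: string): Promise<string>;
}

// Illustrative edge adapter; a real one would call the vendor SDK here.
function makeMockAdapter(name: string, reply: string): ProviderAdapter {
  return {
    name,
    async generate(prompt: string) {
      return `[${name}] ${reply} (prompt length: ${prompt.length})`;
    },
  };
}
```

With that boundary in place, swapping a model means writing one new adapter, not touching the generation pipeline.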
Reliability mattered more than novelty
The temptation with AI systems is to optimise for the most impressive demo. In production, the real win is consistent output under constraints. A boring but dependable pipeline beats a spectacular one that breaks under load or drifts in quality.
Fallbacks were mandatory
Providers fail differently. Some time out, some over-explain, some quietly ignore structure. Having multiple providers only helps if the rest of the pipeline is designed to tolerate that variability.
```typescript
export async function generateWithFallback(tasks: Array<() => Promise<string>>) {
  // Try each provider-backed task in priority order; move on if one fails.
  for (const task of tasks) {
    try {
      return await task();
    } catch {
      continue;
    }
  }
  throw new Error("All providers failed");
}
```

That pattern by itself is simple. The harder part is attaching evaluation, schema validation, retries, and operator visibility around it.
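As one illustration of that harder part (my own sketch, not the production code), a validation gate with bounded retries might wrap each generation task like this:

```typescript
// Hypothetical sketch: retry a generation task a bounded number of times,
// accepting output only when it passes a validator. Validators that are
// too loose are exactly what let retries amplify latency in practice.
async function generateValidated(
  task: () => Promise<string>,
  validate: (output: string) => boolean,
  maxAttempts = 2
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const output = await task();
      if (validate(output)) return output;
      lastError = new Error(`attempt ${attempt}: output failed validation`);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

The key design choice is that `maxAttempts` is explicit: every retry has a budget, so a flaky provider cannot silently turn one request into ten.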
What broke
A few things failed in predictable ways:
- prompts that worked on one provider degraded badly on another
- structured output assumptions collapsed when question complexity increased
- retries amplified latency if validation rules were too loose
- manual review still got blocked if generated content was hard to scan quickly
The lesson was that model quality was only one variable. Presentation, validation, and failure handling mattered just as much.
What I would do differently
If I rebuilt the pipeline now, I would push even harder on:
- explicit schema guarantees at every stage
- narrower prompts with clearer section ownership
- evaluation datasets earlier in the process
- a stronger operator dashboard for failure inspection
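The "explicit schema guarantees at every stage" point can be made concrete with a parse-don't-validate gate at each stage boundary. The schema below is illustrative only; the real platform schema is not shown in the post:

```typescript
// Hypothetical exam-question schema and a strict parser for it. Anything
// the model produces must pass this gate before the next stage sees it.
interface ExamQuestion {
  id: string;
  difficulty: "easy" | "medium" | "hard";
  prompt: string;
  options: string[];
  answerIndex: number;
}

function parseExamQuestion(raw: unknown): ExamQuestion {
  const q = raw as Partial<ExamQuestion>;
  if (
    typeof q?.id !== "string" ||
    !["easy", "medium", "hard"].includes(q.difficulty as string) ||
    typeof q.prompt !== "string" ||
    !Array.isArray(q.options) ||
    !q.options.every((o) => typeof o === "string") ||
    typeof q.answerIndex !== "number" ||
    q.answerIndex < 0 ||
    q.answerIndex >= q.options.length
  ) {
    throw new Error("Model output does not match the exam question schema");
  }
  return q as ExamQuestion;
}
```

Failing loudly at the boundary is the point: a question that does not parse never reaches formatting, review, or the platform ingest step.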
The headline number is what people remember: exam creation dropped from multiple days to under 30 minutes. But the real engineering value came from turning a fragile manual process into a repeatable system the team could trust.
Written by
Himanshu Agarwal
Founding engineer based in Bengaluru, building product systems that connect web, mobile, backend, and AI without handoff gaps.