How we cut exam creation from days to 30 minutes using AI
The approach, what broke, and what I would do differently when building a multi-provider AI pipeline in production.
The problem was not “use AI.” The problem was that manual exam creation took multiple days per test and the content team could not scale output without burning time on repetitive formatting, validation, and rewriting work.
Table of contents
- Where the process actually broke
- The system we built
- Reliability mattered more than novelty
- What broke
- What I would do differently
Where the process actually broke
The raw pain point was obvious once I looked at the workflow end to end:
- choose the exam structure
- draft questions manually
- normalize difficulty and topic coverage
- review formatting inconsistencies
- rebuild the final paper into something the platform could ingest
Every handoff introduced delay. Even when the content quality was strong, the pipeline around it was slow.
The system we built
I built a multi-provider AI pipeline that treated model calls as one step in a larger production workflow, not the workflow itself. The system coordinated:
- prompt templates scoped to section and difficulty
- provider fallback across OpenAI, Anthropic, and Groq
- validation layers before content was accepted
- structured outputs that mapped cleanly to the platform schema
- human review where quality still needed a tighter loop
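The post does not show the actual template code, but the "prompt templates scoped to section and difficulty" idea can be sketched roughly like this (names and wording are illustrative, not the production prompts):

```typescript
// Hypothetical sketch: prompts are built per section and difficulty level,
// rather than asking one mega-prompt to produce the whole exam at once.
type Difficulty = "easy" | "medium" | "hard";

function buildSectionPrompt(
  section: string,
  difficulty: Difficulty,
  count: number
): string {
  return [
    `Generate ${count} ${difficulty} multiple-choice questions for the "${section}" section.`,
    "Return strictly valid JSON matching the provided schema.",
    "Do not include any text outside the JSON.",
  ].join("\n");
}
```

Scoping prompts this way keeps each request small and comparable, which makes difficulty and topic coverage much easier to normalize across runs.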
The most important design decision was keeping provider-specific logic at the edges. That let me switch models without rewriting the entire generation pipeline.
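One way to picture "provider-specific logic at the edges" (a sketch under my own naming, not the post's actual code) is a thin adapter interface that the core pipeline depends on, with each vendor SDK wrapped behind it:

```typescript
// Hypothetical adapter shape: the core pipeline only ever sees this
// interface; OpenAI, Anthropic, and Groq specifics live inside each adapter.
interface ProviderAdapter {
  name: string;
  generate(prompt: string): Promise<string>;
}

// Illustrative edge adapter; a real one would call the vendor SDK here.
function makeMockAdapter(name: string, reply: string): ProviderAdapter {
  return {
    name,
    async generate(prompt: string) {
      return `[${name}] ${reply} (prompt length: ${prompt.length})`;
    },
  };
}
```

With that boundary in place, swapping a model means writing one new adapter, not touching the generation pipeline.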
Reliability mattered more than novelty
The temptation with AI systems is to optimise for the most impressive demo. In production, the real win is consistent output under constraints. A boring but dependable pipeline beats a spectacular one that breaks under load or drifts in quality.
Fallbacks were mandatory
Providers fail differently. Some time out, some over-explain, some quietly ignore structure. Having multiple providers only helps if the rest of the pipeline is designed to tolerate that variability.
```typescript
export async function generateWithFallback(tasks: Array<() => Promise<string>>) {
  // Try each provider-backed task in priority order; move on if one fails.
  for (const task of tasks) {
    try {
      return await task();
    } catch {
      continue;
    }
  }
  throw new Error("All providers failed");
}
```

That pattern by itself is simple. The harder part is attaching evaluation, schema validation, retries, and operator visibility around it.
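As one illustration of that harder part (my own sketch, not the production code), a validation gate with bounded retries might wrap each generation task like this:

```typescript
// Hypothetical sketch: retry a generation task a bounded number of times,
// accepting output only when it passes a validator. Validators that are
// too loose are exactly what let retries amplify latency in practice.
async function generateValidated(
  task: () => Promise<string>,
  validate: (output: string) => boolean,
  maxAttempts = 2
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const output = await task();
      if (validate(output)) return output;
      lastError = new Error(`attempt ${attempt}: output failed validation`);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

The key design choice is that `maxAttempts` is explicit: every retry has a budget, so a flaky provider cannot silently turn one request into ten.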
What broke
A few things failed in predictable ways:
- prompts that worked on one provider degraded badly on another
- structured output assumptions collapsed when question complexity increased
- retries amplified latency if validation rules were too loose
- manual review still got blocked if generated content was hard to scan quickly
The lesson was that model quality was only one variable. Presentation, validation, and failure handling mattered just as much.
What I would do differently
If I rebuilt the pipeline now, I would push even harder on:
- explicit schema guarantees at every stage
- narrower prompts with clearer section ownership
- evaluation datasets earlier in the process
- a stronger operator dashboard for failure inspection
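The "explicit schema guarantees at every stage" point can be made concrete with a parse-don't-validate gate at each stage boundary. The schema below is illustrative only; the real platform schema is not shown in the post:

```typescript
// Hypothetical exam-question schema and a strict parser for it. Anything
// the model produces must pass this gate before the next stage sees it.
interface ExamQuestion {
  id: string;
  difficulty: "easy" | "medium" | "hard";
  prompt: string;
  options: string[];
  answerIndex: number;
}

function parseExamQuestion(raw: unknown): ExamQuestion {
  const q = raw as Partial<ExamQuestion>;
  if (
    typeof q?.id !== "string" ||
    !["easy", "medium", "hard"].includes(q.difficulty as string) ||
    typeof q.prompt !== "string" ||
    !Array.isArray(q.options) ||
    !q.options.every((o) => typeof o === "string") ||
    typeof q.answerIndex !== "number" ||
    q.answerIndex < 0 ||
    q.answerIndex >= q.options.length
  ) {
    throw new Error("Model output does not match the exam question schema");
  }
  return q as ExamQuestion;
}
```

Failing loudly at the boundary is the point: a question that does not parse never reaches formatting, review, or the platform ingest step.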
The headline number is what people remember: exam creation dropped from multiple days to under 30 minutes. But the real engineering value came from turning a fragile manual process into a repeatable system the team could trust.
Written by
Himanshu Agarwal
Founding engineer based in Bengaluru, building product systems that connect web, mobile, backend, and AI without handoff gaps.