Document-generation AI — regulated life-sciences workflow
AI inference cost & latency engineering
Cutting AI application cost ~70% while holding output quality
A document-generation product ran every long, multi-section generation on a premium frontier model — expensive, and slow at ~39 seconds per document. We treated both how we call the model and which model we call as engineering choices: parallel section generation, model right-sizing, a quality-gated fallback, and response caching.
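The four techniques named above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the product's actual code: the model names, the 0.85 threshold, the in-memory cache, and the `call_model` and `score` callables are all hypothetical stand-ins for whatever client and quality check the team used.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable

# Hypothetical signature: (model_name, prompt) -> generated text.
ModelCall = Callable[[str, str], Awaitable[str]]
# Hypothetical quality gate: (prompt, text) -> score in [0, 1].
QualityGate = Callable[[str, str], float]

CHEAP_MODEL = "small-model"      # placeholder names, not real model IDs
PREMIUM_MODEL = "premium-model"
QUALITY_THRESHOLD = 0.85         # assumed gate value

# Response cache keyed by a hash of the section prompt (in-memory for the sketch).
_cache: dict[str, str] = {}


def _cache_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


async def generate_section(prompt: str, call_model: ModelCall,
                           score: QualityGate) -> str:
    """Cache lookup, then the right-sized model, then a quality-gated fallback."""
    key = _cache_key(prompt)
    if key in _cache:
        return _cache[key]
    text = await call_model(CHEAP_MODEL, prompt)
    if score(prompt, text) < QUALITY_THRESHOLD:
        # The cheaper output failed the gate: regenerate on the premium model.
        text = await call_model(PREMIUM_MODEL, prompt)
    _cache[key] = text
    return text


async def generate_document(section_prompts: list[str], call_model: ModelCall,
                            score: QualityGate) -> list[str]:
    """Generate all sections concurrently; results keep the input order."""
    tasks = [generate_section(p, call_model, score) for p in section_prompts]
    return list(await asyncio.gather(*tasks))
```

With sections generated concurrently, wall-clock time is bounded by the slowest section rather than the sum of all sections, and the premium model is billed only for sections that fail the gate. Identical prompts issued in the same batch would both miss the cache in this sketch; a production version would likely deduplicate in-flight requests.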
The result: ~70% lower model cost and ~64% lower generation latency (39.1 s → 14.2 s, roughly 2.75× faster), with output quality held at 0.85–1.0 parity against the original premium model, as verified by an independent judge rather than by the team that built it.
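The latency figure can be checked directly from the two timings quoted above. The helper names below are illustrative only; the inputs are the 39.1 s and 14.2 s measurements from the text.

```python
def latency_reduction(before_s: float, after_s: float) -> float:
    """Fraction of the original latency that was removed."""
    return (before_s - after_s) / before_s


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the new path is."""
    return before_s / after_s


print(f"{latency_reduction(39.1, 14.2):.1%}")  # 63.7%
print(f"{speedup(39.1, 14.2):.2f}x")           # 2.75x
```

The two numbers describe the same change: removing about 64% of the latency is equivalent to running about 2.75 times faster.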
The most expensive line in the app was an untested assumption: that the frontier model was required. Measuring it was cheaper than paying for it, and the measurement is what let us prove quality against an independent judge instead of asserting it.

