Case Study - Reaching frontier-model quality with a local model
Can a self-hosted open model replace a frontier cloud model for biosimilar and monoclonal-antibody CMC proposals — at parity quality, with full data privacy and fixed cost? Retrieval engineering and prompt/enrichment got a local OLMo-3.1-32B model to 7:6 parity against Claude Opus 4.8 with zero fine-tuning; a QLoRA fine-tune then closed the last gap, reaching parity even on the regulatory-judgment workflows RAG couldn’t teach — all confirmed by a strict multi-judge panel.
- Client: Synthesis API (biologics CDMO proposals)
- Year
- Service: Local model engineering & evaluation

The problem
The proposal platform generates client-facing “Gene-to-GMP” CMC proposals across monoclonal antibodies (mAb), bispecific antibodies (bsAb), and biosimilars. Generation ran on Claude Opus 4.8 in the cloud. Two pressures pushed us toward a local model.
- Privacy — Proposals carry sensitive CDMO pricing and scope. Multi-tenant customers (Aragen, Syngene, and others) want on-prem isolation, not “send our pricing to a cloud LLM.”
- Cost at scale — Per-token cloud cost scales with volume; a self-hosted model is a fixed GPU cost.
The risk was the widely held assumption that open models are materially worse than frontier models. The job was to find out how much worse — and whether the gap is closable.
The hypothesis
The task is constrained, not open-ended: follow a fixed proposal structure, cite specific regulations, pull activities from a catalogue, and let a deterministic engine handle pricing. We hypothesised the frontier-model edge would be small for this shape of work, and that RAG plus validators could close most of it — because the hard part (knowledge and citations) is retrievable, not innate.
What we built
We built a decomposed, grounded generation pipeline — deliberately not a single giant LLM call.
- Regulatory RAG corpus: 49 official guideline documents (ICH Q-series, 21 CFR, FDA/EMA/WHO biosimilar guidance, EU GMP annexes, CTD M4Q), each with provenance and trust tier. Embedded with OpenAI text-embedding-3-large; retrieval is filterable by jurisdiction and status, so a draft guideline can never be the basis for marking something “Required.”
- Section-level generation: each of the 13 workflow work-packages is generated independently and concurrently, not as one 9,000-token blob.
- Deterministic validators: code (not the model) enforces structure (at least 2 client inputs and at least 2 deliverables), citation traceability (every cited regulation id must appear in the retrieved context), and trap-phrase avoidance, and it drives a repair loop that re-prompts on failure.
The design principle that proved decisive: move rules from prompt into code, and decompose the task into bounded pieces. A weaker model is made reliable by not asking it to hold everything at once.
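The validator-plus-repair idea can be sketched in a few lines. This is a minimal illustration, not the production code: the `Section` fields, the `TRAP_PHRASES` entry, the `generate` callable, and the retry limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical phrase list; the real trap phrases are not given in this write-up.
TRAP_PHRASES = ("guaranteed approval",)


@dataclass
class Section:
    text: str
    client_inputs: list[str] = field(default_factory=list)
    deliverables: list[str] = field(default_factory=list)
    cited_ids: list[str] = field(default_factory=list)


def validate(section: Section, retrieved_ids: set[str]) -> list[str]:
    """Return human-readable failures; an empty list means the section passes."""
    failures = []
    if len(section.client_inputs) < 2:
        failures.append("need at least 2 client inputs")
    if len(section.deliverables) < 2:
        failures.append("need at least 2 deliverables")
    for cited in section.cited_ids:
        if cited not in retrieved_ids:
            failures.append(f"citation {cited!r} not in retrieved context")
    lowered = section.text.lower()
    for phrase in TRAP_PHRASES:
        if phrase in lowered:
            failures.append(f"trap phrase {phrase!r} present")
    return failures


def generate_with_repair(
    generate: Callable[[list[str]], Section],
    retrieved_ids: set[str],
    max_attempts: int = 3,
) -> tuple[Section, list[str]]:
    """Call the model, validate in code, and re-prompt with the failures."""
    failures: list[str] = []
    for _ in range(max_attempts):
        section = generate(failures)  # failures from the last attempt feed the re-prompt
        failures = validate(section, retrieved_ids)
        if not failures:
            break
    return section, failures
```

The model only ever sees a list of concrete failures to fix; whether a section passes is decided entirely by code.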
The experiments
Experiment A — Closed-book recall (the wrong test). OLMo scored 0.40 on a 365-probe closed-book CMC knowledge eval (judged by GPT-5.5). It looked bad. The lesson: closed-book recall is the wrong metric for a RAG system — the model doesn’t need to memorise ICH Q5B, it needs to reason over the retrieved text.
Experiment B — Open-book, with RAG (the right test). The same model, given retrieved context, produced grounded, correctly-cited proposal sections. Across 11 molecule families × 13 workflows = 143 sections, the pipeline passed 143/143 with the repair loop. The lesson: RAG closed the knowledge gap completely.
Experiment C — Head-to-head vs Claude, identical pipeline. The real A/B: same RAG, same prompts, same validators, same repair loop — only the model was swapped. We generated all 13 sections with each model, judged by an independent panel.
| Round | Configuration | Result (OLMo : Claude) |
|---|---|---|
| v1 | Baseline RAG | 3 : 10 — Claude clearly better |
| v2 | + specificity prompt + few-shot + CMC-specificity reference doc in RAG + enrichment pass | 7 : 6 — OLMo ahead |
| v3 | Over-tightened citations | 3 : 10 — regressed (scope-creep) |
| v4 | Scope-locked enrichment (deepen, don’t broaden) | 7 : 6 — confirmed, more stable |
Robust verification: the final tally was confirmed by a 3-judge panel per section, position-swapped, majority vote (78 GPT-5 judgments). Both v2 and v4 held at 7:6, with v4 showing cleaner consensus (more unanimous votes). The single-pass v3 swing proved that one judge is noise — panels are essential.
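A position-swapped majority vote can be sketched as follows. The `Judge` callable and its "first" / "second" / "tie" answers are assumptions made for illustration; the actual judge prompts and answer format are not described here. With 3 judges and 2 orderings each section collects 6 judgments, which over 13 sections would be consistent with the 78 judgments reported.

```python
from collections import Counter
from typing import Callable

# Assumed interface: a judge reads two anonymised texts and answers
# "first", "second", or "tie".
Judge = Callable[[str, str], str]


def panel_verdict(olmo_text: str, claude_text: str, judges: list[Judge]) -> tuple[str, Counter]:
    """Majority vote over every judge and both presentation orders."""
    votes: Counter = Counter()
    for judge in judges:
        # Ask twice with the order swapped so position bias cancels out.
        for first, second, labels in (
            (olmo_text, claude_text, ("olmo", "claude")),
            (claude_text, olmo_text, ("claude", "olmo")),
        ):
            answer = judge(first, second)
            if answer == "first":
                votes[labels[0]] += 1
            elif answer == "second":
                votes[labels[1]] += 1
            else:
                votes["tie"] += 1
    if votes["olmo"] > votes["claude"]:
        winner = "olmo"
    elif votes["claude"] > votes["olmo"]:
        winner = "claude"
    else:
        winner = "tie"
    return winner, votes
```

A judge that always prefers whichever text it reads first contributes one vote to each side, so pure position bias nets out to a tie instead of deciding the section.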
Closing the gap
A deep gap-analysis of every Claude win revealed the pattern: OLMo knew the rules but stayed generic where Claude got specific. Claude named exact attributes (deamidation, C-terminal lysine, afucosylation), named techniques (icIEF, CE-SDS, peptide mapping), mapped deliverables to CTD subsections, and reasoned about reference-product comparison. OLMo said “PTMs” and “orthogonal methods.”
The fix was four cheap, training-free levers.
| Lever | What it did | Effect |
|---|---|---|
| 1. Specificity-demanding prompt | Required named CQAs, named techniques, CTD mapping, biosimilar RP-logic | Forced elaboration |
| 2. Few-shot exemplar | Showed the depth bar in-context | Model mimicked it |
| 3. CMC Specificity Reference in RAG | A structured doc listing named attributes / techniques / CTD per workflow, making specifics retrievable | Turned an innate-capability gap into a retrieval problem |
| 4. Scope-locked enrichment pass | Second model call: deepen existing points, do not broaden the topic | Added depth without scope-creep |
Net effect: 3:10 → 7:6. The cheap levers didn’t just close the gap — they reversed it.
Two precision fixes added rigor. First, validator over-strictness: the citation checker wrongly flagged CTD coordinates (3.2.S.x / 3.2.P.x) as “unsupported.” A deep per-citation analysis of 6 “failing” sections showed 5 of 6 were validator false-positives — legitimate dossier structure, not fabricated citations. Whitelisting CTD coordinates lifted validation from 7/13 → 12/13.
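The CTD whitelist might look something like the sketch below. The regular expression is an assumption that covers only the 3.2.S.x and 3.2.P.x forms named above, and `unsupported_citations` is a hypothetical helper name, not the project's actual checker.

```python
import re

# CTD Module 3 coordinates such as 3.2.S.4.1 (drug substance) or 3.2.P.8 (drug product).
# Assumption: only the 3.2.S.x / 3.2.P.x forms mentioned in the text are whitelisted.
CTD_COORDINATE = re.compile(r"^3\.2\.[SP](\.\d+)+$")


def unsupported_citations(cited: list[str], retrieved_ids: set[str]) -> list[str]:
    """Citations that are neither retrieved regulation ids nor CTD coordinates."""
    return [
        c for c in cited
        if c not in retrieved_ids and not CTD_COORDINATE.match(c)
    ]
```

Dossier coordinates are structural references rather than regulations, so they pass through, while any regulation id missing from the retrieved context is still flagged.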
Second, one genuine model error: OLMo cited “EU GMP Annex 2” for Qualified-Person batch certification, which is actually governed by Annex 16. The root cause was a retrieval gap (Annex 16 was missing from the corpus, and the generic query didn’t surface it), not a hallucination. Adding Annex 16 plus a QP-aware retrieval query fixed it: the section now cites Annex 16 correctly. Final validation: 13/13.
Performance on real hardware
We served OLMo-3.1-32B (BF16, ~64 GB of weights) on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) via vLLM.
| Metric | Result |
|---|---|
| Single-request decode | ~20 tok/s (BF16, memory-bandwidth bound — matches published 32B benchmarks) |
| Concurrent throughput (8 parallel) | ~172 tok/s total, same wall-time as 1 request |
| Concurrent throughput (24 parallel) | ~481 tok/s total, near-linear scaling |
| Full 13-section proposal (concurrent) | ~60 seconds |
The key insight: single-request speed is the wrong metric for a multi-tenant SaaS. The Blackwell GPU batches concurrent requests — 8 proposals finish in the time of 1. Decomposing each proposal into 13 concurrent sections means a full proposal lands in ~60s, and many tenants’ proposals run in parallel. Published benchmarks show this card sustaining ~930 tok/s at 50 concurrent users.
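The table is consistent with this reading: 172 / 8 ≈ 21.5 and 481 / 24 ≈ 20 tok/s per request, so each request keeps roughly its single-stream speed while the total scales. On the client side, the fan-out can be sketched with `asyncio`. Here `generate_section` is a stand-in that only sleeps; a real client would call the inference server at that point, and the `max_in_flight` limit is an illustrative assumption.

```python
import asyncio


async def generate_section(workflow: str, gate: asyncio.Semaphore) -> str:
    """Stand-in for one model call; a real client would call the server here."""
    async with gate:
        await asyncio.sleep(0.01)  # simulated decode time
        return f"[{workflow}] draft"


async def generate_proposal(workflows: list[str], max_in_flight: int = 13) -> list[str]:
    """Issue all section requests at once and let the server batch them."""
    gate = asyncio.Semaphore(max_in_flight)
    # gather() preserves input order, so sections come back in workflow order.
    return await asyncio.gather(*(generate_section(w, gate) for w in workflows))


workflows = [f"workflow-{i}" for i in range(1, 14)]
sections = asyncio.run(generate_proposal(workflows))
print(len(sections), "sections generated")
```

The semaphore is there so a deployment can cap in-flight requests per tenant without changing the calling code.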
Hardware lessons learned:
- BF16 32B (~64 GB) does not fit a 48 GB L40S — it needs quantization or a 96 GB card. The Blackwell fits it natively.
- Blackwell (sm_120) needed CUDA 12.9+ and FlashInfer disabled (an arch-detection bug); FlashAttention-2 worked. Bleeding-edge hardware has driver/library rough edges.
- Use a Deep Learning AMI (drivers pre-installed), not base Ubuntu.
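The sizing rule behind the first lesson is simple arithmetic, sketched here. It counts weights only; KV cache, activations, and any quantization overhead come on top, so the real headroom is smaller than these numbers suggest.

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (10^9 bytes); excludes KV cache and activations."""
    return params_billion * bytes_per_param


bf16 = weight_gb(32, 2)    # 2 bytes per parameter
int4 = weight_gb(32, 0.5)  # 4-bit weights, ignoring quantization metadata

for card, vram_gb in (("L40S", 48), ("RTX PRO 6000 Blackwell", 96)):
    print(f"{card}: BF16 weights fit = {bf16 < vram_gb}, 4-bit weights fit = {int4 < vram_gb}")
```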
Cost
| State | Cost |
|---|---|
| Blackwell instance running | ~$2.5–3.5 / hr |
| Stopped (model + weights persist) | ~$16 / month (EBS only) |
| Cloud frontier model | per-token, scales with volume |
For a high-volume, multi-tenant proposal product, the fixed-cost local model wins economically at scale — and is the only option that satisfies on-prem data privacy.
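To compare against per-token cloud pricing, the hourly rate can be converted into an amortised per-token figure. The sketch below uses only the numbers reported above (about $2.5 to $3.5 per hour and about 481 tok/s at 24 concurrent requests) and assumes the GPU is kept saturated; at lower utilisation the effective cost per token rises proportionally.

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    """Amortised cost of generated tokens when the GPU is kept saturated."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


low = usd_per_million_tokens(2.5, 481)
high = usd_per_million_tokens(3.5, 481)
print(f"${low:.2f} to ${high:.2f} per million generated tokens at full load")
```

The break-even volume against a cloud model then depends on that model's per-token price, which is not part of this write-up.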
Closing the last gap with fine-tuning
Reaching 7:6 on RAG alone was the right first milestone, and the panel pinpointed exactly where to invest next. The two most regulation-dense workflows, regulatory CMC and comparability / biosimilarity, still lost to Claude, because those sections need learned regulatory judgment (what to include, correct CTD mapping, scope discipline) rather than retrievable facts. That is the textbook case for fine-tuning, so we ran the experiment deliberately: a targeted QLoRA adapter to teach the judgment RAG can’t. It paid off: on a panel built from exactly those workflows, the local model went from winning no sections (0 wins, 1 tie, 5 losses) to parity with the frontier model (2 wins, 2 ties, 2 losses).
The data — best-of-N distillation. With no real CDMO proposals to train on, we synthesised the training set: for each (family × workflow × variant) task we generated a section with both Claude Opus 4.8 and GPT-5.5 through the same RAG pipeline, then kept the deterministic winner (validator pass plus completeness coverage). The result was 312 validator-clean examples across all 8 families × 13 workflows, with teacher wins splitting Claude 187 / GPT-5.5 125 — best-of-N genuinely raised the data ceiling above either single teacher. Facts stay in RAG; the fine-tune teaches behaviour, voice, and scope — not facts.
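The selection step can be sketched as below. The `Candidate` fields and the `coverage` score are hypothetical stand-ins for the validator result and the completeness measure described above; the tie-breaking rule is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    teacher: str
    text: str
    validator_passed: bool
    coverage: float  # hypothetical completeness score between 0.0 and 1.0


def pick_winner(candidates: list[Candidate]) -> Optional[Candidate]:
    """Keep the validator-clean candidate with the best coverage, or None."""
    clean = [c for c in candidates if c.validator_passed]
    if not clean:
        return None  # no training example is emitted for this task
    # max() returns the first maximal item, so ties resolve by input order
    # and the choice stays deterministic.
    return max(clean, key=lambda c: c.coverage)
```

Because selection is rule-based rather than judged by another model, the same inputs always yield the same training example.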
The training — two findings that drove the gains. We fine-tuned with QLoRA (4-bit nf4 frozen base plus a LoRA adapter) on the single GPU box, served by vLLM with the adapter loaded natively (no merge). Each version was measured against the panel, and disciplined code review of the training pipeline turned up two high-leverage fixes that each produced a step-change in quality.
| Version | What changed | Result (OLMo W / T / L vs Claude, 6-section panel) |
|---|---|---|
| v0 | First fine-tune — held out the target workflows from training | 0 / 1 / 5 — lost badly |
| v2 | Trained on regulatory CMC / comparability + completion-only loss masking + 55 examples | 1 / 1 / 4 |
| v3 | 273 balanced examples + all fixes + stronger LoRA config | 2 / 2 / 2 — parity |
Bug 1 — we were holding out the exact workflows we were graded on. v0 put every regulatory-CMC and comparability example into the eval split, so the model never saw them in training — yet they were the entire test. The fix: train on them, and hold out a family (biosimilar) instead for an honest eval.
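A family-level holdout with a guard against this exact mistake could look like the sketch below. The dictionary keys and the placeholder family names are assumptions. The demo also assumes examples are spread evenly (3 variants per family and workflow), in which case holding out one of 8 families leaves 273 of 312 examples, matching the v3 training-set size.

```python
def split_by_family(examples: list[dict], holdout_family: str) -> tuple[list[dict], list[dict]]:
    """Hold out one whole molecule family while keeping every workflow in training."""
    train = [ex for ex in examples if ex["family"] != holdout_family]
    held_out = [ex for ex in examples if ex["family"] == holdout_family]
    # Guard against the v0 mistake: an evaluated workflow the model never trained on.
    missing = {ex["workflow"] for ex in held_out} - {ex["workflow"] for ex in train}
    if missing:
        raise ValueError(f"workflows absent from training: {sorted(missing)}")
    return train, held_out


# Placeholder family names; an even spread of 3 variants is an assumption.
families = ["biosimilar"] + [f"family-{i}" for i in range(1, 8)]
examples = [
    {"family": fam, "workflow": f"workflow-{w}", "variant": v}
    for fam in families
    for w in range(1, 14)
    for v in range(3)
]
train, held_out = split_by_family(examples, "biosimilar")
print(len(examples), len(train), len(held_out))
```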
Bug 2 — no completion-only loss masking. Loss was computed over the whole sequence, including the ~10 RAG chunks stuffed into the prompt — so 80–90% of the gradient was spent reproducing retrieved boilerplate, not the regulatory voice. The fix: patch the chat template with generation markers so loss lands only on the gold section. Combined with stronger learning dynamics (rank 16→32 with rsLoRA, lr 1e-4→2e-4, 3→10 epochs — the old run did only ~18 optimizer steps, far too few to learn a voice), output validity jumped from 2/6 to 5/6 valid sections.
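The effect of completion-only masking can be shown without any training framework. The sketch below builds labels by hand instead of using the chat-template generation markers described above, and the token counts are illustrative, not measured. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_prompt_labels(prompt_ids: list[int], completion_ids: list[int]) -> dict[str, list[int]]:
    """Build one training example whose loss covers only the completion tokens."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


# Illustrative sizes: a long RAG-stuffed prompt followed by a short gold section.
example = mask_prompt_labels(
    prompt_ids=list(range(900)),
    completion_ids=list(range(900, 1000)),
)
supervised = sum(1 for label in example["labels"] if label != IGNORE_INDEX)
print(f"{supervised} of {len(example['labels'])} tokens contribute to the loss")
```

Without the mask, every one of those prompt tokens would be a training target, which is how most of the gradient ends up spent on retrieved boilerplate.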
The result. Judged like a competition — GPT-5.5, 3 judges, position-swapped, majority vote — v3 reached parity on the hardest panel: the 2 most regulation-dense workflows across representative families, with the biosimilar family strictly held out of training.
| Section | Winner |
|---|---|
| bsAb comparability | OLMo (4–0) |
| mAb regulatory CMC | OLMo (3–0) |
| Biosimilar comparability | Tie (2–2) |
| Biosimilar regulatory CMC | Tie (3–3) |
| ADC regulatory CMC | Claude (0–6) |
| mAb comparability | Claude (0–5) |
2 wins / 2 ties / 2 losses = parity on the hardest panel. The local model went from winning none of the hard sections (v0: 0 wins, 1 tie, 5 losses) to matching the frontier model on exactly the regulatory-judgment workflows RAG couldn’t reach. In fresh blind head-to-head generation against other strong models through the same RAG pipeline, the blind GPT-5.5 judge preferred v3 decisively (net +5 to +7) on these in-domain sections, making it the preferred author rather than merely a competitive one. We froze v3 as the production-candidate: versioned, backed up to S3, and wired into the serving stack for a one-config cutover from the cloud model to local.
- vs Claude Opus 4.8 on the hardest workflows: Parity
- v3 fine-tune on the regulatory-judgment panel: 2W / 2T / 2L
- Best-of-N distilled training examples: 312
- Fixed on-prem cost when parked: ~$16/mo
The verdict
OLMo-3.1-32B is at parity with Claude Opus 4.8 for biologics CMC proposal generation — private, fixed-cost, and customer-controlled. RAG and prompt/enrichment techniques alone reached 7:6 on the broad workflow set; a QLoRA fine-tune (v3) then closed the last gap, reaching parity even on the regulatory-judgment workflows RAG couldn’t teach. Both results were confirmed by a robust multi-judge panel.
- Quality (RAG) — 7:6 vs Claude on the broad set, zero fine-tuning. OLMo wins the science-depth sections (analytical, DSP, stability, USP, comparability, tech-transfer).
- Quality (fine-tune) — 2 wins / 2 ties / 2 losses on the hardest panel — parity on exactly the regulatory-CMC and comparability workflows Claude used to dominate, plus a decisive blind-judge advantage on the broader in-domain set.
- Citations — 13/13 sound, zero fabrications.
- Speed — Full proposal ~60s; ~480 tok/s under concurrency.
- Privacy & cost — Fully on-prem, ~$16/mo parked. v3 is frozen as the production-candidate, ready for a one-config cutover from cloud to local.
How we got there: RAG and validators close the knowledge gap cheaply; fine-tuning closes the judgment gap RAG can’t. The biggest training wins came not from more data but from fixing bugs in our own pipeline (a holdout split that excluded the target workflows from training, and missing completion-only loss masking), both surfaced by disciplined code review. Best-of-N distillation (keeping the validator-clean winner between two teacher models) lifted the data ceiling above either teacher alone.
What we used
- RAG
- QLoRA fine-tuning
- Best-of-N distillation
- vLLM
- NVIDIA Blackwell
- OLMo-3.1-32B
- Claude Opus 4.8
- CMC proposals
- Biosimilars
- LLM-as-judge
- Self-hosted LLM
Lessons that generalise
- For RAG systems, never judge a model on closed-book recall. Test open-book — it’s a different (and far better) result.
- Decomposition plus concurrency beats single-shot generation for both speed and reliability on a local model. A full proposal as 13 concurrent sections finishes faster than one giant structured call, and the GPU batches the requests at no extra cost.
- Move rules into code. Deterministic validators make a weaker model safe; the model writes prose, code enforces structure and citations.
- The quality gap is mostly elaboration depth, not knowledge — and that’s the most fixable kind, addressable with prompting, few-shot, and a structured “specificity reference” in RAG.
- Use a judge panel, not one judge. A single LLM judge is noisy enough to flip a verdict (we saw 7:6 → 3:10 → 7:6). Majority-vote across position-swapped judges gives a defensible answer.
- Single-request latency is the wrong hardware metric for multi-tenant SaaS — concurrent throughput is what matters, and modern GPUs batch beautifully.
- RAG gets you to near parity cheaply; fine-tuning closes the judgment gap RAG can’t. Keep facts in RAG and teach the fine-tune behaviour, voice, and scope — not facts.
- The biggest fine-tuning wins came from fixing our own pipeline (a holdout split that excluded the target workflows from training, missing completion-only loss masking, too few optimizer steps), not from more data. Disciplined code review beat brute force.
- Best-of-N distillation beats a single teacher — keeping the validator-clean winner between two strong models lifts the data ceiling above either one alone.
We set out to prove a local model could replace a frontier cloud model here — and it does. RAG and prompt engineering reached 7:6 parity; the fine-tuning experiment we ran next closed the last regulatory-judgment gap, putting a self-hosted OLMo-3.1-32B at parity with Claude Opus 4.8, private and at fixed cost.
