Case Study - Reaching frontier-model quality with a local model

Can a self-hosted open model replace a frontier cloud model for biosimilar and monoclonal-antibody CMC proposals — at parity quality, with full data privacy and fixed cost? Retrieval engineering and prompt/enrichment got a local OLMo-3.1-32B model to 7:6 parity against Claude Opus 4.8 with zero fine-tuning; a QLoRA fine-tune then closed the last gap, reaching parity even on the regulatory-judgment workflows RAG couldn’t teach — all confirmed by a strict multi-judge panel.

Client: Synthesis API — biologics CDMO proposals
Year: 2026
Service: Local model engineering & evaluation

The problem

The proposal platform generates client-facing “Gene-to-GMP” CMC proposals across monoclonal antibodies (mAb), bispecific antibodies (bsAb), and biosimilars. Generation ran on Claude Opus 4.8 in the cloud. Two pressures pushed us toward a local model.

Privacy — Proposals carry sensitive CDMO pricing and scope. Multi-tenant customers (Aragen, Syngene, and others) want on-prem isolation, not “send our pricing to a cloud LLM.”
Cost at scale — Per-token cloud cost scales with volume; a self-hosted model is a fixed GPU cost.

The risk was the widely held assumption that open models are materially worse than frontier models. The job was to find out how much worse — and whether the gap is closable.

The hypothesis

The task is constrained, not open-ended: follow a fixed proposal structure, cite specific regulations, pull activities from a catalogue, and let a deterministic engine handle pricing. We hypothesised the frontier-model edge would be small for this shape of work, and that RAG plus validators could close most of it — because the hard part (knowledge and citations) is retrievable, not innate.

What we built

We built a decomposed, grounded generation pipeline — deliberately not a single giant LLM call.

Regulatory RAG corpus — 49 official guideline documents (ICH Q-series, 21 CFR, FDA/EMA/WHO biosimilar guidance, EU GMP annexes, CTD M4Q), each with provenance and trust tier. Embedded with OpenAI text-embedding-3-large; retrieval is filterable by jurisdiction and status, so drafts can never set “Required.”
Section-level generation — Each of the 13 workflow work-packages is generated independently and concurrently, not as one 9,000-token blob.
Deterministic validators — Code (not the model) enforces structure (at least 2 client inputs and at least 2 deliverables), citation traceability (every cited regulation id must appear in retrieved context), trap-phrase avoidance, and a repair loop that re-prompts on failure.

The design principle that proved decisive: move rules from prompt into code, and decompose the task into bounded pieces. A weaker model is made reliable by not asking it to hold everything at once.

The experiments

Experiment A — Closed-book recall (the wrong test). OLMo scored 0.40 on a 365-probe closed-book CMC knowledge eval (judged by GPT-5.5). It looked bad. The lesson: closed-book recall is the wrong metric for a RAG system — the model doesn’t need to memorise ICH Q5B, it needs to reason over the retrieved text.

Experiment B — Open-book, with RAG (the right test). The same model, given retrieved context, produced grounded, correctly-cited proposal sections. Across 11 molecule families × 13 workflows = 143 sections, the pipeline passed 143/143 with the repair loop. The lesson: RAG closed the knowledge gap completely.

Experiment C — Head-to-head vs Claude, identical pipeline. The real A/B: same RAG, same prompts, same validators, same repair loop — only the model was swapped. We generated all 13 sections with each model, judged by an independent panel.

Round	Configuration	Result (OLMo : Claude)
v1	Baseline RAG	3 : 10 — Claude clearly better
v2	+ specificity prompt + few-shot + CMC-specificity reference doc in RAG + enrichment pass	7 : 6 — OLMo ahead
v3	Over-tightened citations	3 : 10 — regressed (scope-creep)
v4	Scope-locked enrichment (deepen, don’t broaden)	7 : 6 — confirmed, more stable

Robust verification: the final tally was confirmed by a 3-judge panel per section, position-swapped, majority vote (78 GPT-5 judgments). Both v2 and v4 held at 7:6, with v4 showing cleaner consensus (more unanimous votes). The single-pass v3 swing proved that one judge is noise — panels are essential.

Closing the gap

A deep gap-analysis of every Claude win revealed the pattern: OLMo knew the rules but stayed generic where Claude got specific. Claude named exact attributes (deamidation, C-terminal lysine, afucosylation), named techniques (icIEF, CE-SDS, peptide mapping), mapped deliverables to CTD subsections, and reasoned about reference-product comparison. OLMo said “PTMs” and “orthogonal methods.”

The fix was four cheap, training-free levers.

Lever	What it did	Effect
1. Specificity-demanding prompt	Required named CQAs, named techniques, CTD mapping, biosimilar RP-logic	Forced elaboration
2. Few-shot exemplar	Showed the depth bar in-context	Model mimicked it
3. CMC Specificity Reference in RAG	A structured doc listing named attributes / techniques / CTD per workflow, making specifics retrievable	Turned an innate-capability gap into a retrieval problem
4. Scope-locked enrichment pass	Second model call: deepen existing points, do not broaden the topic	Added depth without scope-creep

Net effect: 3:10 → 7:6. The cheap levers didn’t just close the gap — they reversed it.

Two precision fixes added rigor. First, validator over-strictness: the citation checker wrongly flagged CTD coordinates (3.2.S.x / 3.2.P.x) as “unsupported.” A deep per-citation analysis of 6 “failing” sections showed 5 of 6 were validator false-positives — legitimate dossier structure, not fabricated citations. Whitelisting CTD coordinates lifted validation from 7/13 → 12/13.

Second, one genuine model error: OLMo cited “EU GMP Annex 2” for Qualified-Person batch certification, which is actually governed by Annex 16. The root cause was a retrieval gap (Annex 16 was missing from the corpus, and the generic query didn’t surface it), not a hallucination. Adding Annex 16 plus a QP-aware retrieval query fixed it: the section now cites Annex 16 correctly. Final validation: 13/13.

Performance on real hardware

We served OLMo-3.1-32B (BF16, 60 GB) on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) via vLLM.

Metric	Result
Single-request decode	~20 tok/s (BF16, memory-bandwidth bound — matches published 32B benchmarks)
Concurrent throughput (8 parallel)	~172 tok/s total, same wall-time as 1 request
Concurrent throughput (24 parallel)	~481 tok/s total, near-linear scaling
Full 13-section proposal (concurrent)	~60 seconds

The key insight: single-request speed is the wrong metric for a multi-tenant SaaS. The Blackwell GPU batches concurrent requests — 8 proposals finish in the time of 1. Decomposing each proposal into 13 concurrent sections means a full proposal lands in ~60s, and many tenants’ proposals run in parallel. Published benchmarks show this card sustaining ~930 tok/s at 50 concurrent users.

Hardware lessons learned:

BF16 32B (~64 GB) does not fit a 48 GB L40S — it needs quantization or a 96 GB card. The Blackwell fits it natively.
Blackwell (sm_120) needed CUDA 12.9+ and FlashInfer disabled (an arch-detection bug); FlashAttention-2 worked. Bleeding-edge hardware has driver/library rough edges.
Use a Deep Learning AMI (drivers pre-installed), not base Ubuntu.

Cost

State	Cost
Blackwell instance running	~$2.5–3.5 / hr
Stopped (model + weights persist)	~$16 / month (EBS only)
Cloud frontier model	per-token, scales with volume

For a high-volume, multi-tenant proposal product, the fixed-cost local model wins economically at scale — and is the only option that satisfies on-prem data privacy.

Closing the last gap with fine-tuning

Reaching 7:6 on RAG alone was the right first milestone — but the panel pinpointed exactly where to invest next. The two most regulation-dense workflows, regulatory CMC and comparability / biosimilarity, still lost to Claude, because those sections need learned regulatory judgment — what to include, correct CTD mapping, scope discipline — not retrievable facts. That was the textbook case for fine-tuning, so we ran the experiment deliberately: a targeted QLoRA adapter to teach the judgment RAG can’t. It paid off — the local model went from losing every hard section to matching the frontier model on exactly those workflows.

The data — best-of-N distillation. With no real CDMO proposals to train on, we synthesised the training set: for each (family × workflow × variant) task we generated a section with both Claude Opus 4.8 and GPT-5.5 through the same RAG pipeline, then kept the deterministic winner (validator pass plus completeness coverage). The result was 312 validator-clean examples across all 8 families × 13 workflows, with teacher wins splitting Claude 187 / GPT-5.5 125 — best-of-N genuinely raised the data ceiling above either single teacher. Facts stay in RAG; the fine-tune teaches behaviour, voice, and scope — not facts.

The training — two findings that drove the gains. We fine-tuned with QLoRA (4-bit nf4 frozen base plus a LoRA adapter) on the single GPU box, served by vLLM with the adapter loaded natively (no merge). Each version was measured against the panel, and disciplined code review of the training pipeline turned up two high-leverage fixes that each produced a step-change in quality.

Version	What changed	Result (OLMo W / T / L vs Claude, 6-section panel)
v0	First fine-tune — held out the target workflows from training	0 / 1 / 5 — lost badly
v2	Trained on regcmc / comparability + completion-only loss masking + 55 examples	1 / 1 / 4
v3	273 balanced examples + all fixes + stronger LoRA config	2 / 2 / 2 — parity

Bug 1 — we were holding out the exact workflows we were graded on. v0 put every regulatory-CMC and comparability example into the eval split, so the model never saw them in training — yet they were the entire test. The fix: train on them, and hold out a family (biosimilar) instead for an honest eval.

Bug 2 — no completion-only loss masking. Loss was computed over the whole sequence, including the ~10 RAG chunks stuffed into the prompt — so 80–90% of the gradient was spent reproducing retrieved boilerplate, not the regulatory voice. The fix: patch the chat template with generation markers so loss lands only on the gold section. Combined with stronger learning dynamics (rank 16→32 with rsLoRA, lr 1e-4→2e-4, 3→10 epochs — the old run did only ~18 optimizer steps, far too few to learn a voice), output validity jumped from 2/6 to 5/6 valid sections.

The result. Judged like a competition — GPT-5.5, 3 judges, position-swapped, majority vote — v3 reached parity on the hardest panel: the 2 most regulation-dense workflows across representative families, with the biosimilar family strictly held out of training.

Section	Winner
bsAb comparability	OLMo (4–0)
mAb regulatory CMC	OLMo (3–0)
Biosimilar comparability	Tie (2–2)
Biosimilar regulatory CMC	Tie (3–3)
ADC regulatory CMC	Claude (0–6)
mAb comparability	Claude (0–5)

2 wins / 2 ties / 2 losses = parity on the hardest panel. The local model went from losing every hard section (v0) to matching the frontier model on exactly the regulatory-judgment workflows RAG couldn’t reach. In fresh blind head-to-head generation against other strong models through the same RAG pipeline, v3 won the blind GPT-5.5 judge decisively (net +5 to +7) on these in-domain sections — the preferred author, not just a competitive one. We froze v3 as the production-candidate: versioned, backed up to S3, and wired into the serving stack for a one-config cutover from the cloud model to local.

vs Claude Opus 4.8 on the hardest workflows: Parity
v3 fine-tune on the regulatory-judgment panel: 2W / 2T / 2L
Best-of-N distilled training examples: 312
Fixed on-prem cost when parked: ~$16/mo

The verdict

OLMo-3.1-32B is at parity with Claude Opus 4.8 for biologics CMC proposal generation — private, fixed-cost, and customer-controlled. RAG and prompt/enrichment techniques alone reached 7:6 on the broad workflow set; a QLoRA fine-tune (v3) then closed the last gap, reaching parity even on the regulatory-judgment workflows RAG couldn’t teach. Both results were confirmed by a robust multi-judge panel.

Quality (RAG) — 7:6 vs Claude on the broad set, zero fine-tuning. OLMo wins the science-depth sections (analytical, DSP, stability, USP, comparability, tech-transfer).
Quality (fine-tune) — 2 wins / 2 ties / 2 losses on the hardest panel — parity on exactly the regulatory-CMC and comparability workflows Claude used to dominate, plus a decisive blind-judge advantage on the broader in-domain set.
Citations — 13/13 sound, zero fabrications.
Speed — Full proposal ~60s; ~480 tok/s under concurrency.
Privacy & cost — Fully on-prem, ~$16/mo parked. v3 is frozen as the production-candidate, ready for a one-config cutover from cloud to local.

How we got there: RAG and validators close the knowledge gap cheaply; fine-tuning closes the judgment gap RAG can’t. The biggest training wins came not from more data but from fixing bugs in our own pipeline — a leaked holdout split and missing loss masking — surfaced by disciplined code review. Best-of-N distillation (keeping the validator-clean winner between two teacher models) lifted the data ceiling above either teacher alone.

What we used

RAG
QLoRA fine-tuning
Best-of-N distillation
vLLM
NVIDIA Blackwell
OLMo-3.1-32B
Claude Opus 4.8
CMC proposals
Biosimilars
LLM-as-judge
Self-hosted LLM

Lessons that generalise

For RAG systems, never judge a model on closed-book recall. Test open-book — it’s a different (and far better) result.
Decompose plus concurrency beats single-shot for both speed and reliability on a local model. A full proposal as 13 concurrent sections finishes faster than one giant structured call — and the GPU batches it for free.
Move rules into code. Deterministic validators make a weaker model safe; the model writes prose, code enforces structure and citations.
The quality gap is mostly elaboration depth, not knowledge — and that’s the most fixable kind, addressable with prompting, few-shot, and a structured “specificity reference” in RAG.
Use a judge panel, not one judge. A single LLM judge is noisy enough to flip a verdict (we saw 7:6 → 3:10 → 7:6). Majority-vote across position-swapped judges gives a defensible answer.
Single-request latency is the wrong hardware metric for multi-tenant SaaS — concurrent throughput is what matters, and modern GPUs batch beautifully.
RAG gets you to near parity cheaply; fine-tuning closes the judgment gap RAG can’t. Keep facts in RAG and teach the fine-tune behaviour, voice, and scope — not facts.
The biggest fine-tuning wins came from fixing our own pipeline (a leaked holdout split, missing completion-only loss masking, too few optimizer steps), not from more data. Disciplined code review beat brute force.
Best-of-N distillation beats a single teacher — keeping the validator-clean winner between two strong models lifts the data ceiling above either one alone.

We set out to prove a local model could replace a frontier cloud model here — and it does. RAG and prompt engineering reached 7:6 parity; the fine-tuning experiment we ran next closed the last regulatory-judgment gap, putting a self-hosted OLMo-3.1-32B at parity with Claude Opus 4.8, private and at fixed cost.

prag-matic, Engineering

Our office

Follow us