Case Study - Reaching frontier-model quality with a local model

Can a self-hosted open model replace a frontier cloud model for biosimilar and monoclonal-antibody CMC proposals — at parity quality, with full data privacy and fixed cost? Through retrieval engineering, deterministic validation, and prompt/enrichment techniques, with zero fine-tuning, a local OLMo-3.1-32B model reached 7:6 parity against Claude Opus 4.8 on a strict, multi-judge head-to-head.

Client: Synthesis API (biologics CDMO proposals)
Service: Local model engineering & evaluation

The problem

The proposal platform generates client-facing “Gene-to-GMP” CMC proposals across monoclonal antibodies (mAb), bispecific antibodies (bsAb), and biosimilars. Generation ran on Claude Opus 4.8 in the cloud. Two pressures pushed us toward a local model.

  • Privacy — Proposals carry sensitive CDMO pricing and scope. Multi-tenant customers (Aragen, Syngene, and others) want on-prem isolation, not “send our pricing to a cloud LLM.”
  • Cost at scale — Per-token cloud cost scales with volume; a self-hosted model is a fixed GPU cost.

The risk was the widely held assumption that open models are materially worse than frontier models. The job was to find out how much worse — and whether the gap is closable.

The hypothesis

The task is constrained, not open-ended: follow a fixed proposal structure, cite specific regulations, pull activities from a catalogue, and let a deterministic engine handle pricing. We hypothesised the frontier-model edge would be small for this shape of work, and that RAG plus validators could close most of it — because the hard part (knowledge and citations) is retrievable, not innate.

What we built

We built a decomposed, grounded generation pipeline — deliberately not a single giant LLM call.

  • Regulatory RAG corpus: 49 official guideline documents (ICH Q-series, 21 CFR, FDA/EMA/WHO biosimilar guidance, EU GMP annexes, CTD M4Q), each with provenance and a trust tier. Embedded with OpenAI text-embedding-3-large; retrieval is filterable by jurisdiction and status, so a draft guideline can never be the basis for marking an item “Required.”
  • Section-level generation: each of the 13 workflow work-packages is generated independently and concurrently, not as one 9,000-token blob.
  • Deterministic validators: code (not the model) enforces structure (at least 2 client inputs and at least 2 deliverables), citation traceability (every cited regulation id must appear in the retrieved context), and trap-phrase avoidance; a repair loop re-prompts the model on failure.
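
The validator and repair-loop idea can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Section` shape, the `[REG:...]` citation marker, the trap-phrase list, and the `generate` callback are all hypothetical names introduced for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical trap phrases; the real list is not given in the case study.
TRAP_PHRASES = ("guaranteed approval", "fully compliant with all regulations")

# Hypothetical citation marker format, e.g. [REG:ICH-Q5B].
CITATION_RE = re.compile(r"\[REG:([A-Za-z0-9_.\-]+)\]")


@dataclass
class Section:
    client_inputs: list[str]
    deliverables: list[str]
    body: str


def validate(section: Section, retrieved_ids: set[str]) -> list[str]:
    """Return human-readable failures; an empty list means the section passes."""
    failures = []
    if len(section.client_inputs) < 2:
        failures.append("needs at least 2 client inputs")
    if len(section.deliverables) < 2:
        failures.append("needs at least 2 deliverables")
    # Citation traceability: every cited id must be present in the retrieved context.
    for reg_id in CITATION_RE.findall(section.body):
        if reg_id not in retrieved_ids:
            failures.append(f"citation {reg_id} not in retrieved context")
    lowered = section.body.lower()
    for phrase in TRAP_PHRASES:
        if phrase in lowered:
            failures.append(f"trap phrase present: {phrase!r}")
    return failures


def generate_with_repair(generate, retrieved_ids: set[str], max_attempts: int = 3):
    """Call generate(feedback) until the section validates or attempts run out.

    `generate` stands in for the model call; it receives the previous failures
    so the re-prompt can name exactly what to fix.
    """
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        section = generate(feedback)
        feedback = validate(section, retrieved_ids)
        if not feedback:
            return section, attempt
    raise ValueError(f"section still failing after {max_attempts} attempts: {feedback}")
```

Passing the failure list back into the next prompt keeps the model's job narrow: it only has to fix named problems, while the pass/fail decision stays in code.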

The design principle that proved decisive: move rules from prompt into code, and decompose the task into bounded pieces. A weaker model is made reliable by not asking it to hold everything at once.
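
As a rough sketch of the decomposition principle, the 13 work-packages can be fanned out as independent, bounded calls and gathered at the end. The work-package names and the `generate_section` stub below are placeholders; in a real pipeline that coroutine would wrap the retrieval, generation, and validation steps for one section.

```python
import asyncio

# Placeholder names for the 13 workflow work-packages (the real names are not listed here).
WORK_PACKAGES = [f"WP{i:02d}" for i in range(1, 14)]


async def generate_section(name: str, delay_s: float = 0.05) -> str:
    """Stand-in for one bounded model call; sleeps instead of calling a model."""
    await asyncio.sleep(delay_s)
    return f"[{name}] draft"


async def generate_proposal(names: list[str], max_concurrency: int = 13) -> dict[str, str]:
    """Generate every section concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(name: str) -> str:
        async with sem:
            return await generate_section(name)

    # gather preserves input order, so results line up with names.
    results = await asyncio.gather(*(bounded(n) for n in names))
    return dict(zip(names, results))


if __name__ == "__main__":
    proposal = asyncio.run(generate_proposal(WORK_PACKAGES))
    print(len(proposal), "sections generated")
```

The semaphore is one possible design choice: it lets several tenants share the same server without any single proposal flooding the request queue.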

The experiments

Experiment A — Closed-book recall (the wrong test). OLMo scored 0.40 on a 365-probe closed-book CMC knowledge eval (judged by GPT-5.5). It looked bad. The lesson: closed-book recall is the wrong metric for a RAG system — the model doesn’t need to memorise ICH Q5B, it needs to reason over the retrieved text.

Experiment B — Open-book, with RAG (the right test). The same model, given retrieved context, produced grounded, correctly-cited proposal sections. Across 11 molecule families × 13 workflows = 143 sections, the pipeline passed 143/143 with the repair loop. The lesson: RAG closed the knowledge gap completely.

Experiment C — Head-to-head vs Claude, identical pipeline. The real A/B: same RAG, same prompts, same validators, same repair loop — only the model was swapped. We generated all 13 sections with each model, judged by an independent panel.

Round | Configuration | Result (OLMo : Claude)
--- | --- | ---
v1 | Baseline RAG | 3 : 10 (Claude clearly better)
v2 | + specificity prompt + few-shot + CMC-specificity reference doc in RAG + enrichment pass | 7 : 6 (OLMo ahead)
v3 | Over-tightened citations | 3 : 10 (regressed, scope-creep)
v4 | Scope-locked enrichment (deepen, don’t broaden) | 7 : 6 (confirmed, more stable)

Robust verification: the final tally was confirmed by a 3-judge panel per section, position-swapped, with a majority vote (78 GPT-5 judgments). Both v2 and v4 held at 7:6, with v4 showing cleaner consensus (more unanimous votes). The single-pass v3 swing showed that a single judge is too noisy to trust; panels are essential.
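
A position-swapped panel can be tallied along these lines. The exact aggregation rule used in the study is not spelled out, so this sketch assumes the simplest reading: each of 3 judges sees both orderings, every positional verdict is mapped back to the model in that slot, and the majority wins the section. Under that assumption the count works out to 13 sections × 3 judges × 2 orderings = 78 judgments. The function names and tuple layout are illustrative.

```python
from collections import Counter


def section_winner(judgments):
    """judgments: iterable of (verdict, model_in_slot_a, model_in_slot_b),
    where verdict is "A" or "B". Returns the majority model, or "tie"."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("no judgments supplied")
    # Map each positional verdict back to the model that occupied that slot.
    votes = Counter(a if verdict == "A" else b for verdict, a, b in judgments)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def tally(per_section):
    """per_section: dict of section name -> list of judgments."""
    return Counter(section_winner(j) for j in per_section.values())
```

A judge that always prefers slot A contributes one vote to each model once the orderings are swapped, so pure position bias cancels out instead of deciding the section.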

Closing the gap

A deep gap-analysis of every Claude win revealed the pattern: OLMo knew the rules but stayed generic where Claude got specific. Claude named exact attributes (deamidation, C-terminal lysine, afucosylation), named techniques (icIEF, CE-SDS, peptide mapping), mapped deliverables to CTD subsections, and reasoned about reference-product comparison. OLMo said “PTMs” and “orthogonal methods.”

The fix was four cheap, training-free levers.

Lever | What it did | Effect
--- | --- | ---
1. Specificity-demanding prompt | Required named CQAs, named techniques, CTD mapping, biosimilar RP-logic | Forced elaboration
2. Few-shot exemplar | Showed the depth bar in-context | Model mimicked it
3. CMC Specificity Reference in RAG | A structured doc listing named attributes / techniques / CTD per workflow, making specifics retrievable | Turned an innate-capability gap into a retrieval problem
4. Scope-locked enrichment pass | Second model call: deepen existing points, do not broaden the topic | Added depth without scope-creep

Net effect: 3:10 → 7:6. The cheap levers did more than close the gap: they tipped the tally narrowly in OLMo’s favour.

Two precision fixes added rigor. First, validator over-strictness: the citation checker wrongly flagged CTD coordinates (3.2.S.x / 3.2.P.x) as “unsupported.” A deep per-citation analysis of 6 “failing” sections showed 5 of 6 were validator false-positives — legitimate dossier structure, not fabricated citations. Whitelisting CTD coordinates lifted validation from 7/13 → 12/13.
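
The whitelist fix can be illustrated with a small pattern check. This is a sketch under assumptions: the regular expression covers only the 3.2.S.x and 3.2.P.x coordinates named above, and the function name and citation-id strings are invented for the example.

```python
import re

# CTD Module 3 coordinates such as 3.2.S.4.1 or 3.2.P.5 are dossier locations,
# not regulations, so they should not need a matching retrieved document.
CTD_COORD_RE = re.compile(r"3\.2\.[SP](?:\.\d+)*")


def unsupported_citations(cited: list[str], retrieved_ids: set[str]) -> list[str]:
    """Return cited ids that are neither retrieved nor whitelisted CTD coordinates."""
    return [
        c for c in cited
        if c not in retrieved_ids and not CTD_COORD_RE.fullmatch(c)
    ]


if __name__ == "__main__":
    cited = ["ICH-Q5B", "3.2.S.4.1", "3.2.P.5", "EU-GMP-ANNEX-16"]
    print(unsupported_citations(cited, {"ICH-Q5B"}))
```

Using `fullmatch` rather than a prefix match keeps the whitelist narrow, so a malformed string that merely starts with "3.2.S" is still flagged.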

Second, one genuine model error: OLMo cited “EU GMP Annex 2” for Qualified-Person batch certification, which is actually governed by Annex 16. The root cause was a retrieval gap (Annex 16 was missing from the corpus, and the generic query didn’t surface it), not a hallucination. Adding Annex 16 plus a QP-aware retrieval query fixed it: the section now cites Annex 16 correctly. Final validation: 13/13.

Performance on real hardware

We served OLMo-3.1-32B (BF16, ~64 GB of weights, roughly 60 GiB) on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) via vLLM.

Metric | Result
--- | ---
Single-request decode | ~20 tok/s (BF16, memory-bandwidth bound; matches published 32B benchmarks)
Concurrent throughput (8 parallel) | ~172 tok/s total, same wall-time as 1 request
Concurrent throughput (24 parallel) | ~481 tok/s total, near-linear scaling
Full 13-section proposal (concurrent) | ~60 seconds

The key insight: single-request speed is the wrong metric for a multi-tenant SaaS. vLLM batches concurrent requests on the Blackwell GPU, so 8 parallel requests finish in roughly the wall-time of 1. Decomposing each proposal into 13 concurrent sections means a full proposal lands in ~60s, and many tenants’ proposals can run in parallel. Published benchmarks show this card sustaining ~930 tok/s at 50 concurrent users.
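
The scaling claim can be checked with a few lines of arithmetic on the measured figures. The helper names are illustrative, and because every input is an approximate ("~") measurement, the efficiency values are only indicative.

```python
# Aggregate decode throughput (tok/s) at each concurrency level, as measured above.
MEASURED = {1: 20.0, 8: 172.0, 24: 481.0}


def per_request_rate(concurrency: int, total_tok_s: float) -> float:
    """Average decode speed seen by each individual request."""
    return total_tok_s / concurrency


def scaling_efficiency(concurrency: int, total_tok_s: float, single_tok_s: float) -> float:
    """1.0 means perfectly linear scaling relative to the single-request rate."""
    return total_tok_s / (concurrency * single_tok_s)


if __name__ == "__main__":
    single = MEASURED[1]
    for n, total in MEASURED.items():
        print(n, round(per_request_rate(n, total), 2),
              round(scaling_efficiency(n, total, single), 3))
```

Each request still decodes at about 20 tok/s at 8 and 24 parallel requests, which is why the wall-time of a batch stays close to that of a single request. Efficiencies slightly above 1.0 simply reflect rounding in the ~20 tok/s baseline.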

Hardware lessons learned:

  • BF16 32B (~64 GB) does not fit a 48 GB L40S — it needs quantization or a 96 GB card. The Blackwell fits it natively.
  • Blackwell (sm_120) needed CUDA 12.9+ and FlashInfer disabled (an arch-detection bug); FlashAttention-2 worked. Bleeding-edge hardware has driver/library rough edges.
  • Use a Deep Learning AMI (drivers pre-installed), not base Ubuntu.
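
The memory lesson follows from simple arithmetic on the weights alone. The sketch below assumes exactly 32 billion parameters and ignores KV cache, activations, and runtime overhead, so real headroom is smaller than shown; the function names are illustrative.

```python
def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def weight_footprint_gib(n_params: float, bytes_per_param: float) -> float:
    """Weight size in binary gibibytes (1 GiB = 2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


def weights_fit(n_params: float, bytes_per_param: float, vram_gb: float) -> bool:
    """True if the weights alone fit; says nothing about KV cache headroom."""
    return weight_footprint_gb(n_params, bytes_per_param) < vram_gb


if __name__ == "__main__":
    params = 32e9  # assumed round parameter count
    print(weight_footprint_gb(params, 2))     # BF16: 2 bytes per parameter
    print(weights_fit(params, 2, 48))         # 48 GB L40S
    print(weights_fit(params, 2, 96))         # 96 GB Blackwell
    print(weights_fit(params, 0.5, 48))       # 4-bit quantization on the L40S
```

The same weights are about 64 GB in decimal units and about 59.6 GiB in binary units, which may explain why both "60 GB" and "64 GB" figures circulate for this model.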

Cost

State | Cost
--- | ---
Blackwell instance running | ~$2.5–3.5 / hr
Stopped (model + weights persist) | ~$16 / month (EBS only)
Cloud frontier model | per-token, scales with volume

For a high-volume, multi-tenant proposal product, the fixed-cost local model wins economically at scale — and is the only option that satisfies on-prem data privacy.
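
To make "wins at scale" concrete, the hourly rate can be converted into an effective price per million generated tokens and compared with a per-token price. This is a back-of-the-envelope sketch: it counts generated tokens only, uses the ~481 tok/s figure measured at 24 parallel requests, and treats the cloud price as a caller-supplied placeholder because no cloud price is given here.

```python
def tokens_per_hour(tok_per_s: float, utilization: float = 1.0) -> float:
    """Generated tokens per hour at a given sustained utilization (0 to 1)."""
    return tok_per_s * 3600 * utilization


def local_cost_per_million(hourly_usd: float, tok_per_s: float,
                           utilization: float = 1.0) -> float:
    """Effective USD per million generated tokens on the fixed-cost instance."""
    return hourly_usd / (tokens_per_hour(tok_per_s, utilization) / 1e6)


def breakeven_utilization(hourly_usd: float, tok_per_s: float,
                          cloud_usd_per_million: float) -> float:
    """Utilization above which the local instance is cheaper per token.

    cloud_usd_per_million is a hypothetical input, not a quoted price.
    """
    return local_cost_per_million(hourly_usd, tok_per_s) / cloud_usd_per_million


if __name__ == "__main__":
    for hourly in (2.5, 3.5):
        print(hourly, round(local_cost_per_million(hourly, 481.0), 2))
    # Placeholder cloud price of 10 USD per million tokens, purely for illustration.
    print(round(breakeven_utilization(3.0, 481.0, 10.0), 3))
```

At full utilization the instance works out to roughly $1.44 to $2.02 per million generated tokens; at low utilization the fixed hourly cost dominates, which is why the economic argument depends on volume.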

  • Parity vs Claude Opus 4.8: 7:6
  • Sections passed with RAG + repair loop: 143/143
  • Full 13-section proposal (concurrent): ~60s
  • Fixed on-prem cost when parked: ~$16/mo

The verdict

OLMo-3.1-32B, with retrieval engineering and prompt/enrichment techniques, is at parity with Claude Opus 4.8 for biosimilar CMC proposal generation — confirmed 7:6 by a robust multi-judge panel, with zero fine-tuning.

  • Quality — 7:6 vs Claude (panel-confirmed). OLMo wins the science-depth sections (analytical, DSP, stability, USP, comparability, tech-transfer); Claude edges the regulatory-judgment/scope sections (CLD, monoclonality, RegCMC).
  • Citations — 13/13 sound, zero fabrications.
  • Speed — Full proposal ~60s; ~480 tok/s under concurrency.
  • Privacy & cost — Fully on-prem, ~$16/mo parked.

The remaining gap, and how to close it: the 6 sections Claude still wins are regulatory judgment and scope — knowing what’s appropriate to include, not just what’s specific. This is the one thing RAG and prompting can’t fully teach. The path from parity to dominance is fine-tuning (LoRA) on Aragen’s real past proposals — teaching house voice, scope judgement, and the elaboration patterns Aragen uses. Data: Aragen’s annotated proposals plus Claude-distilled synthetic pairs (filtered by the validators). That is the next investment, not a prerequisite — parity is already achieved.

What we used

  • RAG
  • vLLM
  • NVIDIA Blackwell
  • OLMo-3.1-32B
  • Claude Opus 4.8
  • CMC proposals
  • Biosimilars
  • LLM-as-judge
  • Self-hosted LLM

Lessons that generalise

  • For RAG systems, never judge a model on closed-book recall. Test open-book — it’s a different (and far better) result.
  • Decomposition plus concurrency beats single-shot generation for both speed and reliability on a local model. A full proposal as 13 concurrent sections finishes faster than one giant structured call, and the GPU batches it for free.
  • Move rules into code. Deterministic validators make a weaker model safe; the model writes prose, code enforces structure and citations.
  • The quality gap is mostly elaboration depth, not knowledge — and that’s the most fixable kind, addressable with prompting, few-shot, and a structured “specificity reference” in RAG.
  • Use a judge panel, not one judge. A single LLM judge is noisy enough to flip a verdict (we saw 7:6 → 3:10 → 7:6). Majority-vote across position-swapped judges gives a defensible answer.
  • Single-request latency is the wrong hardware metric for multi-tenant SaaS — concurrent throughput is what matters, and modern GPUs batch beautifully.

prag-matic, Engineering
