Case Studies - The work behind the work

Honest, technical write-ups of how we solve hard problems — the experiments that failed, the metrics that misled us, and the decisions that moved the needle. No marketing gloss, just the engineering.

Case studies

Document-generation AI — regulated life-sciences workflow

AI inference cost & latency engineering

2026

Making a frontier model ~52% cheaper and ~74% faster — without downgrading it

A document-generation product ran every long, multi-section proposal as one serial call to a premium frontier model (Claude Opus 4.8) — accurate, but slow (~160 s) and expensive. Instead of swapping in a cheaper model, we kept Opus 4.8 and re-engineered how we call it: independent section groups generated concurrently, scoped calls with prompt caching on shared context, and partial-failure salvage.
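
The call pattern described above can be sketched roughly as follows. This is a minimal illustration, not the production code: `generate_group` is a hypothetical stand-in for one scoped model call, the section-group names are made up, and prompt caching is only noted in a comment because it depends on the provider SDK in use.

```python
import asyncio


async def generate_group(name: str, fail: bool = False) -> str:
    # Hypothetical stand-in for one scoped model call. A real call would send
    # the shared (cacheable) context plus only this group's instructions.
    await asyncio.sleep(0.01)
    if fail:
        raise RuntimeError(f"generation failed for {name}")
    return f"[draft of {name}]"


async def generate_proposal(groups, failing=frozenset()):
    # Run the independent section groups concurrently. return_exceptions=True
    # keeps one failed group from discarding the groups that succeeded.
    results = await asyncio.gather(
        *(generate_group(g, fail=g in failing) for g in groups),
        return_exceptions=True,
    )
    sections, failed = {}, []
    for name, result in zip(groups, results):
        if isinstance(result, BaseException):
            failed.append(name)      # candidate for a targeted retry
        else:
            sections[name] = result  # salvaged output
    return sections, failed


def run_demo():
    groups = ["summary", "scope", "timeline", "pricing"]  # illustrative names
    return asyncio.run(generate_proposal(groups, failing=frozenset({"timeline"})))
```

In this sketch a failed group only needs its own call retried, and wall-clock time is bounded by the slowest group rather than the sum of all groups.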

On the same frontier model, per-proposal cost dropped ~52% (≈$0.87 → ≈$0.42) and end-to-end latency ~74% (≈160 s → ≈41 s), with output quality held — verified by an independent judge, not by the team that built it.
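
The rounded percentages follow directly from the approximate before and after figures quoted above; a quick check, using only those rounded numbers:

```python
def percent_reduction(before: float, after: float) -> float:
    # Relative reduction, expressed as a percentage of the starting value.
    return (before - after) / before * 100


cost_cut = percent_reduction(0.87, 0.42)   # USD per proposal (approximate)
latency_cut = percent_reduction(160, 41)   # seconds end to end (approximate)
print(f"cost: {cost_cut:.0f}%  latency: {latency_cut:.0f}%")
# prints "cost: 52%  latency: 74%"
```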

The reflex is to swap the frontier model for a cheaper one. We didn’t. We kept Opus 4.8 and fixed how we called it — one long serial call became concurrent, scoped section groups. Same model, same quality, ~52% cheaper and ~74% faster, with parity we could prove.

prag-matic, Engineering

Voice AI companion — crisis-safety routing

AI evaluation & safety engineering

2026

An adversarial safety harness for a voice AI agent

We set out to prove, automatically, that a voice agent keeps a distressed caller safe. A synthetic adversarial caller joins a real voice room and speaks crisis scenarios out loud; an LLM judge then scores the real agent on its exact words. Getting a trustworthy answer meant fixing the test harness more than the product: every “agent failure” turned out to be the harness mis-capturing what the agent actually said.

Once we judged the agent’s complete verbatim transcript from the database instead of a lossy re-transcription, the verdict flipped from “fails” to “passes” on all four crisis categories — and the test got sharp enough to critique therapeutic quality, not just safety.
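
As a rough sketch of what “judging from the database” can look like, the snippet below assembles the agent's turns from stored rows rather than from re-transcribed audio. The `turns` table, its columns, and the sample rows are all hypothetical; the real schema is not described here.

```python
import sqlite3


def verbatim_agent_transcript(conn: sqlite3.Connection, session_id: str) -> str:
    # Read the agent's turns exactly as stored, in order, so the judge sees
    # the complete text instead of a lossy re-transcription.
    rows = conn.execute(
        "SELECT text FROM turns "
        "WHERE session_id = ? AND speaker = 'agent' ORDER BY seq",
        (session_id,),
    ).fetchall()
    return "\n".join(text for (text,) in rows)


def demo() -> str:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE turns (session_id TEXT, seq INTEGER, speaker TEXT, text TEXT)"
    )
    # Made-up rows for illustration only.
    conn.executemany(
        "INSERT INTO turns VALUES (?, ?, ?, ?)",
        [
            ("s1", 1, "caller", "I don't think I can keep going."),
            ("s1", 2, "agent", "I'm really glad you told me."),
            ("s1", 3, "agent", "Are you safe right now?"),
            ("s2", 1, "agent", "Unrelated session."),
        ],
    )
    return verbatim_agent_transcript(conn, "s1")
```

The string returned by `demo()` is what would be handed to the judge prompt in this sketch; nothing in it passes through speech recognition.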

You cannot judge an AI you cannot faithfully observe. Once we read the agent’s verbatim words from the source of truth instead of re-transcribing its audio, the safety verdict flipped from fail to pass — and the harness graduated from “does it pass?” to “where is the craft thin?”

prag-matic, Engineering

Synthesis API — biologics CDMO proposals

Local model engineering

2026

Reaching frontier-model quality with a local model

A self-hosted OLMo-3.1-32B model reached parity with Claude Opus 4.8 on biologics CMC proposal generation — confirmed by a robust multi-judge panel. Retrieval engineering and prompt/enrichment got it to 7:6; a deliberate QLoRA fine-tuning experiment then closed the last gap on the regulatory-judgment workflows RAG couldn’t teach.
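
One simple way a multi-judge panel can be aggregated is a per-section majority vote followed by a tally. The sketch below assumes that scheme; the labels `"local"`, `"frontier"` and `"tie"` and the aggregation rule are illustrative assumptions, not a description of the panel actually used.

```python
from collections import Counter


def panel_verdict(votes: list[str]) -> str:
    # Majority vote across judges for one section; no clear majority is a tie.
    if not votes:
        raise ValueError("at least one judge vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def tally(section_votes: dict[str, list[str]]) -> Counter:
    # Count how many sections each side wins after the per-section vote.
    return Counter(panel_verdict(v) for v in section_votes.values())
```

For example, `tally({"a": ["local", "local", "frontier"], "b": ["frontier", "frontier", "local"]})` counts one section for each side.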

The payoff: a frontier-quality proposal engine running fully on-prem, customer-controlled, at a fixed monthly cost.

RAG gets you to near-parity cheaply; the fine-tuning experiment we ran next closed the judgment gap RAG can’t — taking the local model from losing every hard section to matching Claude on exactly those workflows.

prag-matic, Engineering

Our office

Bangalore
Nubewired Software Technologies Pvt. Ltd.
#213, Rainmakers Workspace, 2nd Floor
Ramanashree Arcade 18, MG Road
Bangalore - 560001, Karnataka, India
CIN: U62013KA2024PTC186730