Live in production · Case Study

Lemma — How It Works

A commercialization evaluation platform for Technology Transfer Offices. Upload a research paper and five grounded AI agents score its commercial potential, then compose a fully-cited investor deck — on durable serverless workflows that survive failures mid-run.

5 AI Agents
3 Adversarial Critiques
3× Retries per Step
3 Export Formats
The Problem

First-pass research evaluation is slow — and naive LLMs make it worse

Technology Transfer Offices sit on backlogs of research papers and need to decide which are worth commercializing. Doing this manually takes domain experts days per paper. Doing it with a single LLM prompt is fast but untrustworthy: market sizes get invented, readiness levels are guessed, and nothing in the output can be traced back to a source. Lemma's design goal was an automated first-pass evaluation in which every factual claim is traceable end-to-end — from a slide bullet in the exported deck back to a retrieved URL or an upstream agent finding.

Architecture

Three runtime planes

API routes stay deliberately fast; all LLM work happens in a durable background pipeline; progress reaches the browser through real-time events with polling as a fallback. Each plane can fail independently without taking down the others.
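One way to keep the notification plane optional is to treat every push as best-effort. The sketch below is a minimal illustration, not the Lemma code: the `Publisher` interface, the `publishStage` helper, and the `stage` event name are assumptions, and only the `project-{id}` channel format comes from this page.

```typescript
// Hypothetical publisher shape; a real Pusher client would be wrapped to match it.
interface Publisher {
  trigger(channel: string, event: string, data: unknown): void;
}

// Best-effort push: returns false instead of throwing, so a missing or failing
// publisher never breaks the pipeline. Polling still reads status from storage.
function publishStage(publisher: Publisher | null, projectId: string, stage: string): boolean {
  if (publisher === null) return false; // push is not configured at runtime
  try {
    publisher.trigger(`project-${projectId}`, "stage", { stage }); // "stage" is an assumed event name
    return true;
  } catch {
    return false; // swallow push errors; the /status endpoint is the fallback
  }
}
```

Because the function never throws, a pipeline step can call it after every stage without adding a new failure mode.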

① Synchronous Plane: Next.js 14 App Router API routes · Vercel
POST /api/upload: PDF → Cloudflare R2
POST /…/analyze: sets status = PROCESSING, fires Inngest event · rate-limited 5/user/hr (Upstash Redis)
GET /…/status: lightweight polling endpoint
POST /…/export: render deck → PDF / PPTX / DOCX → R2

event: paper/uploaded

② Asynchronous Plane (Durable Execution): Inngest · one function, each agent in its own step.run()
Agent Pipeline: 5 agents + 3 critique passes, sequential steps with 3× retry; a failed step retries without re-running earlier (expensive) agents
Gemini (LLM): direct REST generateContent · per-agent model + thinking config with same-provider fallback
Tavily (search): live web retrieval for market evidence, behind a swappable search interface
Neon Postgres: Prisma 7 · one schema-validated model per agent output

channel: project-{id}

③ Notification Plane: either path works alone; Pusher is optional at runtime
Pusher events: real-time stage updates pushed to the browser
Status polling: browser polls the lightweight /status endpoint as fallback

Supporting services
Clerk: auth, with middleware gating the workspace
Cloudflare R2: PDFs in, deck exports out (S3 SDK)
Upstash Redis: per-user rate limiting
Zod: every agent output validated before persistence
Puppeteer + pptxgenjs + docx: deterministic deck rendering, no LLM in the export path
Agent Workflow

The analysis pipeline, step by step

Each agent's output is validated against a Zod schema before persistence, and the same schema is converted to a Gemini responseSchema, so the model is constrained at generation time and checked at parse time. Steps labeled CRITIQUE are adversarial critique passes.

UPLOAD: PDF → R2 → Inngest event (in: PDF · out: paper/uploaded)

The analyze endpoint only flips status to PROCESSING and fires the event — no LLM work on the request path, so the API responds in milliseconds.
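A rough sketch of what such a request path could look like, assuming injected dependencies rather than the real Prisma, Upstash, and Inngest clients. The `AnalyzeDeps` interface, the `handleAnalyze` name, the event payload, and the 202/429 status codes are all hypothetical; the `PROCESSING` status and the `paper/uploaded` event name are taken from this page.

```typescript
// Hypothetical dependency shapes standing in for the rate limiter, database, and event client.
interface AnalyzeDeps {
  allowRequest(userId: string): Promise<boolean>;
  setStatus(projectId: string, status: "PROCESSING"): Promise<void>;
  sendEvent(name: string, data: { projectId: string }): Promise<void>;
}

interface AnalyzeResult {
  status: number;
  body: { ok: boolean; error?: string };
}

// No LLM call happens here: the handler checks the rate limit, flips the
// status, fires the event, and returns. All heavy work runs in the background.
async function handleAnalyze(deps: AnalyzeDeps, userId: string, projectId: string): Promise<AnalyzeResult> {
  if (!(await deps.allowRequest(userId))) {
    return { status: 429, body: { ok: false, error: "rate limit exceeded" } };
  }
  await deps.setStatus(projectId, "PROCESSING");
  await deps.sendEvent("paper/uploaded", { projectId }); // payload shape is an assumption
  return { status: 202, body: { ok: true } };
}
```

Injecting the dependencies keeps the handler testable without a database or an event broker.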

AGENT 1: Paper Analyst (in: PDF · out: PaperData)

Extracts abstract, novelty, domain, and key claims with confidence levels. Rejects non-research documents (financial reports, textbooks) outright.

CRITIQUE: Skeptical Review (in: PaperData + PDF · out: findings)

An adversarial agent audits the analysis against the source PDF.

⚠ CRITICAL findings trigger exactly one regeneration of Agent 1 — bounded by design so critique loops can't run away. An unusable critique falls back to the original analysis.
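The bounded critique rule can be sketched as a small generic function. This is an illustration under assumptions: `runWithCritique`, the `Finding` shape, and the `MINOR` severity are hypothetical names, the real calls would be asynchronous LLM requests, and whether the regenerated draft is critiqued again is not stated here (the sketch assumes it is not).

```typescript
type Severity = "CRITICAL" | "MINOR";

interface Finding {
  severity: Severity;
  note: string;
}

// `generate` and `critique` stand in for LLM calls (async in practice).
// `critique` returns null when its own output is unusable.
function runWithCritique<T>(
  generate: (feedback: Finding[]) => T,
  critique: (output: T) => Finding[] | null,
): { output: T; regenerated: boolean } {
  const first = generate([]);
  const findings = critique(first);
  if (findings === null) return { output: first, regenerated: false }; // fall back to the original
  const critical = findings.filter((f) => f.severity === "CRITICAL");
  if (critical.length === 0) return { output: first, regenerated: false };
  // Exactly one regeneration. Assumption: the second draft is not critiqued
  // again, which is what keeps the loop from running away.
  return { output: generate(critical), regenerated: true };
}
```

The bound lives in the control flow itself: there is no loop to misconfigure, so the worst case is two generation calls and one critique call.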
AGENT 2: TRL/IRL Scorer (in: PaperData only · out: TrlIrlData)

Scores Technology and Investment Readiness Levels, suggests a commercialization pathway (spin-off / licensing / partnership), and flags risks.

✓ Deliberately never re-reads the PDF — it consumes Agent 1's structured output, keeping the context small and the reasoning auditable.
AGENT 3: Market Scout, retrieval then synthesis (in: domain + claims · out: MarketData + sources)

Stage 1 is retrieval-only: Tavily searches for competitors, funding signals, patents, and market sizing; results are persisted verbatim. Stage 2 lets Gemini see only the retrieved sources — a validator rejects any figure whose sourceUrl isn't in the retrieved set, feeding errors back into a bounded retry loop (max 3 attempts).

✓ Skip-don't-fail: if retrieval or synthesis fails, the pipeline continues without market data instead of dying. Ungroundable figures come back null — never invented.
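The closed-world check and its bounded retry can be illustrated in a few lines. This is a sketch under assumptions, not the production validator: `MarketFigure`, `validateGrounding`, and `synthesizeMarket` are hypothetical names, and `synthesize` stands in for the Gemini call, which would be asynchronous in practice. The `sourceUrl` field, the three-attempt limit, and the null-for-ungroundable rule come from the description above.

```typescript
interface MarketFigure {
  label: string;
  value: number | null; // null means "could not be grounded"
  sourceUrl: string | null;
}

// Returns one error per figure that cites a URL outside the retrieved set.
function validateGrounding(figures: MarketFigure[], retrievedUrls: Set<string>): string[] {
  const errors: string[] = [];
  for (const figure of figures) {
    if (figure.value === null) continue; // ungroundable figures are allowed as null
    if (figure.sourceUrl === null || !retrievedUrls.has(figure.sourceUrl)) {
      errors.push(`${figure.label}: sourceUrl is not in the retrieved set`);
    }
  }
  return errors;
}

// Bounded retry: validation errors from one attempt are fed into the next.
// Returns null after maxAttempts so the caller can skip market data instead of failing.
function synthesizeMarket(
  synthesize: (priorErrors: string[]) => MarketFigure[],
  retrievedUrls: Set<string>,
  maxAttempts = 3,
): MarketFigure[] | null {
  let errors: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const figures = synthesize(errors);
    errors = validateGrounding(figures, retrievedUrls);
    if (errors.length === 0) return figures;
  }
  return null;
}
```

Because the validator checks set membership rather than asking a model whether a source looks plausible, a fabricated URL cannot pass by sounding convincing.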
AGENT 4: Feasibility Scout (in: Agents 1+2 [+market] · out: FeasibilityData)

Pure reasoning, no retrieval. Timeline and capital estimates are explicit ranges with required confidence and reasoning fields — the schema rejects max ≤ min as a false-precision guard.

CRITIQUE: Feasibility Audit (traceability + over-confidence check)

Audits reasoning traceability and over-confident estimates; one regeneration allowed on CRITICAL findings.

AGENT 5: Pitch Builder (in: all upstream outputs · out: DeckData.slides)

Composes investor-deck slides. A ref menu enumerates every citable upstream fact (e.g. market.tam, paper.keyClaims[2]) and is the model's only citation vocabulary, so a claim that cites anything outside the menu is rejected rather than shipped.

⚠ Two guardrails: a structural validator rejects invented refs and source-URL mismatches; a semantic critique flags claims whose values don't match upstream findings.
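A ref menu and its structural check could be built roughly as follows. The sketch assumes upstream outputs are plain JSON-like objects; `buildRefMenu`, `SlideBullet`, and `findInventedRefs` are hypothetical names, while the ref format (`market.tam`, `paper.keyClaims[2]`) follows the examples above. It covers only the invented-ref check, not the source-URL or semantic checks.

```typescript
// Flattens upstream outputs into citable refs such as "market.tam" or "paper.keyClaims[2]".
function buildRefMenu(value: unknown, prefix = "", menu: Set<string> = new Set()): Set<string> {
  if (Array.isArray(value)) {
    value.forEach((item, i) => buildRefMenu(item, `${prefix}[${i}]`, menu));
  } else if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      buildRefMenu(record[key], prefix === "" ? key : `${prefix}.${key}`, menu);
    }
  } else if (value !== null && value !== undefined) {
    menu.add(prefix); // leaf fact: string, number, or boolean
  }
  return menu;
}

interface SlideBullet {
  text: string;
  refs: string[];
}

// Structural check: every cited ref must exist in the menu.
function findInventedRefs(bullets: SlideBullet[], menu: Set<string>): string[] {
  const invented: string[] = [];
  for (const bullet of bullets) {
    for (const ref of bullet.refs) {
      if (!menu.has(ref)) invented.push(ref);
    }
  }
  return invented;
}
```

In this sketch null leaves never enter the menu, so a figure that came back ungrounded upstream cannot be cited downstream either.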
REVIEW: Human Review → Export (out: PDF · PPTX · DOCX)

TTO staff evaluate results before export. All three exporters consume the same normalized RenderDeck model, so the formats cannot disagree on content or grounding — and every export renders visible per-slide Sources citations.

Technical Challenges → Solutions

The hard parts, and how they were solved

Every problem below is a general production-LLM problem — context limits, hallucination, orchestration, long-running tasks — solved with specific, verifiable mechanisms in the Lemma codebase.

Challenge · Context Windows

A full paper doesn't fit every prompt

Re-feeding the entire PDF to all five agents would blow up token costs, latency, and attention quality — later agents would drown in raw text.

Solution · Structured Hand-offs

Only Agent 1 reads the PDF. Every downstream agent consumes compact, schema-validated structured outputs (PaperData, TrlIrlData…) — prompt chaining with typed contracts instead of raw-text relay. The TRL scorer is deliberately forbidden from re-reading the PDF.
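A typed hand-off might look like the sketch below. Only the type names `PaperData` and `TrlIrlData`, the listed Agent 1 fields, and the three pathway values come from this page; the exact field names, the confidence scale, the `TrlScorer` type, and `buildTrlPrompt` are illustrative assumptions.

```typescript
// Field names are illustrative; the real schemas may differ.
interface KeyClaim {
  claim: string;
  confidence: "low" | "medium" | "high";
}

interface PaperData {
  abstract: string;
  novelty: string;
  domain: string;
  keyClaims: KeyClaim[];
}

interface TrlIrlData {
  trl: number;
  irl: number;
  pathway: "spin-off" | "licensing" | "partnership";
  risks: string[];
}

// The contract: the scorer receives PaperData only. There is no PDF parameter to pass.
type TrlScorer = (paper: PaperData) => Promise<TrlIrlData>;

// The prompt is assembled from structured fields, never from raw paper text.
function buildTrlPrompt(paper: PaperData): string {
  const claims = paper.keyClaims
    .map((c, i) => `${i + 1}. ${c.claim} (confidence: ${c.confidence})`)
    .join("\n");
  return [
    `Domain: ${paper.domain}`,
    `Novelty: ${paper.novelty}`,
    `Abstract: ${paper.abstract}`,
    "Key claims:",
    claims,
    "Score TRL and IRL from the fields above only.",
  ].join("\n");
}
```

Encoding the restriction in the function signature means a later refactor cannot quietly reintroduce the PDF without changing the type.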

Challenge · Hallucination

LLMs invent market figures

Market sizing is exactly where investors check numbers — an invented TAM kills credibility, and a single-prompt approach invents them constantly.

Solution · Closed-World Retrieval

Retrieval and synthesis are split into separate stages. The synthesis model sees only persisted Tavily results; a validator rejects any figure citing a URL outside the retrieved set and feeds the Zod errors back into a bounded retry prompt. Ungroundable figures return null.

Challenge · Long-Running Tasks

Serverless functions time out

A five-agent pipeline with retries runs for minutes — far beyond Vercel's request limits — and a crash at agent 4 must not re-bill agents 1–3.

Solution · Durable Execution (Inngest)

The whole pipeline is one Inngest function with each agent isolated in its own step.run(). Steps are checkpointed: a failed step retries up to 3× without re-running earlier, expensive agents. The API route just fires an event and returns — async by construction.
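The checkpointing behaviour can be illustrated with an in-memory stand-in. This is explicitly not the Inngest API: `StepRunner` and `pipeline` are hypothetical, execution is synchronous for brevity, and "3×" is read here as up to three retries after the first attempt.

```typescript
// In-memory stand-in for durable step execution; not the Inngest API.
class StepRunner {
  private checkpoints = new Map<string, unknown>();
  private maxRetries: number;

  constructor(maxRetries = 3) {
    this.maxRetries = maxRetries;
  }

  // Returns the checkpointed result if the step already succeeded; otherwise
  // runs it, retrying up to maxRetries times after the first attempt.
  run<T>(id: string, fn: () => T): T {
    if (this.checkpoints.has(id)) return this.checkpoints.get(id) as T;
    let lastError: unknown = new Error(`step ${id} did not run`);
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        const result = fn();
        this.checkpoints.set(id, result);
        return result;
      } catch (error) {
        lastError = error;
      }
    }
    throw lastError;
  }
}

// Each agent is isolated in its own step, so a retry of the scorer
// never re-runs (or re-bills) the analyst.
function pipeline(
  step: StepRunner,
  agents: { analyze: () => string; score: (paper: string) => number },
): number {
  const paper = step.run("agent-1-paper-analyst", agents.analyze);
  return step.run("agent-2-trl-scorer", () => agents.score(paper));
}
```

Re-invoking `pipeline` with the same runner replays finished steps from their checkpoints, which is the property that makes a mid-run crash cheap.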

Challenge · Orchestration

Five agents must stay mutually consistent

Multi-agent pipelines drift: a deck slide can quietly contradict the feasibility estimate it was supposedly built from.

Solution · Ref-Menu Citations + Critique Agents

The pitch builder may only cite facts from an enumerated menu of upstream findings, with source URLs carried through unchanged. Adversarial critique agents audit at three points (paper, feasibility, pitch), each allowed exactly one regeneration to prevent loops.

Challenge · Partial Failure

External dependencies flake

Web search goes down, models return 503s mid-pipeline. Failing the whole run for a missing market section wastes everything already computed.

Solution · Explicit Failure Semantics

Each stage declares its failure mode: core agents retry, then fail the project with a notification; the market and feasibility stages are skipped rather than failing the run; and the deck omits the market slide when market data is absent. The Gemini client falls back to a same-provider backup model on retryable errors.
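Declared failure modes can be modeled as data on each stage. The sketch below is a simplified illustration: `FailureMode`, `Stage`, `RunResult`, and `runStages` are hypothetical names, and each `run` stands in for an agent call that has already exhausted its own retries.

```typescript
type FailureMode = "fail-project" | "skip";

interface Stage {
  name: string;
  onFailure: FailureMode;
  run: () => unknown; // stands in for an agent call after its retries are spent
}

interface RunResult {
  status: "COMPLETE" | "FAILED";
  outputs: Record<string, unknown>;
  skipped: string[];
}

function runStages(stages: Stage[]): RunResult {
  const outputs: Record<string, unknown> = {};
  const skipped: string[] = [];
  for (const stage of stages) {
    try {
      outputs[stage.name] = stage.run();
    } catch {
      if (stage.onFailure === "fail-project") {
        return { status: "FAILED", outputs, skipped }; // caller notifies the user
      }
      skipped.push(stage.name); // downstream stages see this output as absent
    }
  }
  return { status: "COMPLETE", outputs, skipped };
}
```

Making the failure mode a field rather than scattered try/catch blocks keeps the policy reviewable in one place.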

Challenge · Schema Drift

LLM output shapes are unreliable

Free-form JSON from a model breaks parsers, and provider-side schema support doesn't cover constraints like ranges, unions, or regex patterns.

Solution · Dual Enforcement (Zod + responseSchema)

One Zod schema per agent is both converted into a Gemini responseSchema (constraining generation) and run as safeParse before persistence (source of truth). Constraints Gemini can't express are still enforced by Zod — e.g. rejecting capital ranges where max ≤ min.
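The dual-enforcement idea can be shown without any library. The sketch below uses a tiny hand-rolled schema as a stand-in for Zod, so `AgentSchema`, `toResponseSchema`, and this `safeParse` are illustrative and not the Zod or Gemini APIs; the uppercase type names mirror the style of Gemini's schema types but should be treated as an assumption here.

```typescript
// Tiny stand-in for a Zod schema: field types plus an optional cross-field rule.
type FieldType = "string" | "number";

interface AgentSchema {
  fields: Record<string, FieldType>;
  refine?: (value: Record<string, unknown>) => string | null; // error message or null
}

// Generation-time view: only what a provider schema can express.
function toResponseSchema(schema: AgentSchema) {
  const properties: Record<string, { type: string }> = {};
  for (const name of Object.keys(schema.fields)) {
    properties[name] = { type: schema.fields[name].toUpperCase() };
  }
  return { type: "OBJECT", properties, required: Object.keys(schema.fields) };
}

type ParseResult =
  | { success: true; data: Record<string, unknown> }
  | { success: false; errors: string[] };

// Parse-time view: type checks plus the rule the provider schema cannot carry.
function safeParse(schema: AgentSchema, raw: unknown): ParseResult {
  if (typeof raw !== "object" || raw === null) {
    return { success: false, errors: ["expected an object"] };
  }
  const value = raw as Record<string, unknown>;
  const errors: string[] = [];
  for (const name of Object.keys(schema.fields)) {
    if (typeof value[name] !== schema.fields[name]) {
      errors.push(`${name}: expected ${schema.fields[name]}`);
    }
  }
  if (errors.length === 0 && schema.refine) {
    const message = schema.refine(value);
    if (message !== null) errors.push(message);
  }
  return errors.length === 0 ? { success: true, data: value } : { success: false, errors };
}

// Example: a capital range where max must exceed min.
const capitalRange: AgentSchema = {
  fields: { min: "number", max: "number", confidence: "string", reasoning: "string" },
  refine: (v) => ((v.max as number) > (v.min as number) ? null : "max must be greater than min"),
};
```

Note that `toResponseSchema` drops the `refine` rule entirely, which is why the parse-time check has to remain the source of truth.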

Tech Stack

Every layer, and why it's there

Verified against the public repository — no résumé padding.

Layer: Technology
Framework: Next.js 14 (App Router) · TypeScript · React 18
Database: Neon Postgres via Prisma 7, one model per agent output, multi-tenant by institution
Background jobs: Inngest, for durable execution with step-level checkpointing and retries
LLM: Google Gemini via direct REST, with per-agent model + thinking-level config and an automatic fallback model
Web search: Tavily, behind a swappable search-client interface
Validation: Zod, with every agent output schema-validated before persistence
Auth: Clerk, with middleware-gated workspace and onboarding
Storage: Cloudflare R2 (S3 SDK), PDFs in and deck exports out
Real-time: Pusher (optional at runtime) + status polling fallback
Rate limiting: Upstash Redis, 5 analyses per user per hour
Deck export: puppeteer-core + @sparticuz/chromium (PDF) · pptxgenjs (PPTX) · docx (DOCX)
Testing: Vitest, including fault injection via a swappable LLM transport
Product

What it looks like

Lemma — research to fundable spinout
Lemma — research portfolio dashboard
Lemma — six-stage analysis pipeline

Want to talk architecture?

I'm happy to walk through any of these decisions — the trade-offs, what broke, and what I'd do differently at 10× scale.