Thought Leadership

Why We Curate, Not Scrape — Building Clinical-Grade Genomic Intelligence

A hand-built 150,000-reference knowledge graph beats a scraped one — and in clinical genomics, that difference is measured in lives, not latency.

Pranav Anam · Founder, The Gene Box·July 3, 2026·6 min read

There is a shortcut available to anyone building a genomics platform today: point a large language model at the internet, let it read everything, and ask it to explain what a variant means. It is fast, it is cheap, and it is wrong often enough to be dangerous. We chose the slower path on purpose.

When a clinician acts on a genomic report — adjusts a drug, orders a follow-up, counsels a family — they are trusting the evidence underneath that report. Not the interface. Not the AI. The evidence. And the honest truth about genomic evidence is that it is messy, contested, and constantly moving. Findings get overturned. Preprints get walked back. A gene–drug interaction that looked solid in one population turns out not to replicate in another. Any system that reads the literature indiscriminately inherits all of that noise and presents it with the same confident tone as settled science.

A wrong gene–drug call isn't a bug. It's a clinical event.

The core wager behind The Gene Box

The problem with scraping science

Scraping optimises for coverage. It will happily ingest a case report, a review article, a retracted paper, and a marketing blog post, and treat them as roughly equivalent inputs. Language models layer a second problem on top: they generate fluent, plausible text whether or not the underlying claim is true, and they cannot reliably tell you which parts of an answer are grounded and which are confabulated. In most domains that produces an embarrassing mistake. In genomics it produces a recommendation that a real person might follow.

The failure modes are specific and predictable. Evidence that was never replicated gets cited as established. A study's caveats — the sample size, the ancestry of the cohort, the effect size — get stripped away in summarisation. Retractions and corrections, which are exactly the signal you most need, are the easiest thing to miss. The result reads beautifully and cannot be trusted.

What curation actually means here

Our knowledge graph is built by hand, in stages, with a human accountable at each one. Sources are selected, not swept in. Claims are extracted and normalised into a consistent structure so that a gene–drug relationship from one database can be reconciled against another. Every relationship is graded for the strength of evidence behind it, so a well-replicated finding and a single-study hint are never presented as equals. And nothing is published into the graph without expert validation. It is a five-stage pipeline — source, extract, normalise, validate, publish — and the whole point of it is that a person can stand behind every edge in the graph.

This is why the platform can say why it reached a conclusion, not just what the conclusion is. Every interpretation traces back to graded, human-validated evidence — the difference between a report a clinician can defend and one they simply have to trust.

Why curation compounds

Here is the part that makes the slow path worth it: a curated knowledge graph gets more trustworthy over time, while a scraped one inherits the web's decay. Every validated relationship we add strengthens the connections around it. Corrections propagate. Grading improves as evidence accumulates. The asset appreciates. A system that re-scrapes the internet on every query, by contrast, is only ever as reliable as the noisiest thing it read this morning.

What this means if you put your brand on it

Most of the organisations we work with — hospitals, diagnostic labs, wellness clinics — deliver these reports under their own name. Their patients never see ours. That arrangement only works if the floor is genuinely clinical-grade, because the reputational risk sits with them. Curation is not a philosophical preference we indulge; it is the precondition for a partner being willing to sign their name to the output.

Scraping would have let us launch sooner and claim a bigger number. Curation is the reason a clinician can read one of our reports, follow the reasoning to its evidence, and decide for themselves whether to act on it. In this field, that is the only kind of intelligence worth building. You can see how the pipeline works on our science page, or explore the platform it powers.

Why We Curate, Not Scrape — Building Clinical-Grade Genomic Intelligence

The problem with scraping science

What curation actually means here

Why curation compounds

What this means if you put your brand on it

Keep reading

The Economics of Precision Health for Hospitals — ROI Without a Lab

How Multi-Omic Testing Changed an Athlete's Career

See it under your brand