Personal Science Week - 260423 OpenEvidence
Testing a doctor-grade AI search engine
Personal scientists get help wherever we can, including from LLMs. But for serious medical advice, should we rely on consumer-focused Claude and ChatGPT or go for something that medical doctors use?
This week we evaluate OpenEvidence, a site that promises high-quality medical search results—for doctors only.
More and more people are using LLMs for medical questions. According to OpenAI’s January 2026 report, more than 5% of all ChatGPT messages globally are about healthcare — billions of messages a week. About 40 million people ask ChatGPT a healthcare question every day. Naturally, this is making a lot of experts nervous. What if the chatbot gets it wrong? What if a patient acts on bad advice?
Personal scientists take the worry seriously but draw a different conclusion. We trust no one — not a chatbot, not an “expert,” not a glossy consensus guideline — because nobody cares more about your health, or your family’s health, than you do. It’s true that a year or two ago, LLMs were often deceptively inaccurate, but they’ve gotten substantially better, even over the past six months, and they continue to improve.
Still, when you really need the right answer, maybe a general-purpose chatbot isn’t enough. So, on the recommendation of a doctor I respect, I tried OpenEvidence.
OpenEvidence is a retrieval-augmented LLM optimized for clinical decision-making. Give it a clinical question and it searches peer-reviewed literature — NEJM, JAMA, The Lancet, Cochrane, plus 300+ journals and FDA/CDC guidance — then synthesizes an answer grounded in those papers with inline citations you can click through to the abstracts. And it’s very popular: About 40% of US physicians use it, across more than 10,000 hospitals.
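Under the hood, the pattern is the familiar retrieve-then-synthesize loop. Here’s a toy sketch of that shape in Python (my illustration, with placeholder names and a crude keyword ranker; OpenEvidence’s actual corpus and ranking are proprietary):

```python
# Toy sketch of the retrieve-then-synthesize pattern. Function names and the
# keyword ranker are placeholders; OpenEvidence's real pipeline is proprietary.
from dataclasses import dataclass

@dataclass
class Paper:
    pmid: str
    title: str
    abstract: str

def retrieve(question: str, corpus: list[Paper], top_k: int = 3) -> list[Paper]:
    # Real systems rank with dense embeddings; crude keyword overlap stands in here.
    words = set(question.lower().split())
    return sorted(corpus,
                  key=lambda p: -len(words & set(p.abstract.lower().split())))[:top_k]

def build_prompt(question: str, papers: list[Paper]) -> str:
    # Ground the model in the retrieved abstracts and demand inline citations.
    sources = "\n".join(f"[PMID {p.pmid}] {p.title}: {p.abstract}" for p in papers)
    return (f"Answer using ONLY the sources below, citing [PMID] after each claim.\n\n"
            f"{sources}\n\nQuestion: {question}")
```

The hard parts OpenEvidence sells (the licensed corpus, the medically tuned ranking) all live inside retrieve.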
One catch: it’s only for doctors. Registration requires a National Provider Identifier (NPI) — the 10-digit number CMS issues to healthcare providers. No, you can’t just get your own NPI. It’s a federal crime to imitate a doctor, so don’t even try.
The gate isn’t really about protecting the public from dangerous information. It’s a condition of the publishing deals — and a feature of the business model. OpenEvidence is monetized through pharmaceutical and medical-device advertising at reported CPMs of $70 to $1,000+ (vs. $5–15 for social media). The audience being sold to advertisers is specifically prescribers. Non-prescribers aren’t the customer, so keeping us out is the point.
PSWeek has discussed other academic-focused research tools over the years — Elicit (PSWeek240222), Consensus (PSWeek250601), FutureHouse (PSWeek250612). Most of them are happy to take anyone’s money. OpenEvidence is the first one I’ve run into that actively refuses civilians.
Testing It Anyway
Fortunately, OpenEvidence’s homepage accepts queries without full registration, presumably rate-limited by IP, though I couldn’t confirm how many you get before being cut off. Enough to run one real test.
I gave both OpenEvidence and Claude the same question, a topic I’ve been thinking about as a possible self-experiment (see PSWeek240905): does aortic pulse wave velocity (PWV) predict cardiovascular events in adults under 50, and which interventions actually lower it?
This is a good stress test: it spans diagnostic and therapeutic literature, the age qualifier forces the model beyond easy geriatric studies, and the evidence base is genuinely complex.
What OpenEvidence delivered
A clean, authoritative, well-cited response. Every claim linked to a specific PubMed ID. The key references were exactly what a cardiologist would cite.
The quantitative specifics were precise:
7% increased cardiovascular risk per 1 m/s increase in aortic PWV after full adjustment,
approximately doubled risk in participants aged 60 or younger,
aerobic exercise reductions of −0.75 to −1.02 m/s in central PWV,
typical programs of 40 minutes × 3 days/week × 11 weeks.
It reads like an UpToDate entry — and for a physician with fifteen minutes between patients, that’s exactly what you want.
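Out of curiosity, here’s what those numbers imply if you combine them. This is my back-of-the-envelope arithmetic, not OpenEvidence’s, and it assumes the per-m/s risk gradient applies multiplicatively in reverse to an exercise-induced reduction, which is exactly the kind of causal leap the literature doesn’t yet license:

```python
# Back-of-the-envelope: combine the quoted hazard ratio with the quoted
# exercise effect. Assumes (my assumption, not OpenEvidence's) that the
# ~7%-per-m/s risk gradient applies multiplicatively in reverse.

hr_per_ms = 1.07                     # hazard ratio per +1 m/s aortic PWV
exercise_deltas = [-0.75, -1.02]     # central PWV change from aerobic exercise, m/s

for delta in exercise_deltas:
    implied = hr_per_ms ** delta     # risk scales as hr_per_ms ** (change in m/s)
    print(f"PWV change {delta:+.2f} m/s -> relative risk {implied:.2f} "
          f"(~{(1 - implied) * 100:.0f}% reduction)")
```

Call it a 5–7% relative risk reduction, if the association were causal. Hold on to that “if” for the next section.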
What Claude delivered
A longer, more discursive, and far more epistemically honest answer.
Where OpenEvidence stated, with high confidence, that PWV predicts events in adults under 50, Claude flagged exactly where that evidence gets thin: the key meta-analyses skew heavily toward older cohorts, and the subgroup data for younger adults comes with wide confidence intervals, among other caveats.
Claude also raised points you won’t see in a clinical summary, including a physics-level critique of the statistical methodology. It was similarly nuanced about interventions: aerobic exercise got the strongest endorsement (consistent with OpenEvidence), but Claude was honest about the weaker evidence for alternatives like statins and omega-3s.
Best of all, Claude added something OpenEvidence never would: a personal science section on what it would actually take to run a self-experiment on PWV, including the measurement-noise problem and the distinction between consumer brachial-ankle PWV devices and the gold-standard carotid-femoral method.
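To make the measurement-noise point concrete, here’s a minimal power calculation. Every number below is a placeholder I picked for illustration, not a figure from either model’s answer:

```python
# Minimal power calculation for a before/after PWV self-experiment.
# All numbers are hypothetical placeholders, not taken from either model.
from statistics import NormalDist
from math import ceil

sd = 0.8        # assumed test-retest SD of a consumer brachial-ankle device, m/s
delta = 0.9     # change you hope to detect, m/s (roughly the exercise effect above)
alpha, power = 0.05, 0.80

z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for a two-sided 5% test
z_b = NormalDist().inv_cdf(power)           # ~0.84 for 80% power

# Standard two-sample formula: measurements needed in each phase (before/after)
n = ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2)
print(f"~{n} measurements per phase")       # ~13 with these assumptions
```

With noise on the same order as the effect, a single reading before and after tells you essentially nothing; you need a dozen or so averaged measurements per phase before the signal emerges.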
Why Claude Wins (For Personal Scientists)
The comparison illustrates the difference between “what” knowledge and “how” knowledge — a distinction we’ve been exploring in PSWeek for a while now.
OpenEvidence knows how to answer your question: efficiently, accurately, with citations. Claude helps more with what to look for in the first place, something only a human can decide.
To be fair to OpenEvidence, part of the divergence is by design. A tool built to serve a physician at the point of care should deliver a crisp, actionable summary, not a rumination on the limits of the literature. But that’s also the limitation: the same interface that is perfect for a busy clinician is wrong for a personal scientist deciding whether a biomarker is worth tracking. For that job, Claude’s critical thinking is far more valuable, precisely because it’s less confident and more epistemically careful. The best medical-AI experience for a personal scientist turns out to be the general-purpose frontier model we already have.
My recommendation for 2026: Use frontier models as your primary medical research tool. If you want extra confidence in the results — especially for anything you might act on — ask a different frontier model to check the work. Run the same question through Claude and ChatGPT (or Gemini, or Grok). Where they agree, you can be fairly confident. Where they diverge, you’ve found exactly the places where the evidence is genuinely uncertain — which is the most valuable information of all.
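If you want to make that cross-check a habit, it’s only a few lines of code. A minimal sketch using the official Anthropic and OpenAI Python SDKs (the model IDs and the question are placeholders; substitute whatever is current):

```python
# Cross-check one medical question across two frontier models.
# Model IDs below are placeholders; substitute whatever is current.
import anthropic
import openai

QUESTION = "Does aortic pulse wave velocity predict cardiovascular events in adults under 50?"

claude = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from the environment
claude_answer = claude.messages.create(
    model="claude-sonnet-4-5",          # placeholder model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": QUESTION}],
).content[0].text

gpt = openai.OpenAI()                   # reads OPENAI_API_KEY from the environment
gpt_answer = gpt.chat.completions.create(
    model="gpt-5",                      # placeholder model ID
    messages=[{"role": "user", "content": QUESTION}],
).choices[0].message.content

# Read both side by side; the disagreements are the interesting part.
print("=== Claude ===\n", claude_answer)
print("=== ChatGPT ===\n", gpt_answer)
```

Where the two answers diverge, that’s your reading list.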
Personal Science Weekly Readings
Speaking of medical advice, we should also mention three recent studies that argue the case against trusting chatbots:
Tiller et al. in BMJ Open red-teamed five popular models with 50 questions each on cancer, vaccines, stem cells, nutrition, and athletic performance; half the answers were “problematic” and no model got every answer right. But come on! These were ancient February 2025 free models, with prompts deliberately designed to elicit bad answers. I’m not persuaded.
Rao et al. in JAMA Network Open tested 21 frontier LLMs (including GPT-5, Claude 4.5 Opus, Grok 4) on 29 standardized clinical vignettes and found that even top models fail more than 80% of the time at generating alternate diagnoses — though they nail the final diagnosis once enough information is on the table.
The most uncomfortable result comes from Bean et al. in Nature Medicine, a preregistered randomized trial with 1,298 UK participants: the LLMs alone correctly identified the relevant condition in ~95% of cases, but participants using the same models got it right only ~34% of the time — no better than people Googling the NHS website. But again, they were using models from 2024 (ancient history) and these were not sophisticated users.
My main complaint with all three studies: none of them includes a physician baseline on the same tasks. The interesting question isn’t whether the chatbot is sometimes wrong — it’s whether it’s wronger than the alternative the patient would otherwise have used. In my experience, a skeptical, well-prompted user with a frontier general-purpose LLM and the habit of cross-checking can do better than 95% of the population using the exact same model. The implication for personal scientists is the one we keep relearning: the tool is fine; the bottleneck is the person operating it.
About Personal Science
The NPI wall isn’t going away — but neither is the basic truth behind personal science: you are the most interested, most patient, and most appropriately paranoid researcher of your own body. No credentialing system selects for that. The good news is that the tools a curious amateur can reach in 2026 are genuinely good enough to do the work — and in some dimensions, better than the ones gated behind professional credentials.
As we like to say: nullius in verba — take no one’s word for it. Not even the AI’s.
Personal Science is delivered every Thursday to anyone who thinks science is just as useful for everyday, personal reasons as it is for professionals. If you have topics you’d like us to explore, please let us know.