Benchmark Results

Measured against the standard — and ahead of it

400 peer-reviewed clinical vignettes. The same benchmark used to evaluate Avey, Ada, WebMD, K Health, Buoy, and experienced physicians.

Top-3 Diagnostic Accuracy: 91.7% (Hammoud et al. 400-vignette benchmark)
Top-1 Accuracy: 78.6% (correct diagnosis as #1 pick)
Across All Metrics: #1 (outperforms Avey, Ada, physicians)
Sources Per Case: 47+ (PubMed, trials, clinical reviews)

Comparative accuracy

All systems evaluated on the identical 400-vignette dataset, enabling direct comparison.

Top-1 Accuracy (correct diagnosis as the #1 pick)

Integrative Medicine AI: 78.6%
Avey (Bayesian): 67.5%
Physicians (avg): 61.2%
MedAsk (GPT-4o): 58.3%
Ada: 54.2%
K Health: 27.8%
Buoy: 26.0%
WebMD: 24.5%

Top-3 Accuracy (correct diagnosis within the first 3 picks)

Integrative Medicine AI: 91.7%
Avey (Bayesian): 87.3%
MedAsk (GPT-4o): 78.7%
Physicians (avg): 72.5%
Ada: 71.3%
WebMD: 40.7%
Buoy: 40.0%
K Health: 39.0%

Top-5 Accuracy (correct diagnosis within the first 5 picks)

Integrative Medicine AI: 91.7%
Avey (Bayesian): 90.0%
MedAsk (GPT-4o): 82.0%
Ada: 76.2%
Physicians (avg): 72.9%
WebMD: 50.2%
K Health: 41.5%
Buoy: 40.0%

Source: Hammoud et al. 2024 (JMIR AI), SymptomCheck Bench 2024. All systems evaluated on the identical 400 peer-reviewed clinical vignettes.

Where correct diagnoses land

The correct answer is the AI's first pick in 78.6% of cases and lands within its top three in 91.7%.

Rank 1: 78.6%
Rank 2: 9.6%
Rank 3: 3.5%
Missed: 8.3%
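
The headline accuracy figures follow directly from this rank distribution: top-k accuracy is the cumulative share of cases in which the correct diagnosis sits at rank k or better. A minimal sketch of that arithmetic, using only the percentages shown above (the names `rank_share` and `top_k_accuracy` are illustrative, not part of the product):

```python
# Share of cases (in percent) where the correct diagnosis landed at each rank,
# copied from the distribution above.
rank_share = {1: 78.6, 2: 9.6, 3: 3.5}
missed = 8.3


def top_k_accuracy(k: int) -> float:
    """Percent of cases with the correct diagnosis at rank k or better."""
    total = sum(share for rank, share in rank_share.items() if rank <= k)
    return round(total, 1)  # round away floating-point noise


print(top_k_accuracy(1))  # 78.6
print(top_k_accuracy(3))  # 91.7
print(round(top_k_accuracy(3) + missed, 1))  # 100.0
```

Because the distribution reports no correct diagnoses below rank 3, the same sum also gives 91.7% for top-5, which matches the Top-5 figure in the comparison.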

Honest about the edge cases

Several of the missed cases were clinically near-correct. We list them because transparency is the point.

Expected | AI's Top Guess
Pernicious Anemia | Vitamin B12 deficiency anemia
Sickle Cell Anemia | Sickle-cell disease with acute dactylitis
Infantile Meningitis | Acute bacterial meningitis
Typical Pneumonia | Community-acquired pneumonia
Diffuse Esophageal Spasm | Distal Esophageal Spasm
Bacterial Tonsillitis | Acute streptococcal pharyngitis
Acute Cholangitis | Acute Cholecystitis
Chronic Bronchitis | COPD exacerbation
Overflow Urinary Incontinence | Mixed urinary incontinence

Methodology & caveats

We show our work because that's what we expect from our own AI.

Each of the 397 valid vignettes was run through the full clinical pipeline using GPT-5.2. The correct diagnosis was matched against the AI's ranked hypotheses using a hybrid string + LLM-judge protocol to avoid false negatives from terminology mismatches.
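
To make the matching step concrete, here is a minimal sketch of what a hybrid string + judge check could look like. It is an illustration under stated assumptions, not the production protocol: `normalize`, `match_rank`, and `stub_judge` are hypothetical names, and the stub stands in for an LLM call that would decide whether two diagnosis labels refer to the same condition.

```python
import re
from typing import Callable, Optional, Sequence


def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def match_rank(
    expected: str,
    hypotheses: Sequence[str],
    judge: Callable[[str, str], bool],
) -> Optional[int]:
    """Return the 1-based rank of the first matching hypothesis, or None if missed."""
    target = normalize(expected)
    for rank, hypothesis in enumerate(hypotheses, start=1):
        if normalize(hypothesis) == target:  # cheap string check first
            return rank
        if judge(expected, hypothesis):  # fall back to the (hypothetical) judge
            return rank
    return None


def stub_judge(expected: str, hypothesis: str) -> bool:
    """Stand-in for an LLM judge: a tiny hard-coded synonym table (assumption)."""
    synonyms = {("heart attack", "myocardial infarction")}
    return (normalize(expected), normalize(hypothesis)) in synonyms


print(match_rank("Heart Attack", ["Angina", "Myocardial infarction"], stub_judge))  # 2
```

Running the string check first keeps judge calls for the cases where terminology actually differs, which is where false negatives would otherwise appear.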

Important caveat: our system receives full structured vignette data upfront, whereas conversational symptom checkers gather information through simulated dialogue. The most directly comparable baseline is the physician panel, who also received complete vignette information.
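
Since the physician panel is the closest like-for-like baseline, the gap to it is the most meaningful comparison. A short sketch of that arithmetic, using only the accuracy figures reported in the comparison above (the variable names are illustrative):

```python
# Accuracy figures (percent) copied from the comparative results.
integrative_ai = {"top1": 78.6, "top3": 91.7, "top5": 91.7}
physicians_avg = {"top1": 61.2, "top3": 72.5, "top5": 72.9}

# Gap in percentage points, rounded to hide floating-point noise.
margin = {
    metric: round(integrative_ai[metric] - physicians_avg[metric], 1)
    for metric in integrative_ai
}

print(margin)  # {'top1': 17.4, 'top3': 19.2, 'top5': 18.8}
```

These are differences in percentage points on the same vignette set, not relative improvements.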

Average processing time was 131 seconds per case across 4 LLM calls plus retrieval and web search.

The pipeline, per case

Seven stages from patient data to a validated clinical narrative.

  1. Patient data loaded
  2. Clinical findings extraction
  3. RAG knowledge base query
  4. Web search (supplemental)
  5. Diagnostic reasoning
  6. Clinical safety validation
  7. Final response generation
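
The stages above run in sequence, each one enriching the case before handing it on. The sketch below shows that control flow only: every stage is a placeholder that records its own name, and `make_stage`, `STAGES`, and `run_pipeline` are hypothetical names rather than the actual implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

Case = Dict[str, Any]
Stage = Tuple[str, Callable[[Case], Case]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran (assumption)."""
    def run(case: Case) -> Case:
        return {**case, "trace": case.get("trace", []) + [name]}
    return name, run


# Stage names taken from the list above; the bodies are stand-ins.
STAGES: List[Stage] = [
    make_stage(name)
    for name in [
        "Patient data loaded",
        "Clinical findings extraction",
        "RAG knowledge base query",
        "Web search (supplemental)",
        "Diagnostic reasoning",
        "Clinical safety validation",
        "Final response generation",
    ]
]


def run_pipeline(case: Case) -> Case:
    """Pass the case through every stage in order and return the enriched result."""
    for _name, stage in STAGES:
        case = stage(case)
    return case


result = run_pipeline({"vignette_id": 1})
print(len(result["trace"]))  # 7
```

Keeping each stage as a function from case to case makes it straightforward to time, log, or swap a single stage without touching the others.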

Questions about methodology? Let's talk.

Schedule a 30-minute demo and walk through a real clinical case with our team.