Measured against the standard — and ahead of it
400 peer-reviewed clinical vignettes. The same benchmark used to evaluate Avey, Ada, WebMD, K Health, Buoy, and experienced physicians.
- Top-3 Diagnostic Accuracy: 91.7% (Hammoud et al. 400-vignette benchmark)
- Top-1 Accuracy: 78.6% (correct diagnosis as #1 pick)
- Across All Metrics: #1 (outperforms Avey, Ada, physicians)
- Sources Per Case: 47+ (PubMed, trials, clinical reviews)
Comparative accuracy
All systems evaluated on the identical 400-vignette dataset, enabling direct comparison.
- Top-1 Accuracy: correct diagnosis as the #1 pick
- Top-3 Accuracy: correct diagnosis within the first 3 picks
- Top-5 Accuracy: correct diagnosis within the first 5 picks
Source: Hammoud et al. 2024 (JMIR AI), SymptomCheck Bench 2024. All systems evaluated on the identical 400 peer-reviewed clinical vignettes.
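The top-k definitions above can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the function name `top_k_accuracy`, the exact-match comparison, and the toy cases are all assumptions made for the example.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], expected: Sequence[str], k: int
) -> float:
    """Fraction of cases whose expected diagnosis is among the first k picks."""
    if len(ranked) != len(expected):
        raise ValueError("ranked and expected must have the same length")
    if not expected:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for picks, truth in zip(ranked, expected)
        # Case-insensitive exact match; a real scorer needs synonym handling.
        if truth.casefold() in (p.casefold() for p in picks[:k])
    )
    return hits / len(expected)


# Toy data for illustration only, not taken from the benchmark.
ranked = [
    ["migraine", "tension headache", "cluster headache"],
    ["gastritis", "peptic ulcer", "GERD"],
    ["asthma", "bronchitis", "pneumonia"],
    ["cystitis", "pyelonephritis", "urethritis"],
]
expected = ["migraine", "GERD", "pneumonia", "appendicitis"]

print(top_k_accuracy(ranked, expected, 1))  # 0.25
print(top_k_accuracy(ranked, expected, 3))  # 0.75
```

Top-k accuracy can only stay flat or rise as k grows, which is why the top-3 figure is always at least the top-1 figure.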
Where correct diagnoses land
When the AI finds the correct diagnosis, it is usually the first pick: 78.6% of cases are correct at #1, against 91.7% within the top three.
Honest about the edge cases
Of the missed cases, these were clinically near-correct. We show them because transparency is the point.
| Expected | AI's Top Guess | Clinical Relationship |
|---|---|---|
| Pernicious Anemia | Vitamin B12 deficiency anemia | Pernicious anemia is a type of B12 deficiency |
| Sickle Cell Anemia | Sickle-cell disease with acute dactylitis | Same disease, specific presentation |
| Infantile Meningitis | Acute bacterial meningitis | Same condition, age-specific naming |
| Typical Pneumonia | Community-acquired pneumonia | Standard vs. classification naming |
| Diffuse Esophageal Spasm | Distal Esophageal Spasm | Subtype distinction |
| Bacterial Tonsillitis | Acute streptococcal pharyngitis | Overlapping anatomical region |
| Acute Cholangitis | Acute Cholecystitis | Adjacent biliary diagnoses |
| Chronic Bronchitis | COPD exacerbation | COPD encompasses chronic bronchitis |
| Overflow Urinary Incontinence | Mixed urinary incontinence | Incontinence subtype distinction |
Methodology & caveats
We show our work because that's what we expect from our own AI.
Each of the 397 valid vignettes was run through the full clinical pipeline using GPT-5.2. The correct diagnosis was matched against the AI's ranked hypotheses using a hybrid string + LLM-judge protocol to avoid false negatives from terminology mismatches.
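One way a hybrid string + LLM-judge match could work is sketched below, under stated assumptions. The names `normalize`, `first_match_rank`, and `toy_judge` are hypothetical, and the judge here is a hand-written synonym lookup standing in for a real LLM call; the actual protocol's prompts and thresholds are not described in this page.

```python
import re
from typing import Callable, Optional, Sequence

# A judge takes (expected, hypothesis) and says whether they name the same diagnosis.
Judge = Callable[[str, str], bool]


def normalize(name: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def first_match_rank(
    expected: str, hypotheses: Sequence[str], judge: Judge
) -> Optional[int]:
    """Return the 1-based rank of the first matching hypothesis, or None."""
    target = normalize(expected)
    for rank, hypothesis in enumerate(hypotheses, start=1):
        if normalize(hypothesis) == target:  # cheap string pass first
            return rank
        if judge(expected, hypothesis):  # fallback for terminology mismatches
            return rank
    return None


def toy_judge(expected: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge: a fixed synonym table."""
    synonyms = {("heart attack", "myocardial infarction")}
    pair = (normalize(expected), normalize(hypothesis))
    return pair in synonyms or pair[::-1] in synonyms


print(first_match_rank("Heart attack", ["Unstable angina", "Myocardial infarction"], toy_judge))  # 2
print(first_match_rank("GERD", ["gerd."], toy_judge))  # 1
```

Running the string pass first keeps the judge out of the easy cases, so the more expensive call is only made when the names differ.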
Important caveat: our system receives full structured vignette data upfront, whereas conversational symptom checkers gather information through simulated dialogue. The most directly comparable baseline is the physician panel, who also received complete vignette information.
Average processing time was 131 seconds per case across 4 LLM calls plus retrieval and web search.
The pipeline, per case
Seven stages from patient data to a validated clinical narrative.
1. Patient data loaded
2. Clinical findings extraction
3. RAG knowledge base query
4. Web search (supplemental)
5. Diagnostic reasoning
6. Clinical safety validation
7. Final response generation
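The seven stages above can be read as a sequential chain in which each stage receives the previous stage's output. The sketch below shows only that shape: the stage names come from the list, while `passthrough`, `run_pipeline`, and the sample case are hypothetical placeholders, not the production implementation.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Stage = Callable[[Context], Context]

# Stage names copied from the list above; the stage bodies are placeholders.
STAGE_NAMES: List[str] = [
    "Patient data loaded",
    "Clinical findings extraction",
    "RAG knowledge base query",
    "Web search (supplemental)",
    "Diagnostic reasoning",
    "Clinical safety validation",
    "Final response generation",
]


def passthrough(ctx: Context) -> Context:
    """Hypothetical placeholder: a real stage would add its output to ctx."""
    return ctx


def run_pipeline(
    case: Dict[str, object], stages: List[Tuple[str, Stage]]
) -> Tuple[Context, List[str]]:
    """Run each stage in order, feeding its output to the next one."""
    ctx: Context = {"case": case}
    trace: List[str] = []
    for name, stage in stages:
        ctx = stage(ctx)
        trace.append(name)  # record execution order for auditing
    return ctx, trace


stages = [(name, passthrough) for name in STAGE_NAMES]
ctx, trace = run_pipeline({"age": 54, "symptoms": ["chest pain"]}, stages)
print(len(trace))  # 7
```

Keeping a trace of executed stages makes it straightforward to attribute per-case latency or failures to a specific step.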
Questions about methodology? Let's talk.
Schedule a 30-minute demo and walk through a real clinical case with our team.