Abstract
Medical education faces challenges from growing student populations, diverse learning needs, and demand for high-quality clinical experiences. Large Language Models (LLMs) such as GPT-4 can augment clinical interactions by providing real-time diagnostic assistance. However, the accuracy and fairness of these AI tools in diverse settings, especially in low- and middle-income countries (LMICs), remain insufficiently explored. This study evaluates GPT-4’s diagnostic accuracy for common ophthalmic conditions and examines the implications of proprietary prompts and inherent biases for equitable learning.
We developed two prompts to assess ten ophthalmic conditions relevant to both global and LMIC contexts: (1) a simple prompt and (2) a complex prompt enriched with comprehensive ophthalmic knowledge. Each condition was presented to GPT-4 one hundred times per prompt, totalling 2,000 diagnostic attempts. Sensitivity, specificity, positive predictive value, and negative predictive value were calculated for each condition. A Chi-Square Test of Independence assessed differences in diagnostic accuracy between prompts.
Results showed that the complex prompt significantly improved diagnostic accuracy over the simple prompt (90.1% vs. 60.4%; χ² = 428.858, P < 0.01). Both prompts diagnosed globally prevalent conditions effectively, but the complex prompt performed better for LMIC-specific diseases. These findings suggest that prompt engineering can improve the reliability of AI assistants in medical education and reduce their biases. Risks remain, however: scarce LMIC-specific training data can cause inaccuracies, and reliance on proprietary information may exacerbate digital inequalities.
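The metrics named in the methods (sensitivity, specificity, PPV, NPV, and a 2×2 Chi-Square Test of Independence) can be sketched as below. The counts are hypothetical illustrations, not the study's data, and the helper names are our own.

```python
# Sketch of the evaluation metrics described above, using HYPOTHETICAL
# counts (not the study's data). For one condition, tally responses
# against ground truth, derive the four metrics, and run a 2x2
# chi-square test of independence comparing two prompts.

def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]] (e.g. correct/incorrect counts by prompt)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical example: 100 attempts per prompt for one condition.
metrics = diagnostic_metrics(tp=90, fp=5, fn=10, tn=95)
chi2 = chi_square_2x2(90, 10, 60, 40)  # complex vs. simple prompt
```

In practice the p-value would be read from the chi-square distribution with one degree of freedom (e.g. via `scipy.stats.chi2_contingency`); the manual formula above is shown only to make the computation explicit.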
| Original language | English |
|---|---|
| DOIs | |
| Publication status | Published - 15 Dec 2025 |
| Event | ASME Annual Scholarship Meeting 2025, Edinburgh, United Kingdom, 1 Jul 2025 → 3 Jul 2025, https://www.asme.org.uk/events/asm2025/ |
Conference
| Conference | ASME Annual Scholarship Meeting 2025 |
|---|---|
| Country/Territory | United Kingdom |
| City | Edinburgh |
| Period | 1/07/25 → 3/07/25 |
| Internet address | https://www.asme.org.uk/events/asm2025/ |
Keywords
- GenAI
- Artificial intelligence
- Ophthalmology
- AI
Title
Assessing GPT-4’s diagnostic accuracy and bias in an ophthalmology educational assistant