Abstract
Introduction:
Artificial intelligence (AI) is currently being trialled for applications within medical education. Large language models (LLMs) can be prompted to respond as a virtual simulated patient with which a student interacts to practise history taking. Concerns about AI safety include a potential lack of diversity in the outputs of LLMs. The aim of this study was to determine whether there are gender-based inequalities in the output of GPT-4 (Generative Pre-trained Transformer 4), a popular LLM created by OpenAI, in the context of medical education.
Method:
Firstly, a literature review identified areas in which recent research has found evidence of gender-based inequalities in medical practice. Secondly, a prompt was generated that a medical student could use to practise history taking with GPT-4. Thirdly, the prompt was adapted to scenarios taken from the literature review. Tests were then run to determine the distribution of genders produced by GPT-4 across several scenarios. Further tests examined whether the history given to the student by the virtual simulated patient varied according to gender.
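The abstract does not specify the test harness, so the following is only a minimal sketch of how such a distribution test could be run. It assumes the official OpenAI Python client and SciPy; the scenario prompt, the `sample_patient_gender` helper, the sample size of 100, the keyword-based gender classifier, and the even expected split are all illustrative assumptions (the study presumably derived its expected distributions from the literature).

```python
# Minimal sketch (not the study's actual code): sample GPT-4 virtual patients
# for one scenario and compare the observed gender counts to an expected
# distribution with a chi-squared goodness-of-fit test.
from collections import Counter

from openai import OpenAI          # official OpenAI Python client
from scipy.stats import chisquare  # chi-squared goodness-of-fit test

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical scenario prompt of the kind adapted from the literature review.
SCENARIO_PROMPT = (
    "You are a simulated patient presenting to a GP with chest pain. "
    "Invent a realistic patient profile, then answer the student's questions. "
    "Begin by stating your name, age and gender."
)

def sample_patient_gender() -> str:
    """Ask GPT-4 to generate a virtual patient and report its gender."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SCENARIO_PROMPT},
            {"role": "user", "content": "Hello, could you introduce yourself?"},
        ],
    )
    text = response.choices[0].message.content.lower()
    # Naive keyword classification; a real study would classify more robustly.
    if "female" in text or "woman" in text:
        return "female"
    if "male" in text or "man" in text:
        return "male"
    return "unclassified"

counts = Counter(sample_patient_gender() for _ in range(100))
observed = [counts["female"], counts["male"]]
total = sum(observed)
expected = [total / 2, total / 2]  # assumed null hypothesis: an even split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"observed={observed}, chi2={stat:.2f}, p={p_value:.4f}")
```

A small p-value here would indicate that the generated genders deviate from the assumed expected distribution for that scenario, which is the pattern the abstract reports for most scenarios tested.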
Results:
For most scenarios tested, the gender distribution of the virtual simulated patients differed significantly from the expected distribution. The histories given by patients of different genders did not differ significantly. A suggested method for improving the diversity of LLM output was trialled with some success.
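The abstract does not describe the mitigation that was trialled. One common approach, shown here purely as an illustrative assumption and not necessarily the study's method, is to assign patient demographics programmatically rather than leaving the choice to the model; `build_scenario_prompt` and the binary gender split are hypothetical.

```python
# Illustrative sketch of one possible mitigation: fix the patient's gender in
# code, uniformly at random, instead of letting GPT-4 choose it.
import random

def build_scenario_prompt(base_scenario: str) -> str:
    """Prepend an explicitly assigned, randomly chosen gender to the prompt."""
    gender = random.choice(["female", "male"])  # binary split assumed for brevity
    return (
        f"You are a simulated patient. Your gender is {gender}. "
        f"{base_scenario} Stay consistent with this gender throughout."
    )
```

Moving the random choice out of the model guarantees the intended distribution by construction, so a distribution test like the one sketched above becomes a check on the harness rather than on GPT-4.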
Conclusions:
Educators looking to adopt innovations in AI should consider safety, equality, diversity and inclusion during the design phase of their projects.
| Original language | English |
|---|---|
| DOIs | |
| Publication status | Published - 15 Dec 2025 |
| Event | ASME Annual Scholarship Meeting 2025 - Edinburgh, United Kingdom |
| Duration | 1 Jul 2025 → 3 Jul 2025 |
Conference
| Conference | ASME Annual Scholarship Meeting 2025 |
|---|---|
| Country/Territory | United Kingdom |
| City | Edinburgh |
| Period | 1/07/25 → 3/07/25 |
| Internet address | https://www.asme.org.uk/events/asm2025/ |
Keywords
- Medical
- Education
- Undergraduate
- AI
- Bias