Abstract
Medical schools face new pressures from rising workloads, expanding student populations, and budget constraints, driving the search for automated assessment solutions. While automation efforts have thus far focused on the Single Best Answer format, this study investigates the potential for generative AI to assess longer pieces of written work by evaluating its reliability and alignment with human markers.
We developed a custom GPT-4-based marking system to evaluate three published critical reviews from open-access academic journals, serving as proxies for student dissertations. Each “dissertation” was marked 100 times. We analysed internal reliability using Cronbach’s alpha, stability through linear regression, specificity with one-way ANOVA, and precision by comparing standard deviations to historical human marking data.
The results demonstrated that the AI marker achieved robust internal reliability (Cronbach’s alpha > 0.8) across all three “dissertations”. Linear regression revealed that the AI did not become more ‘dovish’ or ‘hawkish’ over repeated markings (R² < 0.001), and ANOVA confirmed that marks were specific to each “dissertation” (p < 0.001). Additionally, the variability in AI-generated marks was comparable to that of two independent human markers, indicating acceptable precision.
These findings suggest that AI marking systems could reliably score written work, offering potential benefits such as cost and time savings, as well as enhanced student confidence through universal double-marking. However, the accuracy of AI-generated scores relative to human markers remains to be established.
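The statistical analyses described above are straightforward to reproduce. The following Python is an illustrative sketch only, using hypothetical per-criterion scores (the study's actual rubric, criterion count, and mark distributions are not given here); it outlines how Cronbach's alpha, a drift regression, a one-way ANOVA, and mark variability could be computed from 100 repeated markings of three "dissertations".

```python
import numpy as np
from scipy import stats

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_runs, n_criteria) score matrix."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

rng = np.random.default_rng(0)

# Hypothetical data: 100 repeated markings x 5 rubric criteria for each of
# three "dissertations" (A, B, C); a shared run-level signal makes the
# criteria correlate, as rubric domains typically do.
runs = {}
for name, mean_mark in zip("ABC", (62, 68, 71)):
    run_quality = rng.normal(mean_mark, 3.0, size=(100, 1))
    runs[name] = run_quality + rng.normal(0.0, 1.5, size=(100, 5))

for name, scores in runs.items():
    totals = scores.mean(axis=1)    # overall mark for each marking run
    alpha = cronbach_alpha(scores)  # internal reliability across criteria
    # Stability: regress the mark against run order; R^2 near 0 means the
    # marker did not drift 'dovish' or 'hawkish' over repeated markings.
    reg = stats.linregress(np.arange(100), totals)
    print(f"{name}: alpha={alpha:.2f}, drift R^2={reg.rvalue**2:.4f}, "
          f"SD={totals.std(ddof=1):.2f}")

# Specificity: one-way ANOVA tests whether the three dissertations
# receive distinguishably different marks.
f_stat, p_val = stats.f_oneway(*(s.mean(axis=1) for s in runs.values()))
print(f"one-way ANOVA: F={f_stat:.1f}, p={p_val:.2e}")
```

Precision would then be assessed by comparing the per-dissertation standard deviations printed above against the spread of historical human marks, as the abstract describes.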
| Original language | English |
|---|---|
| Pages | 7 |
| Publication status | Published - 15 Dec 2025 |
| Event | ASME Annual Scholarship Meeting 2025, Edinburgh, United Kingdom, 1 Jul 2025 → 3 Jul 2025 |
Conference
| Conference | ASME Annual Scholarship Meeting 2025 |
|---|---|
| Country/Territory | United Kingdom |
| City | Edinburgh |
| Period | 1/07/25 → 3/07/25 |
| Internet address | https://www.asme.org.uk/events/asm2025/ |
Keywords
- Assessment
- Coursework
- Artificial intelligence
- AI
- GenAI