Abstract
This research aims to employ an artificial intelligence (AI) large language model (LLM) to generate valid single best answer (SBA) exam questions for undergraduate medical students. The objective is to design a prompt that generates SBA questions which can be quality-assured using established methods to confirm their validity; this would enable rapid replenishment of assessment banks depleted by Covid-era open-book exams and provide students with more formative assessments.
Methods A commercially available LLM (OpenAI GPT-4) was prompted to generate 200 SBA questions based on Medical Schools Council guidance and Scottish Graduate-Entry Medicine (ScotGEM) Learning Outcomes (LOs). The questions were screened to ensure they conformed to the guidelines and LOs before a subset was included in an examination, sat by students, alongside an equal number of human-authored questions. Facility and discrimination indices were calculated for each item, and the performance of AI- and human-authored questions was compared.
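For readers unfamiliar with these item statistics, the sketch below shows one common way to compute item facility (the proportion of candidates answering an item correctly) and a classical upper/lower-group discrimination index from a 0/1 response matrix. The function name, the 27% tail convention and the toy data are illustrative assumptions; the paper does not publish its analysis code.

```python
# Minimal sketch (not the authors' analysis code): item facility and a
# classical upper/lower-group discrimination index from a 0/1 score matrix.
import numpy as np

def item_statistics(responses: np.ndarray, tail: float = 0.27):
    """responses: candidates x items matrix of 0/1 scores."""
    facility = responses.mean(axis=0)        # proportion answering each item correctly

    totals = responses.sum(axis=1)           # each candidate's total score
    order = np.argsort(totals)               # candidates ranked from lowest to highest
    n_tail = max(1, int(round(tail * len(totals))))
    lower, upper = order[:n_tail], order[-n_tail:]

    # Discrimination: difference in item facility between the high- and
    # low-scoring groups (one common definition among several).
    discrimination = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)
    return facility, discrimination

# Toy example: 6 candidates, 3 items (illustrative data only)
scores = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])
fac, disc = item_statistics(scores)
print(fac, disc)
```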
Results Most AI-generated SBAs were exam-ready with little or no modification. Adjustments were made to correct, for example, the inclusion of ‘all of the above’ answers, American spellings and non-alphabetised options.
Statistical analysis showed no significant difference between AI- and human-authored questions in terms of facility and discrimination index.
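The abstract does not state which statistical test was used for this comparison; purely as a hedged illustration, the snippet below shows one plausible way such a comparison could be run, e.g. a two-sided Mann-Whitney U test on item facility values. The values shown are invented for illustration and are not the study's data.

```python
# Illustrative only: comparing facility values of AI- vs human-authored items
# with a Mann-Whitney U test (the paper does not specify its test; the same
# pattern would apply to discrimination indices).
from scipy.stats import mannwhitneyu

ai_facility = [0.72, 0.65, 0.81, 0.58, 0.90]      # hypothetical values
human_facility = [0.70, 0.68, 0.77, 0.62, 0.85]   # hypothetical values

stat, p_value = mannwhitneyu(ai_facility, human_facility, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```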
Conclusion LLMs can produce questions adhering to best-practice guidelines and relevant LOs, though a quality-assurance process is needed to ensure proper formatting and alignment. Future work will refine AI prompts for more curriculum-specific question alignment.
| Original language | English |
| --- | --- |
| Journal | The Clinical Teacher |
| Volume | 21 |
| Issue number | S2 |
| DOIs | |
| Publication status | Published - 12 Nov 2024 |