TY - JOUR
T1 - Affective-ROPTester
T2 - capability and bias analysis of LLMs in predicting retinopathy of prematurity
AU - Zhao, Shuai
AU - Zhang, Yulin
AU - Xiao, Luwei
AU - Wu, Xinyi
AU - Jia, Yanhao
AU - Guo, Zhongliang
AU - Wu, Xiaobao
AU - Nguyen, Cong Duy
AU - Zhang, Guoming
AU - Luu, Anh Tuan
PY - 2025/11/11
Y1 - 2025/11/11
N2 - Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low-, medium-, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results on the CROP dataset yield three principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing helps mitigate predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and underscore the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems. We hope this work will deepen the exploration of LLM capabilities for ROP prediction and contribute to advances in the healthcare community.
AB - Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low-, medium-, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results on the CROP dataset yield three principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing helps mitigate predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and underscore the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems. We hope this work will deepen the exploration of LLM capabilities for ROP prediction and contribute to advances in the healthcare community.
KW - Affective
KW - Chain-of-thought
KW - In-context learning
KW - Large Language Models
KW - Retinopathy of prematurity
U2 - 10.1109/TAFFC.2025.3631581
DO - 10.1109/TAFFC.2025.3631581
M3 - Article
AN - SCOPUS:105021497682
SN - 1949-3045
VL - Early Access
SP - 1
EP - 14
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
M1 - 11240127
ER -