Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery

Christoph Raphael Buhr*, Benjamin Philipp Ernst, Andrew Blaikie, Harry Smith, Tom Kelsey, Christoph Matthias, Maximilian Fleischmann, Florian Jungmann, Jürgen Alt, Christian Brandts, Peer W. Kämmerer, Sebastian Foersch, Sebastian Kuhn, Jonas Eckrich

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Introduction
Tumor boards are a cornerstone of modern cancer treatment. Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL) head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared standard human multidisciplinary tumor board (MDT) recommendations against those of a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3), the latter addressing data protection concerns.

Material and methods
Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and MDT’s recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations.

Results
ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 showed 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. ChatGPT-4o identified all first-line therapy options considered by the MDT in 64% of cases (16/25), and Llama 3 in 60% of cases (15/25), though with varying priority. ChatGPT-4o presented all the MDT’s first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4–6) for ChatGPT-4o and 4.3 (IQR: 3–5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially have enhanced the MDT's decisions.

Discussion
This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. However, at present, LLMs should augment rather than replace human decision-making.
Original language: English
Pages (from-to): 1593-1607
Number of pages: 15
Journal: European Archives of Oto-Rhino-Laryngology
Volume: 282
Issue number: 3
Early online date: 10 Jan 2025
DOIs
Publication status: Published - 1 Mar 2025

Keywords

  • Large language models
  • LLM
  • Artificial intelligence
  • AI
  • ChatGPT
  • Llama
  • Otorhinolaryngology
  • ORL
  • Head and neck
  • Digital health
  • Chatbot
  • Language model

