Resuscitation · Nov 2024

Suitability of GPT-4o as an evaluator of cardiopulmonary resuscitation skills examinations.

- Lu Wang, Yuqiang Mao, Lin Wang, Yujie Sun, Jiangdian Song, and Yang Zhang.
- Shengjing Hospital of China Medical University, Shenyang, Liaoning 110004, China; School of Health Management, China Medical University, Shenyang, Liaoning 110122, China.
- Resuscitation. 2024 Nov 1; 204: 110404.
Aim: To assess the accuracy and reliability of GPT-4o for scoring examinees' performance on cardiopulmonary resuscitation (CPR) skills tests.

Methods: This study included six experts certified to supervise the national medical licensing examination (three junior and three senior), who reviewed the CPR skills test videos of 103 examinees. All videos reviewed by the experts were also assessed automatically by GPT-4o. Both the experts and GPT-4o scored the videos across four sections: patient assessment, chest compressions, rescue breathing, and repeated operations. The experts subsequently rated GPT-4o's reliability on a 5-point Likert scale (1, completely unreliable; 5, completely reliable). GPT-4o's accuracy was evaluated as the agreement between its scores and those of the experts, using the intraclass correlation coefficient for the first three sections and Fleiss' Kappa for the last section.

Results: The mean accuracy scores for the patient assessment, chest compressions, rescue breathing, and repeated operations sections were 0.65, 0.58, 0.60, and 0.31, respectively, when comparing GPT-4o's scores with the junior experts' scores, and 0.75, 0.65, 0.72, and 0.41, respectively, when comparing GPT-4o's scores with the senior experts' scores. For reliability, the median Likert scale scores were 4.00 (interquartile range [IQR] = 3.66-4.33, mean [standard deviation] = 3.95 [0.55]) and 4.33 (4.00-4.67, 4.29 [0.50]) for the junior and senior experts, respectively.

Conclusions: GPT-4o demonstrated a level of accuracy that was similar to that of senior experts in examining CPR skills examination videos. The results demonstrate the potential for deploying this large language model in medical examination settings.

Copyright © 2024 The Author(s). Published by Elsevier B.V. All rights reserved.
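For readers unfamiliar with Fleiss' Kappa, the agreement statistic named in the methods for the repeated operations section, the following is a minimal sketch of the standard formula. The abstract does not say how the ratings were arranged or which software was used, so the function name `fleiss_kappa`, the pass/fail categories, and the toy counts below are all hypothetical and are not data from the study.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a subjects-by-categories table of rating counts.

    counts[i][j] is the number of raters who assigned subject i to
    category j. Every subject must be rated by the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    if n_subjects == 0 or n_raters < 2:
        raise ValueError("need at least one subject and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")

    # Share of all ratings that fell into each category.
    total = n_subjects * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement: for each subject, the fraction of rater pairs
    # that agree, averaged over subjects.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Agreement expected by chance from the category shares.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when one category holds all ratings")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy table: 5 videos, 4 raters, columns = [pass, fail].
toy_counts = [[4, 0], [3, 1], [2, 2], [0, 4], [4, 0]]
print(round(fleiss_kappa(toy_counts), 3))
```

A value of 1 means the raters agree on every subject, and a value near 0 means agreement no better than chance, which gives some context for the 0.31 and 0.41 reported for the repeated operations section.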
