Accuracy of Generative AI Chatbots in Answering Plastic Surgery Examination Questions: A Comparative Evaluation of ChatGPT-4o, Gemini Advanced, and DeepSeek-R1

    Jiaxian Zhang, Lei Huang, Xuanru Zhu, Yifan Sun, Yilong Guo, Yihong Rong, Hongwei Liu
    TLDR: AI chatbots can be reliable learning tools for plastic surgery students.
    This study evaluated the accuracy of three AI chatbots—ChatGPT-4o, Gemini Advanced, and DeepSeek-R1—in answering 100 plastic surgery examination questions from Chinese universities, covering 10 subspecialties. The chatbots' responses were compared against authoritative textbook content, with accuracy scored on a 5-point Likert scale. All three chatbots achieved mean scores above 3.5, and DeepSeek-R1 scored above 4.0 across all subspecialties; no significant differences in accuracy were found among them. The study suggests that AI chatbots can be reliable learning tools for plastic surgery students, though their responses to controversial topics require further validation. Limitations include the regional focus of the questions and the single-round questioning approach. Future research should explore the long-term educational benefits and ethical considerations of using AI chatbots in medical education.