YouTube has become a dominant source of medical information and a major influence on health-related decision-making. Yet many videos on the platform contain inaccurate or biased information. Although expert reviews could help mitigate this problem, the vast number of daily uploads makes that solution impractical. In this study, we explored the potential of Large Language Models (LLMs) to assess the quality of medical content on YouTube. We collected a set of videos previously evaluated by experts and prompted twenty models to rate their quality using the DISCERN instrument. We then analyzed the inter-rater agreement between the language models' and the experts' ratings using Brennan–Prediger's (BP) kappa. We found that the LLMs exhibited a wide range of agreement with the experts (from −1.10 to 0.82). All models tended to give higher scores than the human experts. Agreement on individual questions tended to be lower, with some questions showing significant disagreement between models and experts. Including scoring guidelines in the prompt improved model performance. We conclude that some LLMs are capable of evaluating the quality of medical videos. Used as stand-alone expert systems or embedded into traditional recommender systems, these models could help mitigate the quality problems of health-related online videos.
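For reference, the Brennan–Prediger coefficient compares observed agreement against a chance level fixed by the number of rating categories; a minimal statement of the unweighted form, assuming q categories (e.g., q = 5 for DISCERN's five-point items), is

\kappa_{BP} = \frac{p_o - 1/q}{1 - 1/q},

where p_o is the proportion of items on which the two raters (here, an LLM and the expert consensus) give the same rating. Weighted variants, which credit near-misses on the ordinal scale, are also common and may be what the full paper reports.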