A Cross-Sectional Descriptive Study of Comparative Accuracy of ChatGPT, Google Gemini, and Microsoft Copilot in Solving NEET PG Medical Entrance Test

Abstract

Background: Artificial intelligence chatbots (AI chatbots) can help medical students prepare for and pass various exams, but existing literature reports that their accuracy varies. The present study aims to fill this gap by comparing the accuracy of three AI chatbots. Objectives: The primary objective was to assess and compare the accuracy of ChatGPT-4, Google Gemini, and Microsoft Copilot in solving the NEET PG 2023 exam. The secondary objective was to compare their accuracy across question types and medical subjects. Methods: All 200 questions of the NEET PG 2023 exam paper were presented verbatim to the three AI chatbots. Accuracy was assessed as the percentage of questions answered correctly. We compared overall accuracy, as well as accuracy by question type and subject taxonomy. Results: The accuracy of Microsoft Copilot was 82.5%, ChatGPT 80.5%, and Google Gemini 77.5% (chi-square = 1.6, p = 0.4). The performance of the three AI chatbots did not differ across subjects (chi-square = 2.7, p = 0.9) or question types (chi-square = 0.35, p = 0.9). Conclusion: All three AI chatbots showed good accuracy in solving NEET PG exam questions, with no significant difference between them overall, by subject taxonomy, or by question type.
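The overall comparison in the Results is consistent with a Pearson chi-square test of independence on a 3 × 2 table (chatbot × correct/incorrect). The minimal Python sketch below rebuilds that table under an assumption the abstract does not state explicitly: each chatbot answered all 200 questions, so the correct counts are 165, 161, and 155. It illustrates the test; it is not the authors' analysis code.

```python
import math

# Assumption (not stated in the abstract): every chatbot answered all 200
# questions, so correct answers = accuracy x 200.
N_QUESTIONS = 200
accuracy = {"Microsoft Copilot": 0.825, "ChatGPT": 0.805, "Google Gemini": 0.775}


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of chatbot and correctness.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows are chatbots; columns are [correct, incorrect].
table = []
for acc in accuracy.values():
    correct = round(acc * N_QUESTIONS)
    table.append([correct, N_QUESTIONS - correct])

stat, dof = chi_square_statistic(table)
if dof != 2:
    raise ValueError("the closed-form p-value below is valid only for 2 degrees of freedom")
# With 2 degrees of freedom, the chi-square survival function is exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"table = {table}")
print(f"chi-square = {stat:.2f}, df = {dof}, p = {p_value:.2f}")
```

Run as a script, it yields a chi-square of about 1.6 with 2 degrees of freedom, in line with the abstract, and a p-value well above 0.05. The closed form exp(-x/2) only holds for 2 degrees of freedom; a general version would use a chi-square survival function such as `scipy.stats.chi2.sf`.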

Similar Papers
  • Research Article
  • 10.22275/see.22.2.02
Effects of Teacher Question Types on Developing L2 Learners’ English Ability and Creativity
  • Jun 30, 2017
  • Studies in English Education
  • Jee Hyun Ma

This study explores whether different types of teacher questions influence the growth of creativity and English ability among Korean middle school learners of English. One hundred two middle school students participated and were assigned to either an experimental group (n=51) or a control group (n=51). Over five weeks, the participants completed reading tasks accompanied by different types of teacher questions. The experimental group received fat questions, which encourage deep and creative thinking, while the control group received skinny questions, which have a definite answer and thus restrict discussion and debate. All participants completed pre- and post-tests of English ability and of creativity, as well as free writing describing their opinions of and feelings about the experiment. The results showed no significant difference in English ability between the two groups after the experiment. This implies that the two types of teacher questions did not produce a significant difference in English test scores over the experimental period, although many students in the experimental group commented positively on the activity. As for creativity, the experimental group displayed significantly higher levels than the control group after receiving their teacher's fat questions, suggesting that L2 learners' creativity can be enhanced within a rather short period with the help of strategic teaching methods such as varying question types.

  • Research Article
  • Cited by 2
  • 10.1016/j.sbspro.2014.03.419
The Relationship between Iranian EFL Students’ Brain Dominant Quadrants and Reading Comprehension Skill
  • May 1, 2014
  • Procedia - Social and Behavioral Sciences
  • Hamid Ashraf + 2 more

  • Research Article
  • Cited by 6
  • 10.1108/el-05-2020-0120
Spam detection and high-quality features to analyse question–answer pairs
  • Nov 26, 2020
  • The Electronic Library
  • Hei Chia Wang + 2 more

Purpose: In community question answering (CQA) services, user subjectivity and limited knowledge mean that answer quality can vary drastically, from highly relevant answers to irrelevant or even spam answers. Previous studies of CQA portals have faced two important issues: answer quality analysis and spam answer filtering. The purposes of this study are therefore to filter out spam answers in advance using a two-phase identification method and then to classify the different types of question and answer (QA) pairs automatically with deep learning. Finally, the study provides a comprehensive analysis of answer quality prediction for different types of QA pairs. Design/methodology/approach: This study proposes an integrated model with a two-phase identification method that filters spam answers in advance and uses a deep learning method, a recurrent convolutional neural network (R-CNN), to classify question types automatically. Logistic regression (LR) is then applied to examine which answer quality features significantly indicate high-quality answers to different types of questions. Findings: There are four prominent findings. (1) Spam filtering before answer quality analysis reduces the proportion of high-quality answers that are misjudged as spam. (2) The experimental results show that answer quality is better when question types are included. (3) Among the classifiers compared, the R-CNN achieves the best macro-F1 score (74.8%) in the question type classification module. (4) The LR results show that author ranking, answer length, and common words significantly affect answer quality for different types of questions. Originality/value: The proposed system can simultaneously detect spam answers and give users a quick, efficient way to retrieve high-quality answers to different types of questions in CQA. Moreover, the study validates that, for each type of question, there are crucial features that affect answer quality. Overall, the identification system automatically summarises high-quality answers for each type of question from the pool of messy CQA answers, which can help users make decisions.
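Macro-F1, the metric reported for the R-CNN question-type classifier, is the unweighted mean of the per-class F1 scores, so rare question types count as much as common ones. A small Python sketch of the metric, using hypothetical question-type labels (the abstract does not list the authors' categories); it is not the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gets a false positive
            fn[truth] += 1  # the true class gets a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Hypothetical question-type labels for four QA pairs.
y_true = ["factoid", "factoid", "opinion", "how-to"]
y_pred = ["factoid", "opinion", "opinion", "how-to"]
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Here "factoid" and "opinion" each score an F1 of 2/3 and "how-to" scores 1, so the macro average is 7/9. A micro average would instead pool all counts and let frequent classes dominate.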

  • Research Article
  • 10.1152/physiol.2023.38.s1.5733982
Open-book, open-web versus closed-book, closed-web examinations in anatomy and physiology classes
  • May 1, 2023
  • Physiology
  • Lauren Milam + 2 more

Because of COVID-19 and distance learning, open-book examinations became a possible form of evaluation. We tested the hypothesis that students taking open-book, open-web exams would spend less time studying anatomy and physiology and would perform more poorly on critical-thinking exam questions than students taking closed-book, closed-web exams. We tested exam performance using different types of exam questions classified according to Bloom's taxonomy. In addition, students who participated in the research submitted study journals recording the time they spent studying for each exam. Data have been collected for both the open-book, open-web and closed-book, closed-web sections across Exams #1, #2, #3, and the Final. Data from Exams #1 and #2 have been fully analyzed for both sections. In the open-book section, students' performance across five types of questions declined as question difficulty increased on both Exams #1 and #2. In the closed-book section, performance did not decline with increasing difficulty on Exam #1, and on Exam #2 it declined less steeply than in the open-book section. For both exams, study time differed significantly between sections: the average study time of the closed-book section was significantly greater than that of the open-book section. These results suggest that students in the open-book section handle challenging questions more poorly than those in the closed-book section while also studying significantly less. Data for Exams #1 through the Final have been collected for both sections and will be analyzed and included for further comparison. This research was supported and funded by the Department of Biological Sciences at the University of Tennessee at Martin.
This is the full abstract presented at the American Physiology Summit 2023 meeting and is only available in HTML format. There are no additional versions or additional content available for this abstract. Physiology was not involved in the peer review process.

  • Research Article
  • Cited by 13
  • 10.1186/s13673-016-0072-3
Evaluating smartphone-based dynamic security questions for fallback authentication: a field study
  • Sep 5, 2016
  • Human-centric Computing and Information Sciences
  • Yusuf Albayram + 1 more

To address the limitations of static challenge-question-based fallback authentication mechanisms (e.g., easy predictability), smartphone-based autobiographical authentication mechanisms have recently been explored, in which challenge questions are not predetermined but are generated dynamically from users' day-to-day activities captured by their smartphones. However, because answering different types and styles of questions is likely to require different amounts of cognitive effort and to affect users' performance, a thorough study is needed of how the type and style of challenge questions and the answer selection mechanism affect users' recall performance and the usability of such systems. To that end, this paper explores seven types of challenge questions generated from users' smartphone usage data. For evaluation, we conducted a 30-day field study with 24 participants, recruited in pairs to simulate different kinds of adversaries (e.g., close friends, significant others). Our findings suggest that question type has a significant effect on user performance. Furthermore, to address variation in users' accuracy across sessions and question types, we present a Bayesian-classifier-based authentication algorithm that can authenticate legitimate users with high accuracy by leveraging individual response patterns.

  • Conference Article
  • Cited by 4
  • 10.1109/ipcc.2007.4464073
On the Role of Visuals in Multimodal Answers to Medical Questions
  • Oct 1, 2007
  • Charlotte Van Hooijdonk + 5 more

This paper describes two experiments investigating the role of visuals in multimodal answer presentations for a medical question answering system. First, a production experiment was carried out to determine which modalities people choose to answer different types of questions. In this experiment, participants had to create (multimodal) presentations of answers to general medical questions. The collected answer presentations were coded for the presence of visual media (i.e., photos, graphics, and animations) and their function. The results indicated that participants presented the information in a multimodal way. Moreover, significant differences were found in the presentation of different answer and question types. Next, an evaluation experiment was conducted to investigate how users evaluate different types of multimodal answer presentations. In this second experiment, participants had to assess the informativity and attractiveness of answer presentations for different types of medical questions. These answer presentations, originating from the production experiment, were manipulated in answer length (brief vs. extended) and type of picture (illustrative vs. informative). After assessing the answer presentations, participants received a post-test in which they had to indicate how much they recalled from them. The results showed that answer presentations with an informative picture were evaluated as more informative and more attractive than answer presentations with an illustrative picture. The post-test results tentatively indicated that learning from answer presentations with an informative picture leads to better learning performance than learning from purely textual answer presentations.

  • Research Article
  • 10.20533/ijcdse.2042.6364.2015.0319
What Students Think and How They Really Perform in Chemistry
  • Dec 1, 2015
  • International Journal for Cross-Disciplinary Subjects in Education
  • Ross Hudson

This research was part of a larger study of student performance in senior chemistry with regard to question type and content. This paper examines student perceptions of question type and context and compares these perceptions with actual performance. The study focused on how students perceive different types of questions and how these perceptions influence their self-belief and motivation. Student responses to different styles or types of questions have been well researched over time. In this study, Year 11 chemistry students were asked about their preferences for multiple-choice questions (MCQs) and open-response questions, and about how the presence of each type was likely to influence their test performance. Students' perceptions were then correlated with their actual performance on sample chemistry tests. Students generally preferred MCQs and believed they were likely to perform better on them regardless of the topic. Test results did not always support this confidence. Suggestions for further research are also made.

  • Research Article
  • 10.24911/sjemed.72-1740484871
Ai Chatbots In Emergency Medicine: Analyzing Agreement with Expert Physician Triage Decisions
  • Jan 1, 2025
  • Saudi Journal of Emergency Medicine
  • Ahmad Aalam

Introduction: In emergency medicine, accurate triage is vital for patient outcomes and resource management. The Canadian Triage and Acuity Scale (CTAS) has been essential in training healthcare providers to make prompt and precise triage decisions.1 With the rise of artificial intelligence (AI), chatbots are being considered for their potential to support or even replace human decision-making in various medical situations. This study aims to assess the agreement between AI chatbot triage decisions and those made by experienced emergency physicians using CTAS. Methods: This study compared an AI chatbot with two expert emergency physicians, each with over ten years of experience. We used a dataset of 60 emergency case scenarios that have been used for over 8 to 10 years to train medical personnel at the start of their careers.1 The AI chatbot received training materials on CTAS and triage before being asked to assign an appropriate triage level to each scenario. Meanwhile, the expert physicians independently triaged the same cases. Scenarios in which the two experts disagreed on the triage level were excluded, leaving 35 case scenarios for the final analysis. Agreement between the AI chatbot and the expert physicians was evaluated with Cohen's kappa coefficient, together with its p-value and 95% confidence interval (CI), to assess the statistical significance and reliability of the agreement. Results: The Cohen's kappa coefficient between the AI chatbot and the expert physicians was 0.721, indicating a substantial level of agreement. The p-value was
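Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of matching labels and p_e is the sum, over categories, of the product of the two raters' marginal proportions. The sketch below computes it for two raters; the eight CTAS levels are made-up values for illustration, not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical CTAS levels (1 = resuscitation ... 5 = non-urgent) for eight cases.
chatbot = [1, 2, 2, 3, 3, 4, 5, 3]
experts = [1, 2, 3, 3, 3, 4, 5, 2]
print(f"kappa = {cohens_kappa(chatbot, experts):.3f}")
```

For these hypothetical labels, observed agreement is 0.75 and chance agreement is 0.25, giving a kappa of 2/3. The p-value and 95% CI that the study also reports require a standard-error estimate, which this sketch does not compute.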

  • Book Chapter
  • Cited by 1
  • 10.1007/978-981-13-7150-9_21
Impact of Ambiguity: Wh-Questions Versus Other Questions in Question Paper Translation
  • Jan 1, 2019
  • Shweta Vikram + 1 more

Word sense disambiguation (WSD) is a prominent area of research in linguistics, and much work has been done to resolve ambiguity in natural-language sentences. If a sentence contains an ambiguous word, its meaning may vary with context; when the meaning cannot be inferred appropriately from context, WSD algorithms are used to resolve the ambiguity. This paper discusses ambiguity in the translation of question papers by various machine translation (MT) tools. In our experiment, we collected different types of questions to analyze the impact of ambiguity on wh-questions compared with other questions (objective, match, fill-in-the-blank, and keyword-specific). Some machine translators fail to recognize different question types and treat them as ordinary questions or sentences. We translated five types of English questions into Hindi using five standard online/offline translators, with the aim of analyzing how ambiguity affects the translations. The experiment used 150 questions of different types, and the results suggest that translations were generally best for objective questions, while keyword-specific questions (such as discuss, explain, etc.) were translated poorly.

  • Research Article
  • Cited by 40
  • 10.1093/poq/nfu017
Response Heaping in Interviewer-Administered Surveys: Is It Really a Form of Satisficing?
  • Aug 13, 2014
  • Public Opinion Quarterly
  • A L Holbrook + 6 more

Response heaping (also referred to as rounding or digit preference) occurs when respondents show a preference for rounded numbers (often those divisible by five or 10). Conventional wisdom is that this is the result of taking cognitive shortcuts to make question answering easier, and as such, that it may be a form of survey satisficing. In four studies, we test this conventional wisdom for the first time by exploring whether response heaping occurs for five types of survey questions (behavioral frequency questions, questions that ask about an individual’s personal characteristics, questions that ask about an individual’s age at the time of an event, questions that ask the respondent to report a percentage, and feeling-thermometer attitude reports) under the conditions thought to foster survey satisficing (e.g., among respondents lower in ability and motivation, when the task of question-answering is difficult, and later in a long questionnaire) and whether heaped responses show effects of survey satisficing (e.g., shorter response latencies, less accuracy, and lowered predictive validity). We also examine the prevalence of response heaping and the extent to which heaping is associated across questions. Heaping above chance levels was found for most types of questions (although the prevalence of heaping varied systematically across different types of questions), but we found little evidence that heaping for most types of questions is more common under conditions thought to foster satisficing. In fact, heaping for some questions may actually reflect more thoughtful processes and result in higher data quality.
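One simple way to quantify heaping "above chance levels" is to compare the share of responses that are multiples of 5 with the 1-in-5 share expected if final digits were uniformly distributed, using an exact binomial tail probability. That baseline is an assumption made for this sketch, not necessarily the criterion the authors used, and the response values are invented:

```python
import math


def heaping_rate(responses, base=5):
    """Fraction of integer responses that are exact multiples of `base`."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(r % base == 0 for r in responses) / len(responses)


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical answers to a numeric survey question (e.g., days per month).
responses = [10, 15, 20, 7, 30, 12, 25, 40, 3, 50]

# Assumption: without heaping, final digits are uniform, so 1 in 5 values
# would be a multiple of 5 by chance.
chance_rate = 1 / 5
heaped = sum(r % 5 == 0 for r in responses)
rate = heaping_rate(responses)
p_value = binomial_upper_tail(heaped, len(responses), chance_rate)
print(f"heaped: {heaped}/{len(responses)} ({rate:.0%}), chance: {chance_rate:.0%}, p = {p_value:.4f}")
```

Here 7 of 10 answers are multiples of 5 (70% against an assumed 20% baseline), and the exact binomial tail probability is below 0.001. Real survey questions often have bounded or skewed ranges, so a uniform final-digit baseline is only a rough reference.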

  • Research Article
  • 10.63056/acad.004.04.1184
Examine the Effectiveness of AI Chatbots Responses on Health Library Reference Services
  • Dec 12, 2025
  • ACADEMIA International Journal for Social Sciences
  • Hajra Kalsoom + 5 more

The purpose of this study is to investigate the potential for AI chatbots to supplement health library reference services. By analyzing how AI chatbots respond to medical students' reference queries, the research seeks to provide insights into their effectiveness, limitations, and best practices in healthcare libraries. An extensive literature review was conducted to inform the design of this investigation. Several key terms, including "artificial intelligence chatbots," "artificial intelligence technologies," "AI reference services," and "library reference services," were used to search scholarly databases. Sources were screened for the functionalities, limitations, and best practices of AI chatbots in healthcare libraries. AI chatbots may even perform reference services themselves, reducing librarians' workload, and they could revolutionize healthcare education, reference services, and research. Training programs for health librarians and medical students are necessary to ensure successful integration. Several factors hinder the integration of AI into library operations, such as insufficient funding, librarian disinterest, and shortages of technical skills. Librarians should be prepared to integrate AI into library reference services and to address future concerns about information misuse. AI can enhance library reference services, improve efficiency, and provide access to information; the benefits of AI chatbots for health library reference services include improved efficiency and 24/7 access to health information. The study outlines the importance of health libraries in medical students' use of AI, as well as the role AI chatbots can play in facilitating reference queries in this field.

  • Research Article
  • Cited by 32
  • 10.1002/ca.980080211
Assessment of basic medical sciences in an integrated systems-based curriculum.
  • Jan 1, 1995
  • Clinical Anatomy
  • S Moqattash + 3 more

Basic medical sciences at Sultan Qaboos University (SQU) are taught in a systems-based curriculum. During the development of the courses, different formats and different types of questions have been used for the written examinations. This paper compares students' performance in relation to examination format and question type. The formats were non-coordinated assessments (NCAs), with a separate paper for each discipline; coordinated assessments (CAs), with questions from various disciplines given in the same paper but in separate sections for each discipline; and integrated assessments (IAs), with questions grouped under structure, function, and problem-based integrated long essays. The question types used were multiple-choice questions (MCQs), short-essay questions (SEQs), and structured integrated long-essay questions (SILEQs). Students performed better in SEQs than in MCQs. Our analyses also show that SILEQs measure skills similar to those of MCQs and SEQs combined. Students performed best in NCAs. In CAs, students concentrated on the disciplines carrying the most weight in the final grade. Currently we use IAs consisting of two parts: part I, comprising MCQs and SEQs, and part II, comprising SILEQs. To date, students are performing better in part II than in part I. We suggest that it is prudent to use different types of questions to measure students' knowledge and skills when IAs are used for systems-based courses.

  • Book Chapter
  • Cited by 3
  • 10.1007/978-0-85729-443-2_9
Types of Questions in Computer Science Education
  • Jan 1, 2011
  • Orit Hazzan + 2 more

As in the teaching of any discipline, computer science teachers are expected to vary their teaching methods, so this pedagogical issue should be included in the MTCS course. This chapter focuses on how to achieve this pedagogical goal by using different types of questions. It explores and discusses different types of questions that computer science educators (middle and high school teachers as well as university instructors) can use in different teaching situations and processes: in the classroom, in the computer lab, as homework, or in tests. The chapter also lays out the advantages of using a variety of question types for both learners and teachers, and focuses on the design process for different question types. Although the types of questions presented relate mainly to programming assignments, most of them are also suitable for other computer science content.

  • Research Article
  • Cited by 26
  • 10.1016/j.ijer.2020.101690
Categorizing mathematics teachers’ questioning: The demands and contributions of teachers’ questions
  • Jan 1, 2020
  • International Journal of Educational Research
  • Anna F Dejarnette + 2 more

  • Book Chapter
  • Cited by 11
  • 10.1007/978-3-540-88309-8_24
Automatic Generalization of a QA Answer Extraction Module Based on Semantic Roles
  • Oct 14, 2008
  • P Moreda + 3 more

In recent years, improvements in automatic semantic role labeling have increased researchers' interest in applying it to different NLP fields, especially question answering (QA) systems. We propose an automatic generalization of the use of semantic roles (SR) in QA systems to extract answers for different types of questions. First, we implemented two versions of an answer extraction module using SR: a) rules-based and b) patterns-based. These modules work as part of a QA system to extract answers to location questions. Second, these approaches were automatically generalized to any type of factoid question using generalization rules. The whole system was evaluated using both location and temporal questions from TREC datasets. The results indicate that automatic generalization is feasible, achieving results of the same quality for both the original question type and the new automatically generalized one (precision: 88.20% LOC and 95.08% TMP). Furthermore, the results show that the patterns-based approach works better for both types of questions (F1 improvement: +40.88% LOC and +15.41% TMP).
Keywords: Semantic Roles; Question Answering; Semantic Rules; Semantic Patterns; Internet Search Engines
