Pre-trained Transformer Research Articles

Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labelled data, making deployment and generalisability challenging. How well a general-purpose AI language model performs diagnosis and triage relative to physicians and lay individuals is not well understood. We compared the diagnostic and triage accuracy of Generative Pre-trained Transformer 3 (GPT-3) on 48 validated synthetic case vignettes (<50 words; sixth-grade reading level or below) of both common (eg, viral illness) and severe (eg, heart attack) conditions with that of a nationally representative sample of 5000 lay individuals from the USA, who could use the internet to find the correct options, and 21 practising physicians at Harvard Medical School. There were 12 vignettes for each of four triage categories: emergent, within 1 day, within 1 week, and self-care. The correct diagnosis and triage category (ie, ground truth) for each vignette was determined by two general internists at Harvard Medical School. For each vignette, human respondents and GPT-3 were prompted to list diagnoses in order of likelihood, and the vignette was marked as correct if the ground-truth diagnosis was among the top three listed diagnoses. For triage accuracy, we examined whether the triage category selected by the human respondents and by GPT-3 was exactly correct according to the four triage categories, or matched a dichotomised triage variable (emergent or within 1 day vs within 1 week or self-care). We estimated GPT-3's diagnostic and triage confidence on a given vignette using a modified bootstrap resampling procedure, and examined how well calibrated that confidence was by computing calibration curves and Brier scores.
We also performed a subgroup analysis by case acuity, and an error analysis of GPT-3's triage advice to characterise how that advice might affect patients using the tool to decide whether to seek medical care immediately. Among all cases, GPT-3 replied with the correct diagnosis in its top three for 88% (42/48, 95% CI 75-94) of cases, compared with 54% (2700/5000, 53-55) for lay individuals (p<0.0001) and 96% (637/666, 94-97) for physicians (p=0.012). GPT-3 triaged 70% of cases correctly (34/48, 57-82), versus 74% (3706/5000, 73-75; p=0.60) for lay individuals and 91% (608/666, 89-93; p<0.0001) for physicians. As measured by the Brier score, GPT-3's confidence in its top prediction was reasonably well calibrated for diagnosis (Brier score=0.18) and triage (Brier score=0.22). We observed an inverse relationship between case acuity and GPT-3 accuracy (p<0.0001), with a fitted trend line showing an 8.33% decrease in accuracy for every level of increase in case acuity. In the triage error analysis, GPT-3 deprioritised truly emergent cases in seven instances. A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, those of physicians and better than those of lay individuals. GPT-3's triage performance was inferior to that of physicians, sometimes by a large margin, and was closer to that of lay individuals. The diagnostic performance of GPT-3 was comparable to that of physicians and significantly better than that of a typical person using a search engine. This study was funded by the National Heart, Lung, and Blood Institute.
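The proportions with 95% CIs and the Brier scores reported above can be illustrated with a short sketch. The abstract does not say which interval method was used; the Wilson score interval below is an assumption, although for 42/48 it gives roughly 75% to 94%, in line with the reported diagnostic interval. The function names and the confidence values in the demo are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def brier_score(confidences, outcomes):
    """Mean squared difference between stated confidence and the 0/1 outcome.

    0 is perfect; always answering 0.5 scores 0.25.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not outcomes:
        raise ValueError("at least one prediction is required")
    total = sum((c - o) ** 2 for c, o in zip(confidences, outcomes))
    return total / len(outcomes)


if __name__ == "__main__":
    # Counts taken from the abstract: 42 of 48 vignettes diagnosed correctly.
    low, high = wilson_interval(42, 48)
    print(f"diagnosis: {42 / 48:.1%} (95% CI {low:.1%} to {high:.1%})")

    # Hypothetical confidences and correctness flags, for illustration only.
    demo_confidences = [0.9, 0.8, 0.6, 0.3]
    demo_outcomes = [1, 1, 0, 0]
    print(f"Brier score: {brier_score(demo_confidences, demo_outcomes):.3f}")
```

A lower Brier score means the stated confidence tracks correctness more closely, which is why the abstract uses it alongside calibration curves.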

Background and objective: Researchers commonly use automated solutions such as Natural Language Processing (NLP) systems to extract clinical information from large volumes of unstructured data. However, the poor semantic structure and domain-specific vocabulary of clinical text can make it challenging to develop a one-size-fits-all solution. Large Language Models (LLMs), such as OpenAI's Generative Pre-Trained Transformer 3 (GPT-3), offer a promising solution for capturing and standardizing unstructured clinical information. This study evaluated the performance of InstructGPT, a family of models derived from the LLM GPT-3, in extracting relevant patient information from medical case reports, and discussed the advantages and disadvantages of LLMs versus dedicated NLP methods.
Methods: In this paper, 208 articles related to case reports of foreign body injuries in children were identified by searching PubMed, Scopus, and Web of Science. A reviewer manually extracted information on sex, age, the object that caused the injury, and the injured body part for each patient to build a gold standard against which the performance of InstructGPT was compared.
Results: InstructGPT achieved high accuracy in classifying the sex, age, object and body part involved in the injury, with 94%, 82%, 94% and 89%, respectively. When articles for which InstructGPT could not retrieve any information were excluded, the accuracy for determining the child's sex and age improved to 97%, and the accuracy for identifying the injured body part improved to 93%. InstructGPT was also able to extract information from non-English language articles.
Conclusions: The study highlights that LLMs have the potential to eliminate the need for task-specific training (zero-shot extraction). They allow clinical information to be retrieved from unstructured natural language text, particularly published scientific literature such as case reports, by working directly from the PDF file of the article, without any pre-processing and without requiring technical expertise in NLP or Machine Learning. The diverse nature of the corpus, which includes articles written in languages other than English, some containing a wide range of clinical details and others lacking information, adds to the strength of the study.
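The Results report accuracy twice: once over all articles, and once after excluding articles for which nothing could be retrieved. A minimal sketch of that comparison follows, assuming an extracted value is scored by exact match against the gold standard and that a failed retrieval is represented as `None`. The function name, the `skip_missing` parameter, and the sample values are hypothetical and do not come from the study.

```python
def accuracy(extracted, gold, skip_missing=False):
    """Share of records whose extracted value equals the gold-standard value.

    extracted: values returned by the model, None where nothing was retrieved
    gold: manually extracted reference values, in the same order
    skip_missing: if True, drop records where the model returned None
    """
    if len(extracted) != len(gold):
        raise ValueError("extracted and gold must have the same length")
    pairs = list(zip(extracted, gold))
    if skip_missing:
        pairs = [(e, g) for e, g in pairs if e is not None]
    if not pairs:
        raise ValueError("no records left to score")
    return sum(e == g for e, g in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical values for one field (sex) across five case reports.
    gold_sex = ["F", "M", "M", "F", "M"]
    extracted_sex = ["F", "M", None, "F", "F"]

    print(f"all articles:        {accuracy(extracted_sex, gold_sex):.0%}")
    print(
        "retrieved only:      "
        f"{accuracy(extracted_sex, gold_sex, skip_missing=True):.0%}"
    )
```

Counting a missing value as an error gives the stricter figure; excluding it separates retrieval failures from wrong answers, which is the distinction the abstract draws.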

Articles published on Pre-trained Transformer

Multimodal Affective Communication Analysis: Fusing Speech Emotion and Text Sentiment Using Machine Learning

Towards python program repair with generative pre-trained transformer (GPT-3.5)

Improving biomedical entity linking for complex entity mentions with LLM-based text simplification.

Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study.

PIM GPT a hybrid process in memory accelerator for autoregressive transformers

ChatGPT-4 Consistency in Interpreting Laryngeal Clinical Images of Common Lesions and Disorders.

Evaluating ChatGPT's Performance in the EU*US eHealth Work Foundational Curriculum Using the HITCOMP Self-Assessment Quiz.

ChatGPT sitting for FRCS Urology examination: Will artificial intelligence get certified?

ChatGPT Research in Healthcare is Increasing Dramatically: A Bibliometric Analysis Based on VOSviewer.

Caution Regarding ChatGPT's Appropriateness and Reliability Regarding Surgery for Wrist Arthritis.

The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study

Is GPT-4 Conscious?

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Authentic assessment in medical education: exploring AI integration and student-as-partners collaboration.

Evaluation of the Diagnostic Accuracy of GPT-4 in Five Thousand Rare Disease Cases.

Towards efficient AutoML: a pipeline synthesis approach leveraging pre-trained transformers for multimodal data

Fine-Grained Sentiment Classification Using Generative Pretrained Transformer

Examining ChatGPT adoption among educators in higher educational institutions using extended UTAUT model

Information extraction from medical case reports using OpenAI InstructGPT

The judge, the AI, and the Crown: a collusive network
