COVID-19 Open Research Dataset Research Articles

Background: The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented “infodemic”; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis–related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query.

Objective: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19–related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data.

Methods: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources.

Results: REDASA (Realtime Data Synthesis and Analysis) is now one of the world’s largest and most up-to-date sources of COVID-19–related evidence; it consists of 104,000 documents. By capturing curators’ critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. These articles provide COVID-19–related information and represent around 10% of all papers about COVID-19.

Conclusions: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA’s design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers’ critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world’s largest COVID-19–related data corpora for searches and curation.

Background: Shortly after the emergence of COVID-19, researchers rapidly mobilized to study numerous aspects of the disease such as its evolution, clinical manifestations, effects, treatments, and vaccinations. This led to a rapid increase in the number of COVID-19–related publications. Identifying trends and areas of interest using traditional review methods (eg, scoping and systematic reviews) for such a large domain area is challenging.

Objective: We aimed to conduct an extensive bibliometric analysis to provide a comprehensive overview of the COVID-19 literature.

Methods: We used the COVID-19 Open Research Dataset (CORD-19) that consists of a large number of research articles related to all coronaviruses. We used a machine learning–based method to analyze the most relevant COVID-19–related articles and extracted the most prominent topics. Specifically, we used a clustering algorithm to group published articles based on the similarity of their abstracts to identify research hotspots and current research directions. We have made our software accessible to the community via GitHub.

Results: Of the 196,630 publications retrieved from the database, we included 28,904 in our analysis. The mean number of weekly publications was 990 (SD 789.3). The country that published the highest number of COVID-19–related articles was China (2950/17,270, 17.08%). The highest number of articles were published in bioRxiv. Lei Liu affiliated with the Southern University of Science and Technology in China published the highest number of articles (n=46). Based on titles and abstracts alone, we were able to identify 1515 surveys, 733 systematic reviews, 512 cohort studies, 480 meta-analyses, and 362 randomized control trials. We identified 19 different topics covered among the publications reviewed. The most dominant topic was public health response, followed by clinical care practices during the COVID-19 pandemic, clinical characteristics and risk factors, and epidemic models for its spread.

Conclusions: We provide an overview of the COVID-19 literature and have identified current hotspots and research directions. Our findings can be useful for the research community to help prioritize research needs and recognize leading COVID-19 researchers, institutes, countries, and publishers. Our study shows that an AI-based bibliometric analysis has the potential to rapidly explore a large corpus of academic publications during a public health crisis. We believe that this work can be used to analyze other eHealth-related literature to help clinicians, administrators, and policy makers to obtain a holistic view of the literature and be able to categorize different topics of the existing research for further analyses. It can be further scaled (for instance, in time) to clinical summary documentation. Publishers should avoid noise in the data by developing a way to trace the evolution of individual publications and unique authors.
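
The abstract above describes grouping articles by the similarity of their abstracts but does not name the clustering algorithm or the text representation. The following is a minimal sketch of one common way to do this, assuming TF-IDF vectors, cosine similarity, and a small k-means-style loop with a deterministic farthest-first start. The function names, the toy abstracts, and the choice of k are all illustrative assumptions and are not taken from the authors' GitHub software.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * math.log((1 + n) / (1 + df[t])) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Normalized sum of the member vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def cluster(vectors, k, iterations=10):
    """Assign each vector to one of k clusters (assumes k <= len(vectors))."""
    # Farthest-first start: begin with the first document, then repeatedly
    # add the document least similar to the centers chosen so far.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centers)))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda j: cosine(v, centers[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels


# Toy abstracts invented for illustration; they are not CORD-19 records.
docs = [
    "vaccine trial efficacy antibody response",
    "lockdown policy public health response measures",
    "antibody response after vaccine dose",
    "public health policy and lockdown measures",
]
labels = cluster(tfidf_vectors(docs), k=2)
print(labels)
```

On a real corpus the number of clusters would have to be chosen by some criterion (the abstract reports 19 topics but not how that number was reached), and a library implementation would normally replace this hand-rolled loop.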

Articles published on COVID-19 Open Research Dataset

New trends in scientific knowledge graphs and research impact assessment

Covid-on-the-Web: Exploring the COVID-19 scientific literature through visualization of linked data from entity and argument mining

Converting Biomedical Text Annotated Resources into FAIR Research Objects with an Open Science Platform

Demystifying COVID-19 publications: institutions, journals, concepts, and topics.

Efficient Self-Supervised Metric Information Retrieval: A Bibliography Based Method Applied to COVID Literature.

Models and Processes to Extract Drug-like Molecules From Natural Language Text.

Shall I Work with Them? A Knowledge Graph-Based Approach for Predicting Future Research Collaborations.

KAAPA: Knowledge Aware Answers from PDF Analysis

Queries related to COVID-19: a more effective retrieval through finetuned ALBERT with BM25L question answering system

Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study.

COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization

A comparative analysis of system features used in the TREC-COVID information retrieval challenge

HITS-based attentional neural model for abstractive summarization

Revealing Opinions for COVID-19 Questions Using a Context Retriever, Opinion Aggregator, and Question-Answering Model: Model Development Study.

A Comprehensive Overview of the COVID-19 Literature: Machine Learning-Based Bibliometric Analysis.

Impact of COVID-19 on longitudinal ophthalmology authorship gender trends

WITHDRAWN: Classification of covid related articles using machine learning

The nurse COVID and historical epidemics literature repository: Development, description, and summary

Convergence in Viral Outbreak Research: Using Natural Language Processing to Define Network Bridges in the Bench-Bedside-Population Paradigm

Classification of a COVID-19 dataset by using labels created from clustering algorithms
