CORD-19 Dataset Research Articles

Background: Knowledge graphs (KGs) serve as a convenient framework for structuring knowledge. A number of computational methods have been developed to generate KGs from biomedical literature and use them for downstream tasks such as link prediction and question answering. However, there is a lack of computational tools or web frameworks to support the exploration and visualization of the KGs themselves, which would facilitate interactive knowledge discovery and formulation of novel biological hypotheses.

Method: We developed a web framework for Knowledge Graph Exploration and Visualization (KGEV) to construct and visualize KGs in five stages: triple extraction, triple filtration, metadata preparation, knowledge integration, and graph database preparation. The application has convenient user interface tools, such as node and edge search and filtering, data source filtering, neighborhood retrieval, and shortest path calculation, that work by querying a backend graph database. Unlike other KGs, our framework allows fast retrieval of relevant texts supporting the relationships in the KG, thus allowing human reviewers to judge the reliability of the knowledge extracted.

Results: We demonstrated a case study of using the KGEV framework to perform research on COVID-19. The COVID-19 pandemic resulted in an explosion of relevant literature, making it challenging to make full use of the vast and heterogeneous sources of information. We generated a COVID-19 KG with heterogeneous information, including literature information from the CORD-19 dataset, as well as other existing knowledge from eight data sources. We showed the utility of KGEV in three intuitive case studies to explore and query knowledge on COVID-19. A demo of this web application can be accessed at http://covid19nlp.wglab.org. Finally, we also demonstrated a turn-key adaptation of the KGEV framework to study the clinical phenotypic presentation of human diseases using the Human Phenotype Ontology (HPO), illustrating the versatility of the framework.

Conclusion: In an era of literature explosion, the KGEV framework can be applied to many emerging diseases to support structured navigation of the vast amount of newly published biomedical literature and other existing biological knowledge in various databases. It can also be used as a general-purpose tool to explore and query gene-phenotype-disease-drug relationships interactively.
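To make the interface operations named in the abstract more concrete, the following is a minimal sketch of neighborhood retrieval, shortest path calculation, and retrieval of supporting text with data source filtering. It is not the KGEV implementation: KGEV queries a backend graph database, whereas this sketch assumes a small in-memory stand-in. The class `TinyKG`, its method names, and the placeholder triples are all hypothetical.

```python
from collections import defaultdict, deque


class TinyKG:
    """Hypothetical in-memory stand-in for a graph database (illustration only)."""

    def __init__(self):
        self.adj = defaultdict(set)        # node -> set of neighboring nodes
        self.evidence = defaultdict(list)  # (head, relation, tail) -> [(source, sentence)]

    def add_triple(self, head, relation, tail, sentence, source):
        # Edges are treated as undirected for traversal purposes.
        self.adj[head].add(tail)
        self.adj[tail].add(head)
        self.evidence[(head, relation, tail)].append((source, sentence))

    def neighborhood(self, node):
        """Return the direct neighbors of a node, sorted for stable output."""
        return sorted(self.adj.get(node, ()))

    def shortest_path(self, start, goal):
        """Breadth-first search; returns a list of nodes, or None if unreachable."""
        if start == goal:
            return [start]
        parent = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in self.adj.get(current, ()):
                if nxt in parent:
                    continue
                parent[nxt] = current
                if nxt == goal:
                    # Walk back through the parents to rebuild the path.
                    path = [nxt]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(nxt)
        return None

    def supporting_text(self, head, relation, tail, sources=None):
        """Return (source, sentence) pairs for a triple, optionally filtered by source."""
        hits = self.evidence.get((head, relation, tail), [])
        return [h for h in hits if sources is None or h[0] in sources]


# Placeholder triples, not taken from the COVID-19 KG.
kg = TinyKG()
kg.add_triple("drug_A", "targets", "gene_B", "placeholder sentence 1", "literature")
kg.add_triple("gene_B", "associated_with", "disease_X", "placeholder sentence 2", "database")

path = kg.shortest_path("drug_A", "disease_X")
neighbors = kg.neighborhood("gene_B")
texts = kg.supporting_text("drug_A", "targets", "gene_B", sources={"literature"})
```

Keeping the supporting sentences keyed by triple is what lets a reviewer jump from an edge to the text behind it, which is the property the abstract highlights.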

We address the problem of extracting reports of statistics along with information about the experiment conditions and experiment topics from scientific publications. A common writing style for statistical results follows the recommendations of the American Psychological Association (APA). In practice, writing styles vary because reports do not fully follow APA style or omit parameters that are mandatory. In addition, the statistics are not reported in isolation but in the context of the experiment conditions investigated and the general experiment topic. We address these challenges by proposing a flexible pipeline, STEREO, based on wrapper induction and unsupervised aspect detection to extract experiment statistics, conditions, and topics. Thus, in contrast to existing rule-based tools like statcheck with a pre-defined set of rules, we learn rules via induction. Hierarchical wrapper induction is applied to learn rules to extract the reported statistics. The challenge here is to apply wrapper induction to an information extraction task that lacks the formatting landmarks that can be exploited in HTML pages. The result of step 1 is a set of extracted statistic reports together with the sentences in which the reports were found. This is used as input to step 2 of STEREO, which has two parts. We extract experiment conditions using a grammar-based wrapper. Furthermore, we identify the experiment topic using an unsupervised attention-based aspect extraction approach adapted to our problem domain. We applied our pipeline to the over 100,000 documents in the CORD-19 dataset. It required only 0.25% of the CORD-19 corpus (about 500 documents) to learn statistics extraction rules that cover 95% of the sentences in CORD-19. The statistic extraction has 100% precision on APA-conform statistics, which is identical to statcheck. In addition, STEREO can extract non-APA writing styles with 95% precision, which statcheck does not support. Extracting non-APA-conform statistics is important as they make up more than 99% of all 113k extracted statistics. We extracted the correct conditions in 46% of APA-conform reports (30% for non-APA). The best model for topic extraction achieves a precision of 75% on statistics reported in APA style (73% for non-APA-conform). We conclude that STEREO is a good foundation for automatic statistic extraction and future developments for scientific paper analysis. In particular, the extraction of non-APA-conform reports is important and allows applications such as giving feedback to authors about what is missing and could be changed. Finally, STEREO complements existing metadata extraction tools and can be integrated into a general scientific paper analysis pipeline.
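To illustrate what an APA-conform statistic report looks like to an extractor, the following is a minimal sketch of a single hand-written rule for t-test reports. It is not STEREO's method: STEREO learns its rules via wrapper induction, whereas a fixed pattern like this one is closer in spirit to a tool with pre-defined rules. The pattern `APA_T_TEST`, the function `extract_t_tests`, and the example sentences are hypothetical.

```python
import re

# Hand-written pattern for reports of the form "t(28) = 2.20, p = .03".
# Assumption: only this one APA-style layout is covered.
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)\s*=\s*(?P<value>-?\d*\.?\d+)\s*,\s*"
    r"p\s*(?P<cmp>[<>=])\s*(?P<p>\d*\.\d+)"
)


def extract_t_tests(sentence):
    """Return one dict per t-test report found in the sentence."""
    return [
        {
            "df": float(m["df"]),    # degrees of freedom
            "t": float(m["value"]),  # test statistic
            "cmp": m["cmp"],         # comparison operator used for p
            "p": float(m["p"]),      # reported p-value
        }
        for m in APA_T_TEST.finditer(sentence)
    ]


# Made-up example sentences, not taken from CORD-19.
apa_hits = extract_t_tests("The groups differed, t(28) = 2.20, p = .03.")
non_apa_hits = extract_t_tests("The t value was 2.20 with 28 degrees of freedom.")
```

The second sentence carries the same information but is missed by the fixed pattern, which is the kind of non-APA report the abstract says makes up most extracted statistics.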

Articles published on CORD-19 Dataset

CovSumm: an unsupervised transformer-cum-graph-based hybrid document summarization model for CORD-19.

Avoiding background knowledge: literature based discovery from important information

Extraction of knowledge graph of Covid-19 through mining of unstructured biomedical corpora

Building an intelligent system for answering specialized questions about COVID-19

Do We Need a Specific Corpus and Multiple High-Performance GPUs for Training the BERT Model? An Experiment on COVID-19 Dataset

UGDAS: Unsupervised graph-network based denoiser for abstractive summarization in biomedical domain.

Expediting knowledge acquisition by a web framework for Knowledge Graph Exploration and Visualization (KGEV): case studies on COVID-19 and Human Phenotype Ontology

Contextual Query Expansion for Conducting Technology-Assisted Biomedical Reviews

Extracting Experiment Statistics, Conditions, and Topics from Scientific Papers with STEREO

Execution Time Prediction for Cypher Queries in the Neo4j Database Using a Learning Approach

Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

A REVIEW ON QUESTION AND ANSWER SYSTEM FOR COVID-19 LITERATURE ON PRE-TRAINED MODELS

Queries related to COVID-19: a more effective retrieval through finetuned ALBERT with BM25L question answering system

HITS-based attentional neural model for abstractive summarization

Automatic Text Summarization of COVID-19 Research Articles Using Recurrent Neural Networks and Coreference Resolution

Evidence-Based Recommender System for a COVID-19 Publication Analytics Service

Sistemas de recuperación de información implementados a partir de CORD-19: herramientas clave en la gestión de la información sobre COVID-19 [Information retrieval systems built on CORD-19: key tools for managing information on COVID-19]

Dental Risks and Precautions during COVID-19 Pandemic: A Systematic Review.
