A statistical approach to identify, monitor, and manage incomplete curated data sets

Douglas G Howe

doi:10.1186/s12859-018-2121-6

Abstract

Background: Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation and for avoiding flawed hypothesis generation, and it can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here.

Results: In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database having incomplete curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. Of the 7387 tested genes, 122 and 165 were identified as missing expression annotations based on their residuals being outside the model's lower or upper 95% confidence interval, respectively. The model had precision of 0.97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95% confidence interval.

Conclusions: This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information.
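The residual-based flagging described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the data are synthetic, the feature names (`pubs`, `attributed`, `constructs`) are hypothetical stand-ins for three of the listed predictors, the gene symbol feature is omitted, and the 95% band is assumed to be plus or minus 1.96 standard deviations of the training residuals, which the abstract does not specify.

```python
import random
import statistics


def fit_ols(X, y):
    """Fit ordinary least squares by solving the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


def predict(beta, X):
    """Return model predictions for each feature row."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]


def flag_outliers(beta, X_train, y_train, X_test, y_test, z=1.96):
    """Return test indices whose residuals fall below or above the band.

    Assumption: the band is +/- z times the standard deviation of the
    training residuals; the paper's exact interval may differ.
    """
    train_res = [t - p for t, p in zip(y_train, predict(beta, X_train))]
    sd = statistics.stdev(train_res)
    test_res = [t - p for t, p in zip(y_test, predict(beta, X_test))]
    low = [i for i, r in enumerate(test_res) if r < -z * sd]
    high = [i for i, r in enumerate(test_res) if r > z * sd]
    return low, high


# Synthetic stand-in for per-gene records (not ZFIN data).
random.seed(0)
X, y = [], []
for _ in range(400):
    pubs = random.randint(1, 50)            # journal publications
    attributed = random.randint(0, pubs)    # publications already attributed
    constructs = random.randint(0, 5)       # transgenic constructs
    X.append([pubs, attributed, constructs])
    y.append(1.5 * attributed + 0.8 * constructs + random.gauss(0, 1))

# Train on 25% of the records and test on the rest, as in the abstract.
n_train = len(X) // 4
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Plant two genes whose experiment counts are far from what the model expects.
y_test[0] -= 20
y_test[1] += 20

beta = fit_ols(X_train, y_train)
low, high = flag_outliers(beta, X_train, y_train, X_test, y_test)
print("below lower bound:", low)
print("above upper bound:", high)
```

The two planted genes land in the lower and upper lists respectively. A small share of ordinary genes is also flagged by chance, which is why the abstract reports precision and recall for each bound rather than treating every flagged gene as incomplete.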

Highlights

Many biological knowledge bases gather data through expert curation of published literature
The MachineLearningReport.txt file included one row of data for each of the 36,655 gene records found in Zebrafish Information Network (ZFIN) at the time the file was generated
ZFIN includes many publication types, but only journal publications were included in this study because they are the source of the gene expression annotations being modeled

Summary

Introduction

Many biological knowledge bases gather data through expert curation of published literature. Computational methods to assess data set completeness are needed. The biological sciences have benefited immensely from new technologies and methods in both biological research and computer sciences. Together these advances have produced a surge of new data. Assessing how complete or correct a large data set may be remains a challenge, although some approaches have been reported. Examples include computational methods for identifying data updates and artifacts that may be of interest to downstream data consumers [1], and machine learning methods to identify incorrectly classified G-protein coupled receptors [2] and to improve the quality of large data sets prior to quantitative structure-activity relationship modeling [3]. The completeness and quality of curated nanomaterial data have also been explored [4].

Journal: BMC Bioinformatics
Publication Date: Apr 2, 2018
Citations: 2
License type: open-access
