Abstract
Background
Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation and for avoiding flawed hypothesis generation, and it can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here.
Results
In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database having incomplete curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. Of the 7387 tested genes, 122 and 165 were identified as missing expression annotations because their residuals fell outside the lower or upper 95% confidence interval of the model, respectively. The model had precision of 0.97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95% confidence interval.
Conclusions
This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information.
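The workflow described in the abstract (fit a linear regression on 25% of the genes, then flag held-out genes whose residuals fall outside a 95% band, and score the flags with precision and recall) can be illustrated with a minimal sketch. This is not the authors' implementation. The data are synthetic, the three numeric features are hypothetical stand-ins for the publication and construct counts, the gene symbol feature is omitted, and the band is approximated here as ±1.96 residual standard deviations, which is an assumption about how the interval might be computed.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(X)), X])


def fit_ols(X, y):
    """Fit ordinary least squares and return the coefficient vector."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef


def flag_outliers(residuals, sigma, z=1.96):
    """Flag residuals below -z*sigma (low) or above +z*sigma (high)."""
    return residuals < -z * sigma, residuals > z * sigma


def precision_recall(flagged, truth):
    """Precision and recall of a boolean flag vector against boolean truth."""
    tp = np.sum(flagged & truth)
    precision = tp / flagged.sum() if flagged.any() else float("nan")
    recall = tp / truth.sum() if truth.any() else float("nan")
    return float(precision), float(recall)


# Synthetic stand-in data; none of these values come from ZFIN.
rng = np.random.default_rng(0)
n_genes = 2000
n_pubs = rng.poisson(20, n_genes)            # hypothetical publication count
n_expr_pubs = rng.binomial(n_pubs, 0.3)      # hypothetical expression-attributed count
n_constructs = rng.poisson(2, n_genes)       # hypothetical construct count
X = np.column_stack([n_pubs, n_expr_pubs, n_constructs]).astype(float)
y = 0.5 + 0.1 * n_pubs + 2.0 * n_expr_pubs + 0.5 * n_constructs
y = y + rng.normal(0.0, 1.0, n_genes)

# 25% of genes train the model; the rest are held out for testing.
order = rng.permutation(n_genes)
train, test = order[: n_genes // 4], order[n_genes // 4 :]

# Simulate under-curation: remove 10 experiments from 50 held-out genes.
incomplete = np.zeros(n_genes, dtype=bool)
incomplete[test[:50]] = True
y_observed = np.where(incomplete, y - 10.0, y)

coef = fit_ols(X[train], y_observed[train])
train_resid = y_observed[train] - add_intercept(X[train]) @ coef
# Residual standard deviation, corrected for the number of fitted parameters.
sigma = np.sqrt(np.sum(train_resid**2) / (len(train) - len(coef)))

# Residual = observed minus predicted; strongly negative means fewer
# experiments than the model expects for that gene.
test_resid = y_observed[test] - add_intercept(X[test]) @ coef
low, high = flag_outliers(test_resid, sigma)
precision, recall = precision_recall(low, incomplete[test])
print(f"flagged low: {low.sum()}, high: {high.sum()}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this sketch only the low side is scored, because the synthetic perturbation removes experiments. With real data, the ground truth for precision and recall would have to come from manual review of the flagged genes.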
Highlights
Many biological knowledge bases gather data through expert curation of published literature
The MachineLearningReport.txt file included one row of data for each of the 36,655 gene records found in the Zebrafish Information Network (ZFIN) at the time the file was generated
ZFIN includes many publication types, but only journal publications were included in this study because they are the source of the gene expression annotations being modeled
Summary
Many biological knowledge bases gather data through expert curation of published literature. Computational methods to assess data set completeness are needed. The biological sciences have benefited immensely from new technologies and methods in both biological research and computer sciences. Together these advances have produced a surge of new data. Assessing how complete or correct a large data set may be remains a challenge, although examples have been reported. These include computational methods for identifying data updates and artifacts that may be of interest to downstream data consumers [1], and machine learning methods to identify incorrectly classified G-protein coupled receptors [2] and to improve the quality of large data sets prior to quantitative structure-activity relationship modeling [3]. The completeness and quality of curated nanomaterial data have also been explored [4].