Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

Qingyu Chen, Karin Verspoor, Justin Zobel

doi:10.1093/database/baw163

Journal: Database
Publication Date: Jan 1, 2017
Citations: 48
License type: cc-by

Affiliation: University of Melbourne

Abstract

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC (a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases) in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.

Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w
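To make the case study in outcome (3) concrete, the following is a minimal sketch of how two records in a merged group can yield different GC content and melting temperature. The sequences are hypothetical, the function names are illustrative, and the melting temperature uses the Wallace rule (Tm = 2(A+T) + 4(G+C), a rough estimate for short oligonucleotides) as an assumed stand-in; the abstract does not state which formula the authors applied.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def wallace_tm(seq: str) -> int:
    """Wallace-rule melting temperature in degrees Celsius.

    Assumption: chosen here for simplicity; only reasonable for short oligos.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Hypothetical merged pair: record_b is a truncated fragment of record_a.
record_a = "ATGCGCGCTTAAGGCC"
record_b = "ATGCGCGC"

for name, seq in (("record_a", record_a), ("record_b", record_b)):
    print(name, round(gc_content(seq), 3), wallace_tm(seq))
```

Although the two records would be treated as describing the same entity, the computed values differ, which is the kind of inconsistency the abstract describes.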

Highlights

Many kinds of database contain multiple instances of records.
We focus on International Nucleotide Sequence Database Collaboration (INSDC) records that have been reported as duplicates by manual processes and merged.
Duplicates need to be categorized and quantified to identify the distinct characteristics of different categories and organisms; we suggest that these duplicate types must be addressed separately in any duplicate detection strategy.

Summary

Introduction

Many kinds of database contain multiple instances of records. These instances may be identical, or may be similar but with inconsistencies; in traditional database contexts, this means that the same entity may be described in conflicting ways.

‘20% of (these) errors require additional rebuild time and effort from both developer and biologist’ [27]; ‘The removal of bacterial redundancy in UniProtKB (and normal flux in protein) would have meant that most (>90%) of Pfam (a highly curated protein family database using UniProtKB data) seed alignments would have needed manual verification (and potential modification)’.

This is one of three benchmarks of duplicates that we have constructed [53]. While it is the smallest and most narrowly defined of the three benchmarks, it allows us to investigate the nature of duplication in INSDC as it arises during generation and submission of biological sequences, and facilitates understanding the value of later curation.

Databases in the same domain, for example gene annotation, may be specialized for different perspectives, such as annotations on genes in different organisms or different functions, but they arguably belong to the same broad domain.