Abstract

: https://bitbucket.org/biodbqual/benchmarks.

Highlights

  • Sequencing technologies are producing massive volumes of data

  • We introduce three benchmarks containing INSDC duplicates that were collected based on three different principles: records merged directly in INSDC (111,826 pairs); INSDC records labelled as references during UniProtKB/Swiss-Prot expert curation (2,465,891 pairs); and INSDC records labelled as references in UniProtKB/TrEMBL automatic curation (473,555,072 pairs)

  • The other two benchmarks are the expert curation and automatic curation benchmarks. Construction of these benchmarks of duplicate nucleotide records is based on the mapping between INSDC and protein databases (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL), and consists of two main steps
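The highlights above do not spell out the two construction steps, so the following is only a minimal sketch of how a mapping between protein entries and INSDC records could yield duplicate pairs. It assumes, without confirmation from the text, that nucleotide records cross-referenced by the same protein entry are paired with one another. The function name `duplicate_pairs` and all identifiers in the sample mapping are hypothetical placeholders, not real accessions.

```python
from itertools import combinations


def duplicate_pairs(protein_to_nucleotide):
    """Return the set of unordered nucleotide-record pairs that share a protein entry.

    Assumption (not stated in the text): two nucleotide records count as a
    duplicate pair when the same protein entry references both of them.
    """
    pairs = set()
    for accessions in protein_to_nucleotide.values():
        # Deduplicate and sort so each pair has one canonical orientation.
        unique = sorted(set(accessions))
        for first, second in combinations(unique, 2):
            pairs.add((first, second))
    return pairs


# Hypothetical mapping: protein entry -> referenced nucleotide records.
mapping = {
    "PROT_1": ["NUC_A", "NUC_B", "NUC_C"],
    "PROT_2": ["NUC_B", "NUC_C"],  # pair already produced by PROT_1
    "PROT_3": ["NUC_D"],           # a single reference yields no pair
}

pairs = duplicate_pairs(mapping)
for pair in sorted(pairs):
    print(pair)
```

Storing pairs in a set with a canonical ordering keeps a pair from being counted twice when more than one protein entry references the same two records.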


Summary

Introduction

Sequencing technologies are producing massive volumes of data. GenBank, one of the primary nucleotide databases, increased in size by over 40% in 2014 alone [1]. Researchers have been concerned about the underlying data quality in biological sequence databases since the 1990s [2]. A particular concern is duplicates, where a database contains multiple instances representing the same entity. Recent studies have identified duplicates as one of five central data quality problems [5], and detection and removal of duplicates has been described as a key early step in bioinformatics database curation [6]. In the context of general databases, the problems of quality control and duplicate detection have a long history of research. We review prior work on duplicate detection in bioinformatics databases. We then describe data quality control in INSDC, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, the sources used to construct the duplicate benchmark sets that we introduce.

