Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

Qingyu Chen, Justin Zobel, Xiuzhen Zhang, Karin Verspoor

doi:10.1371/journal.pone.0159644

Open Access

https://doi.org/10.1371/journal.pone.0159644

Journal: PLOS ONE	Publication Date: Aug 4, 2016
Citations: 16	License type: CC BY 4.0

Affiliation: University of Melbourne, RMIT University

Abstract

Motivation: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.

Results: We developed and evaluated a supervised duplicate detection method based on an expert-curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieved promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintained high accuracy and was more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.

Highlights

Duplication is a central data quality problem, impacting the volume of data that must be processed during data curation and computational analyses and leading to inconsistencies when contradictory or missing information on a given entity appears in a duplicated record.
Methods such as CD-HIT [2] treat records whose sequence identity meets a threshold (90% by default) as duplicates.
We make the following contributions: (1) we explore a supervised duplicate-detection model for pairs of genomic database records, proposing a feature representation based on 22 distinct attributes of record pairs, testing three learning algorithms, and experimenting with both binary and multi-class classification strategies; (2) we train and test the models with a data set of over one million expert-curated pairs across five organisms; and (3) we demonstrate that our proposed models strongly outperform a genomic sequence identity baseline.

Summary

Introduction

Duplication is a central data quality problem, impacting the volume of data that must be processed during data curation and computational analyses and leading to inconsistencies when contradictory or missing information on a given entity appears in a duplicated record. In genomic sequence databases, duplication has been a recognised issue since the 1990s [1]. Existing duplicate detection methods in sequence databases fall into two categories. One category defines duplicates using simple heuristics. These methods are very efficient, but may be overly simplistic, resulting in high levels of both false positive and false negative detections. For example, methods such as CD-HIT [2] treat records whose sequence identity meets a threshold (90% by default) as duplicates. Such methods can efficiently cluster sequences into groups. At least two questions remain: (1) Are records with high sequence identity really duplicates? The dataset used by one representative method contains only duplicates with identical sequences [3], whereas duplicates could be fragments or even sequences with relatively low identity, as we illustrate in this paper.
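To make the limitation of a fixed identity threshold concrete, the following is a minimal Python sketch. It is not CD-HIT's algorithm: the `identity` function is a hypothetical, deliberately crude position-by-position measure normalised by the longer sequence, and `is_duplicate_by_threshold` and the example sequences are illustrative assumptions. It only shows how a fragment of a record can fall well below a 90% cut-off even though a curator might regard the pair as duplicates.

```python
# Toy threshold heuristic, assuming an ungapped, position-wise identity.
# This is an illustration only; real tools use alignment-based identity.

def identity(a: str, b: str) -> float:
    """Fraction of matching positions, normalised by the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def is_duplicate_by_threshold(a: str, b: str, threshold: float = 0.9) -> bool:
    """Label a pair as duplicate when identity meets the threshold."""
    return identity(a, b) >= threshold


full = "ACGTACGTACGTACGTACGT"      # hypothetical 20-base record
near = "ACGTACGTACGTACGTACGA"      # same record with one substitution
fragment = full[:8]                # hypothetical partial submission

print(identity(full, near), is_duplicate_by_threshold(full, near))
print(identity(full, fragment), is_duplicate_by_threshold(full, fragment))
```

Under this toy measure the near-identical pair passes the threshold while the fragment pair does not, which is the kind of case the question above raises. A supervised model can draw on additional record attributes rather than a single identity score.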
