DeLUCS: Deep learning for unsupervised clustering of DNA sequences.

Pablo Millán Arias, Fatemeh Alipour, Kathleen A Hill, Lila Kari

doi:10.1371/journal.pone.0261531

Open Access

https://doi.org/10.1371/journal.pone.0261531

Journal: PloS one
Publication Date: Jan 21, 2022
Citations: 25
License type: CC BY 4.0

Affiliation: University of Waterloo, Western University

Abstract

We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates “mimic” sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genus; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; and three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods for unlabelled data (K-means++ and Gaussian Mixture Models) by as much as 47%. DeLUCS is highly effective: it can cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
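
To make the FCGR input mentioned in the abstract concrete, here is a minimal sketch of how a Frequency Chaos Game Representation can be computed from a DNA string. It is an illustration only, not the DeLUCS implementation: the function name `fcgr`, the default `k=6`, the corner assignment for A, C, G, T, and the choice to skip k-mers containing ambiguous bases are all assumptions made for this example.

```python
import numpy as np

# Assumed corner convention, stored as (x_bit, y_bit); other layouts are in use.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k=6):
    """Return a 2^k x 2^k matrix of k-mer counts laid out as a chaos game image."""
    size = 2 ** k
    grid = np.zeros((size, size), dtype=np.int64)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # assumption: skip k-mers with N or other ambiguity codes
        x = y = 0
        # In the chaos game the last nucleotide picks the quadrant, the one
        # before it the sub-quadrant, and so on, so the last base supplies
        # the most significant bit of each cell coordinate.
        for base in reversed(kmer):
            bx, by = CORNERS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y, x] += 1
    return grid


if __name__ == "__main__":
    image = fcgr("ACGTACGTTTGACA", k=2)
    print(image.shape)  # (4, 4)
    print(image.sum())  # number of valid 2-mers counted
```

Because every sequence maps to a matrix of the same size regardless of its length, a representation of this kind lets sequences of different lengths be compared without alignment, which is the property the abstract relies on.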

Summary

Introduction

Traditional DNA sequence classification algorithms rely on large amounts of labour-intensive, expert-mediated annotation of primary DNA sequences to establish their origin and function. Some of these genome annotations are unstable, reflecting inaccuracies and provisional assignments made with limited information, knowledge, or characterization. Since there is no taxonomic “ground truth,” taxonomic labels can be subject to dispute (see, e.g., [1,2,3]). As methods for determining phylogeny, evolutionary relationships, and taxonomy shifted from physical to molecular characteristics, taxonomic assignments were sometimes revised repeatedly.
