Vector space classification of DNA sequences

H.-M. Müller, S.E. Koonin

doi:10.1016/s0022-5193(03)00082-1

Abstract

Revisiting the problem of intron–exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
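The pipeline the abstract describes (word-content vectors, then PCA, then Gaussian sequence classes) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the word length k, the use of relative frequencies, and the choice to score only along the leading principal components are all hypothetical. The overlapping=False switch mimics a non-overlapping word count such as the hexamer measure (k=6) mentioned above.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"


def vocabulary(k):
    """All 4**k words of length k, in a fixed order (hypothetical helper)."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def word_vector(seq, k=3, overlapping=True):
    """Relative frequencies of k-letter words in seq (a 'document vector')."""
    index = {w: i for i, w in enumerate(vocabulary(k))}
    step = 1 if overlapping else k  # step k gives a non-overlapping count
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(0, len(seq) - k + 1, step):
        j = index.get(seq[i:i + k])  # words with other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts


def fit_class(vectors, n_components=2):
    """PCA of one class: mean, leading axes, and variance along each axis."""
    X = np.asarray(vectors, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes of the centered data.
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    axes = Vt[:n_components]
    var = S[:n_components] ** 2 / max(len(X) - 1, 1)
    return mean, axes, var


def log_score(vec, model, eps=1e-9):
    """Gaussian log-density (up to a constant) along the leading axes.

    Assumption: variation outside the leading axes is ignored.
    """
    mean, axes, var = model
    z = axes @ (np.asarray(vec, dtype=float) - mean)
    return float(-0.5 * np.sum(z ** 2 / (var + eps) + np.log(var + eps)))
```

Under these assumptions, one model would be fitted per class (for example one for introns and one for exons), and a new sequence would be assigned to the class whose model gives the higher log_score.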

Journal: Journal of Theoretical Biology
Publication Date: Apr 24, 2003
Citations: 18