Abstract

In 1948, Claude Shannon published a mathematical framework describing the probabilistic relationships between the letters of a natural language and the order in which they occur. By counting recurring sequences of letters called N-grams, this language model was used to generate recognizable English sentences from N-gram frequency probability tables. More recently, N-gram analysis methodologies have been successfully applied to complex problems in a variety of domains, from language processing to genomics. One such example is the common use of N-gram frequency patterns and supervised classification models to determine authorship and detect plagiarism. In this paradigm, DNA is treated as a language, where nucleotides are analogous to the letters of a word and nucleotide N-grams are analogous to the words of a sentence. Because DNA contains highly conserved and identifiable nucleotide sequence frequency patterns, this approach can be applied to a variety of classification and data reduction problems, such as identifying the species of origin of unknown DNA segments. Other useful applications of this methodology include the identification of functional gene elements, microorganisms, sequence contamination, and sequencing artifacts. To this end, I present DNAnamer, a generalized and extensible methodological framework and analysis toolkit for the supervised classification of DNA sequences based on their N-gram frequency patterns.
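The core idea can be sketched in a few lines of Python. This is an illustrative sketch of N-gram frequency profiling and profile comparison in general, not the DNAnamer implementation; the function names and the choice of cosine similarity are assumptions made for the example.

```python
from collections import Counter

def ngram_frequencies(seq, n=3):
    """Count overlapping nucleotide N-grams in seq and normalize to frequencies."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse N-gram frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q)

# A query sequence can then be assigned to whichever reference class
# (e.g. species) has the most similar N-gram frequency profile.
reference = ngram_frequencies("ATGCGATACGCTTGA", n=2)
query = ngram_frequencies("ATGCGATACG", n=2)
score = cosine_similarity(query, reference)
```

In a supervised setting, such frequency profiles would serve as feature vectors for a trained classifier rather than being compared by a single similarity score.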
