Patterns In Biological Sequences Research Articles

Finding a vaccine or specific antiviral treatment for a pandemic viral disease (such as the ongoing COVID-19) requires rapid analysis, annotation, and evaluation of metagenomic libraries to enable quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are too slow for this purpose, so fast alignment-free techniques for sequence analysis are needed. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of distance measures based on compression complexity (Effort-to-Compress, or ETC, and Lempel-Ziv, or LZ, complexity) for analyzing genomic sequences. The proposed distance measure successfully reproduces the phylogenetic trees for three datasets: a mammalian dataset consisting of eight species clusters; a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1; and a set of coronaviruses comprising those causing COVID-19 (SARS-CoV-2) and those not causing COVID-19. Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic), a linear discriminant, and a fine K-Nearest Neighbors classifier are used for classification. On a data set of 1001 coronavirus sequences (those causing COVID-19 and those not causing COVID-19), a classification accuracy of 98% is achieved, with a sensitivity of 95% and a specificity of 99.8%. This work could be extended to enable medical practitioners to identify and characterize coronavirus strains and their rapidly growing mutants automatically, quickly, and efficiently.
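To make the idea of a compression-complexity distance concrete, here is a minimal Python sketch. It counts phrases in a Lempel-Ziv (1976-style) parsing and combines the counts into a normalized distance. The function names `lz_complexity` and `lz_distance` are hypothetical, and the particular normalized form used here is an assumption chosen for illustration; the abstract does not state the exact distance formula, and ETC is not shown.

```python
def lz_complexity(seq: str) -> int:
    """Count phrases in a Lempel-Ziv (1976-style) parsing of seq.

    Each phrase is the shortest substring starting at the current position
    that has not already occurred earlier (overlap with itself allowed).
    Naive implementation for illustration; too slow for full genomes.
    """
    i, phrases, n = 0, 0, len(seq)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def lz_distance(s: str, q: str) -> float:
    """Normalized LZ-based distance (assumed form, not taken from the paper).

    Measures how many new phrases each sequence needs once the other one
    is already known, relative to the larger individual complexity.
    """
    cs, cq = lz_complexity(s), lz_complexity(q)
    extra = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return extra / max(cs, cq)


if __name__ == "__main__":
    a = "ACGTACGTACGT"
    b = "AAAAAAAAAAAA"
    print(lz_complexity(a), lz_complexity(b))
    print(lz_distance(a, a), lz_distance(a, b))
```

A matrix of such pairwise distances could then be passed to a standard tree-building method or used as classifier features, which is the general workflow the abstract describes.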


Extracting meaningful patterns from biological sequence data is an important approach to discovering the regulatory rules of genes and proteins and their inheritance relationships. Existing sequence pattern discovery algorithms, such as MSPM and FBSB, suffer from low efficiency and accuracy. To address this issue, this paper presents a new algorithm for biological sequence pattern mining, abbreviated MpBsmi, based on a data index structure. The MpBsmi algorithm employs a sequence position table (ST) and a sequence database index structure (DB-Index) for data storage, mining, and pattern expansion. The ST and DB-Index of single items are first obtained by scanning the sequence database once. A new fast support counting algorithm then mines the ST to identify the frequent single items. Based on a connection strategy, the frequent patterns are expanded, and the expanded ST is updated by scanning the DB-Index. The fast support counting algorithm is then used again to obtain the frequent expanded patterns. Finally, a new pruning technique for expanded patterns avoids generating an unnecessarily large number of candidate patterns. Experimental results on multiple classical protein sequences from the Pfam database validate the accuracy, stability, and scalability of the proposed algorithm. The results show that the proposed algorithm achieves better space efficiency, stability, and scalability than MSPM and FBSB, the two main algorithms for biological sequence mining.
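The general scheme described above (build a position table in one scan, count support, expand frequent patterns, prune the rest) can be sketched as follows. This is a simplified illustration under stated assumptions, not the MpBsmi algorithm itself: the function name `mine_contiguous_patterns` is hypothetical, support is assumed to mean the number of sequences containing a pattern, patterns are extended by one item on the right, and no separate DB-Index structure is modeled.

```python
from collections import defaultdict


def mine_contiguous_patterns(sequences, min_support):
    """Return {pattern: support} for contiguous patterns found in at least
    min_support sequences (assumed definition of support)."""
    # One scan of the database: position table for single items,
    # stored as pattern -> {sequence id -> [start positions]}.
    table = defaultdict(lambda: defaultdict(list))
    for sid, seq in enumerate(sequences):
        for pos, item in enumerate(seq):
            table[item][sid].append(pos)

    # Keep only frequent single items.
    current = {p: dict(occ) for p, occ in table.items()
               if len(occ) >= min_support}
    frequent = {}

    while current:
        for pattern, occ in current.items():
            frequent[pattern] = len(occ)

        # Expand each frequent pattern by the item that follows each of
        # its occurrences, reusing stored positions instead of rescanning.
        extended = defaultdict(lambda: defaultdict(list))
        for pattern, occ in current.items():
            for sid, starts in occ.items():
                seq = sequences[sid]
                for start in starts:
                    end = start + len(pattern)
                    if end < len(seq):
                        extended[pattern + seq[end]][sid].append(start)

        # Prune: infrequent candidates are dropped and never expanded.
        current = {p: dict(occ) for p, occ in extended.items()
                   if len(occ) >= min_support}

    return frequent


if __name__ == "__main__":
    print(mine_contiguous_patterns(["ACGT", "ACGA", "TACG"], min_support=3))
```

Because an infrequent pattern cannot have a frequent extension, pruning at each level keeps the candidate set small, which is the motivation for the pruning step mentioned in the abstract.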


Articles published on Patterns In Biological Sequences

The potential and pitfalls of large language models in molecular biosciences

Trie-PMS8: A trie-tree based robust solution for planted motif search problem

Compression-Complexity Measures for Analysis and Classification of Coronaviruses

Neural network architecture search with AMBER

EMS3: An Improved Algorithm for Finding Edit-Distance Based Motifs.

Asymmetron: a toolkit for the identification of strand asymmetry patterns in biological sequences.

MFEA: An evolutionary approach for motif finding in DNA sequences

Discovering of gapped motifs using particle swarm optimisation

Frequent Patterns Algorithm of Biological Sequences based on Pattern Prefix-tree

Patscanui: an intuitive web interface for searching patterns in DNA and protein data.

MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure

WITHDRAWN: Biological Sequence Pattern Mining Algorithm Based on Data Index Technology

A Method to Avoid Gapped Sequential Patterns in Biological Sequences: Case Study: HIV and Cancer Sequences

Randomised sequential and parallel algorithms for efficient quorum planted motif search

SeqFeatR for the Discovery of Feature-Sequence Associations.

Glucose-Based Regulation of miR-451/AMPK Signaling Depends on the OCT1 Transcription Factor

QPMS9: an efficient algorithm for quorum Planted Motif Search.

TrieAMD: a scalable and efficient apriori motif discovery approach.

An improved voting algorithm for planted (l, d) motif search

DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences
