Abstract

Biological pairwise sequence alignment arranges the characters of two biological sequences to identify regions of similarity. This operation has attracted considerable interest because of its significant influence on critical aspects of life (e.g., identifying mutations in coronaviruses). Sequence alignment over large databases cannot yield results within a reasonable time, power, and cost. Although heuristic methods, such as FASTA and the BLAST family, have been demonstrated to perform 40 times faster than dynamic programming (DP)-based techniques (e.g., Needleman-Wunsch), they cannot guarantee an optimum alignment result. This study describes an optimized software platform for a widely used DNA sequence alignment algorithm, the Needleman-Wunsch (NW) algorithm, based on a lookup table. This global alignment algorithm is the best approach for identifying similar regions between sequences. This study presents a new application of classical machine learning (ML) to global sequence alignment. Customized ML models are used to implement NW global alignment. An accuracy of 99.7% is achieved when using a multilayer perceptron with the ADAM optimizer, and up to 2912 giga cell updates per second (GCUPS) are realized on two real DNA sequences with a length of 4.1 M nucleotides. Our implementation is valid for RNA/DNA sequences. This study aims to parallelize the computation steps involved in the algorithm to accelerate its performance by using ML algorithms. All datasets used in this study are available from https://ieee-dataport.org/documents/dna-sequence-alignment-datasets-based-nw-algorithm.
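
The throughput figure above is expressed in cell updates per second, i.e., the number of DP matrix cells filled divided by the elapsed time. The following is a minimal sketch of that metric, assuming the full matrix of len_a x len_b cells is computed. The function names gcups and seconds_at are hypothetical and are not taken from the study.

```python
def gcups(len_a, len_b, seconds):
    """Giga cell updates per second for filling a len_a x len_b DP matrix."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (len_a * len_b) / seconds / 1e9


def seconds_at(len_a, len_b, rate_gcups):
    """Time needed to fill a len_a x len_b matrix at a given GCUPS rate."""
    if rate_gcups <= 0:
        raise ValueError("rate_gcups must be positive")
    return (len_a * len_b) / (rate_gcups * 1e9)


# Assumption: both sequences have 4.1 M nucleotides and every cell is updated.
n = 4_100_000
t = seconds_at(n, n, 2912)
print(f"{n} x {n} cells at 2912 GCUPS: about {t:.2f} s")
```

The printed time is only what the quoted rate would imply under these assumptions; it is not a measurement reported by the study.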

Highlights

  • Bioinformatics has developed due to the need for understanding the code of life, i.e., deoxyribonucleic acid (DNA)

  • Bioinformatics is an integration of biology and informatics because it includes the innovation of using computers in the measurement, recovery, control, and appropriation of information related to natural macromolecules, such as DNA, ribonucleic acid (RNA), and proteins

  • In the current study, we propose the use of equal-length sequences. The approach can be applied to DNA or RNA sequences because both consist of a four-letter alphabet that represents the four nucleotides (NTs)


Summary

Introduction

Bioinformatics has developed from the need to understand the code of life, i.e., deoxyribonucleic acid (DNA). Bioinformatics is an integration of biology and informatics because it includes the innovation of using computers in the measurement, recovery, control, and appropriation of information related to natural macromolecules, such as DNA, RNA, and proteins. Research endeavors in this field include genome assembly, sequence alignment, drug design, gene finding, drug discovery, protein structure alignment, and protein structure prediction [2].

In the DP recurrence for global alignment, each cell H(i, j) of the score matrix is computed from its three neighboring cells, where S(Ai, Bj) is the score for aligning characters Ai and Bj and W is the gap penalty:

    Match ← H(i−1, j−1) + S(Ai, Bj)
    Delete ← H(i−1, j) + W
    Insert ← H(i, j−1) + W
    H(i, j) ← max(Match, Insert, Delete)

This algorithm requires a long running time, O(MN) for sequences of lengths M and N, when aligning two extremely long sequences. It can be applied to problems that consist of overlapping subproblems (e.g., aligning two sequences of unequal length).
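
The recurrence above can be sketched in a few lines of Python. This is a plain reference version of the score-matrix fill, not the lookup-table or ML-based implementation described in the study. The scoring values (match = 1, mismatch = -1, gap = -2) and the function names nw_score_matrix and nw_score are assumptions chosen for illustration.

```python
def nw_score_matrix(a, b, match=1, mismatch=-1, gap=-2):
    """Fill the global-alignment score matrix H for sequences a and b.

    Scoring values are illustrative assumptions, not taken from the study.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]

    # First column and first row: aligning a prefix against only gaps.
    for i in range(1, m + 1):
        H[i][0] = i * gap
    for j in range(1, n + 1):
        H[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # S(Ai, Bj)
            match_score = H[i - 1][j - 1] + s                # Match
            delete = H[i - 1][j] + gap                       # Delete
            insert = H[i][j - 1] + gap                       # Insert
            H[i][j] = max(match_score, insert, delete)
    return H


def nw_score(a, b, **scoring):
    """Optimal global alignment score: the bottom-right cell of H."""
    return nw_score_matrix(a, b, **scoring)[len(a)][len(b)]


print(nw_score("GATTACA", "GATTACA"))  # 7: seven matches, no gaps
print(nw_score("ACGT", "ACT"))         # 1: three matches and one gap
```

The two nested loops visit every one of the M x N cells once, which is where the O(MN) running time comes from. Cells on the same anti-diagonal do not depend on one another, which is the usual starting point for parallelizing this computation.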
