Is BCH Code Useful to DNA Classification as an Alignment-Free Method?

Taciana A. De Souza, Milena M. Arruda, Francisco M. De Assis

doi:10.1109/access.2021.3078138

Abstract

Similarities between biological and digital communication systems have been investigated because biology also uses a discrete alphabet to represent and transmit information. The genetic information of an organism is encoded in DNA molecules by units called bases. However, there is no definitive model, and the question of which error-correcting code underlies DNA sequences remains an open problem. Recent works show that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as BCH codes. We propose improvements to the code construction process that resulted in a novel algorithm for searching for BCH codes that have a codeword differing from a given DNA sequence (mapped to the finite field $\mathbb{F}_{4}$) in at most one symbol. The most important improvement is the replacement of brute-force decoding with syndrome decoding. Then, based on a statistical analysis, we verify whether, in a collection of sequences with the same taxonomic rank, there is a code that identifies most of these sequences, called the dominant code. Furthermore, we check whether the dominant code can provide biological information for DNA classification as an alignment-free method. Finally, we show that the probability of a DNA sequence of odd length $n$ being identified by a BCH code tends to the analytical probability of the same code identifying a random vector.

Highlights

The use of coding and information theory tools has been proposed in bioinformatics.
In this paper, besides proposing improvements to the DNA Sequence Generation Algorithm that resulted in a novel algorithm, we investigate whether it is effective to determine the BCH code that identifies most of the DNA sequences in a collection, where the sequences stem from neighboring organisms in a phylogenetic tree.
For a given DNA sequence, the following verification is repeated: the parity-check matrix of a code is used to decide whether the sequence is a codeword, so brute force is used to analyze the sequences that differ from it in only one nucleotide.

Summary

INTRODUCTION

The use of coding and information theory tools has been proposed in bioinformatics. Examples include DNA-based data storage systems [1], [2], hiding data in DNA [3], [4], and finding the error-correcting code underlying DNA sequences [5], [6]. In this paper, besides proposing improvements to the DNA Sequence Generation Algorithm that resulted in a novel algorithm, we investigate whether it is effective to determine the BCH code that identifies most of the DNA sequences in a collection, where the sequences stem from neighboring organisms in a phylogenetic tree. We refer to this code as the dominant code.
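To make the notion of a dominant code concrete, the following Python sketch selects, from a collection, the code that identifies the largest share of sequences. It is an illustration under stated assumptions, not the procedure used in this paper: the function name `dominant_code`, the input layout (a dictionary from sequence identifiers to the labels of the codes that identify them), and the plain plurality criterion are all hypothetical.

```python
from collections import Counter


def dominant_code(identified_by):
    """Return (code_label, fraction_of_sequences) for the most frequent code.

    identified_by: dict mapping a sequence id to an iterable of labels of the
    codes that identify that sequence (hypothetical layout, for illustration).
    """
    counts = Counter()
    for codes in identified_by.values():
        # Count each code at most once per sequence.
        counts.update(set(codes))
    if not counts:
        return None, 0.0
    code, hits = counts.most_common(1)[0]
    return code, hits / len(identified_by)


# Made-up labels and identifiers, used only to show the call.
example = {"seq1": ["g1", "g2"], "seq2": ["g1"], "seq3": []}
best, share = dominant_code(example)
```

Here `best` is the label that covers the most sequences and `share` is the fraction of the collection it covers. A statistical criterion for dominance could be substituted for the plurality rule without changing the interface.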

PRELIMINARIES

Initialize r to the DNA sequence.
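As a minimal sketch of syndrome-based identification over $\mathbb{F}_{4}$, the Python fragment below maps a DNA string to a vector r, computes its syndrome with a parity-check matrix, and decides whether r is a codeword or differs from one in a single symbol. Several elements are assumptions made only for illustration: the nucleotide labeling `BASE_TO_GF4` (one of 24 possible labelings), the encoding of $\mathbb{F}_{4}$ as the integers 0 to 3, and the toy matrix `H`, a [5, 3] Hamming-type parity-check matrix used as a stand-in for the parity-check matrix of a BCH code.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3, with a^2 = a + 1.
# Addition is bitwise XOR; multiplication is given by this table.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]

# Hypothetical nucleotide labeling (an assumption, not taken from the paper).
BASE_TO_GF4 = {"A": 0, "C": 1, "G": 2, "T": 3}

# Toy parity-check matrix over GF(4): its columns are pairwise linearly
# independent, so the code has minimum distance 3 (stand-in for a BCH code).
H = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 2, 3],
]


def syndrome(H, r):
    """Compute H * r^T over GF(4)."""
    out = []
    for row in H:
        acc = 0
        for h, x in zip(row, r):
            acc ^= GF4_MUL[h][x]
        out.append(acc)
    return out


def identify(H, dna):
    """Return (0, r) if dna maps to a codeword, (1, c) if a single symbol
    change yields the codeword c, and None otherwise."""
    r = [BASE_TO_GF4[b] for b in dna]
    s = syndrome(H, r)
    if not any(s):
        return 0, r
    # A single error e at position j gives the syndrome e * (column j of H).
    for j in range(len(r)):
        col = [row[j] for row in H]
        for e in (1, 2, 3):
            if [GF4_MUL[e][c] for c in col] == s:
                fixed = list(r)
                fixed[j] ^= e  # subtraction equals addition in GF(4)
                return 1, fixed
    return None


exact = identify(H, "CCCAA")
near = identify(H, "CCCAG")
```

The syndrome test needs one matrix-vector product and at most 3n column comparisons, instead of re-checking each of the 3n single-symbol variants of the sequence. Note that this toy code is perfect, so every length-5 word lies within one symbol of a codeword; a longer BCH code would in general leave many words unidentified.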

EXPERIMENTAL DATA

BIOLOGICAL ANALYSIS

CONCLUSION