Abstract

In molecular biology, biological macromolecules such as deoxyribonucleic acids (DNA) and proteins are encoded as strings, called ‘primary structures’. For a long time, biologists gathered these primary structures in large databases. Now, they focus on analyzing them in order to extract useful knowledge, and data mining approaches can help reach this goal. In this paper, we present a data mining approach based on machine learning techniques for the classification of biological sequences. Our approach proceeds in four steps. (1) In the first step, we construct the set of discriminant substrings, called the discriminant descriptor (DD), associated with each family of primary structures. This construction relies on an adaptation of the Karp, Miller and Rosenberg (KMR) algorithm. (2) In the second step, we use the DDs constructed during the first step to encode the families of primary structures as a table of examples versus attributes, called a ‘context’. (3) In the third step, we extract knowledge from the context constructed during the second step and represent it as production rules. This extraction uses an incremental production rules approach. (4) Finally, in the last step, we use the obtained production rules to classify primary structures.
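
The four steps above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions, not the method of the paper: step 1 uses brute-force substring enumeration in place of the adapted KMR algorithm, step 3 uses a naive one-attribute rule extractor in place of the incremental production rules approach, and all function names, the length bounds `min_len` and `max_len`, and the example sequences are hypothetical.

```python
def substrings(seq, min_len=2, max_len=3):
    """All substrings of seq whose length lies in [min_len, max_len]."""
    return {seq[i:i + k]
            for k in range(min_len, max_len + 1)
            for i in range(len(seq) - k + 1)}


def discriminant_descriptors(families, min_len=2, max_len=3):
    """Step 1 (brute-force stand-in for the adapted KMR algorithm):
    keep, for each family, the substrings found in that family only."""
    per_family = {
        name: set().union(*(substrings(s, min_len, max_len) for s in seqs))
        for name, seqs in families.items()
    }
    dds = {}
    for name, subs in per_family.items():
        others = set().union(*(v for k, v in per_family.items() if k != name))
        dds[name] = subs - others
    return dds


def build_context(families, dds):
    """Step 2: binary table of examples (sequences) versus attributes (DD substrings)."""
    attributes = sorted(set().union(*dds.values()))
    rows = [([int(a in s) for a in attributes], name)
            for name, seqs in families.items() for s in seqs]
    return attributes, rows


def extract_rules(attributes, rows):
    """Step 3 (naive stand-in for the incremental approach):
    emit 'contains attribute -> family' when all matching examples share one family."""
    rules = {}
    for j, attr in enumerate(attributes):
        labels = {label for vector, label in rows if vector[j]}
        if len(labels) == 1:
            rules[attr] = labels.pop()
    return rules


def classify(seq, rules):
    """Step 4: majority vote over the rules that fire; None if no rule fires."""
    votes = {}
    for attr, family in rules.items():
        if attr in seq:
            votes[family] = votes.get(family, 0) + 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy families of DNA-like strings (not data from the paper).
families = {"A": ["ACGTAC", "ACGGAC"], "B": ["TTGCAA", "TTGCCA"]}
dds = discriminant_descriptors(families)
attributes, rows = build_context(families, dds)
rules = extract_rules(attributes, rows)
print(classify("GGACGT", rules), classify("CTTGCA", rules))
```

In this sketch each rule has a single substring as its premise, which keeps the example short; rules with several conditions, and the incremental update of the rule set as new examples arrive, are left out.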
