Abstract

Predicting DNase I hypersensitive sites (DHSs) is an essential task in the study of transcriptional regulatory elements, as it provides clues for deciphering the function of noncoding genomic regions. Several computational approaches are currently available for predicting DHSs in plant genomes, but there is still room for improvement. In the present work, a method based on Dempster-Shafer (DS) evidence theory was proposed. First, four sequence-derived feature representation methods, i.e., kmer, reverse complement kmer, mismatch profile, and pseudo dinucleotide composition, were used to encode the sequences. Then, four support vector machine based sub-classifiers were built with these sequence-derived features. Finally, DS evidence theory was applied to obtain the final results by fusing the outputs of the four base learners. To address the data imbalance problem, a bidirectional synthetic sampling algorithm was proposed to obtain a balanced dataset for model training. In the computational experiments, the proposed method achieved accuracies of up to 88.85% and 88.60% on the Arabidopsis thaliana and rice genomes, respectively. Compared with existing DHS prediction methods, the proposed method achieves comparable or better performance, suggesting its usefulness for DHS prediction.
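
As a rough illustration of the fusion step, the sketch below applies Dempster's rule of combination to the outputs of four base classifiers over the two-class frame {DHS, non-DHS}. The abstract does not say how classifier outputs are converted into mass functions, so the `mass_from_probability` mapping, the `reliability` discount, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
from functools import reduce
from itertools import product

# Frame of discernment: two singleton hypotheses plus the full set ("uncertain").
DHS = frozenset({"DHS"})
NON = frozenset({"non-DHS"})
THETA = DHS | NON


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function is a dict mapping a frozenset (focal element) to its mass.
    """
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to contradictory hypotheses is treated as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


def mass_from_probability(p_dhs, reliability=0.9):
    """Hypothetical mapping (an assumption): discount a classifier's
    DHS probability by a reliability weight and give the remainder to THETA."""
    return {
        DHS: reliability * p_dhs,
        NON: reliability * (1.0 - p_dhs),
        THETA: 1.0 - reliability,
    }


def fuse(probabilities, reliability=0.9):
    """Fuse the DHS probabilities of several base classifiers into one mass function."""
    masses = [mass_from_probability(p, reliability) for p in probabilities]
    return reduce(combine, masses)


def predict(probabilities, reliability=0.9):
    """Return the label with the larger fused singleton mass."""
    fused = fuse(probabilities, reliability)
    return "DHS" if fused.get(DHS, 0.0) >= fused.get(NON, 0.0) else "non-DHS"
```

Calling `predict` with the four sub-classifier probabilities for a sequence (for example, one each from the kmer, reverse complement kmer, mismatch profile, and pseudo dinucleotide composition models) returns the fused label. Keeping some mass on the full frame means a single overconfident classifier cannot drive the conflict term to 1.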
