Machine learning algorithms can help address genomic problems in biological systems, such as identifying therapeutic targets for drug development and finding effective treatments for chronic diseases like cancer. Triplex-forming oligonucleotides (TFOs) recognize specific sites in the major groove of duplex DNA, which makes them useful for identifying therapeutic targets computationally. This study compares the performance of five classical machine learning algorithms, namely Support Vector Machine (SVM), k-Nearest Neighbours (kNN), Decision Tree (DT), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost), in classifying single-stranded DNA sequences as TFOs. Three datasets were created from the oligo library developed by Kaufmann et al., which targets the supF gene, and were used to train the algorithms. The classifiers were optimized through hyperparameter tuning, and their performance was evaluated using accuracy, precision, recall, and F1-score. SVM and kNN achieved less than 90% accuracy, while DT, RF, and XGBoost, all tree-based methods, attained over 90% accuracy. The XGBoost model performed best, with over 96% accuracy. These results indicate that supervised machine learning techniques can accurately classify single-stranded DNA sequences as TFOs.
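To make the four evaluation metrics concrete, the following minimal sketch computes accuracy, precision, recall, and F1-score from true and predicted labels. It is illustrative only: the function name `classification_metrics`, the label convention (1 for TFO, 0 otherwise), and the toy labels are assumptions, not the study's code or data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task.

    Hypothetical helper for illustration; `positive` marks the TFO class.
    """
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up for this example): 1 = TFO, 0 = not a TFO.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when the classes are imbalanced, since a classifier can reach high accuracy while missing most of the positive class.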