In silico Identification and Assessment of Potential Chloroplast DNA Barcodes for Discriminating Brassicaceous Species using Machine Learning Algorithms

Bhupinder Pal Singh, Ajay Kumar, Avinash Kaur Nagpal, Harpreet Singh

doi:10.25303/1701rjbt6496

Abstract

Complete chloroplast genome sequences of 89 Brassicaceous species (42 genera) were used to identify and assess potential DNA barcodes at the genus and species levels. Sliding window analysis was performed on the aligned sequences to identify hyper-variable regions based on nucleotide diversity (π). Of the 23 hyper-variable regions identified, three coding regions (ycf1, ndhF and ndhA) and three combinations of coding and non-coding regions (‘ndhH-rps15, rps15, rps15-ycf1, ycf1’; ‘ccsA/ycf5, ccsA/ycf5-ndhD, ndhD’; and ‘ndhE-ndhG, ndhG, ndhG-ndhI, ndhI’) were selected for sequence enrichment. These regions were assessed using six supervised machine learning algorithms in WEKA (J48, JRip, SMO, Naive Bayes, Random Forest and KNN), along with a distance-based method using the ‘nearneighbour’ function in SPIDER. ycf1 was the most efficient region for discriminating Brassicaceous species, with an average identification rate of 76% and a maximum identification rate of 86% at the species level. Three other regions, namely ndhA, ‘ndhH-rps15, rps15, rps15-ycf1, ycf1’ and ‘ccsA/ycf5, ccsA/ycf5-ndhD, ndhD’, were found to be more efficient than the well-established markers matK and rbcL and hence can be used as potential DNA barcodes for the family Brassicaceae. Among the methods compared, SMO, Random Forest and KNN, along with the distance-based method SPIDER (NN), were more efficient and stable than JRip, J48 and Naive Bayes.
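The sliding window step described in the abstract can be illustrated with a minimal Python sketch. It computes nucleotide diversity (π) as the mean proportion of differing sites over all sequence pairs within each window of an alignment. The function names, the window and step sizes, and the rule of skipping gaps and ambiguous bases are assumptions made for illustration; the abstract does not state the software or parameters the authors used.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def pairwise_difference(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped
    (an assumption of this sketch, not a rule taken from the paper).
    """
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a in VALID_BASES and base_b in VALID_BASES:
            compared += 1
            if base_a != base_b:
                differences += 1
    return differences / compared if compared else 0.0


def nucleotide_diversity(sequences):
    """Mean pairwise difference per site (pi) over all sequence pairs."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)


def sliding_window_pi(sequences, window=600, step=200):
    """Return (window_start, pi) tuples along an alignment.

    The default window and step sizes are hypothetical placeholders.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [seq[start:start + window] for seq in sequences]
        results.append((start, nucleotide_diversity(chunk)))
    return results


if __name__ == "__main__":
    # Toy alignment: the second half is more variable than the first.
    toy_alignment = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT"]
    for start, pi in sliding_window_pi(toy_alignment, window=4, step=4):
        print(start, round(pi, 4))
```

Windows whose π stands well above the alignment-wide level would be the candidates for hyper-variable regions. How that threshold is chosen is left open here, since the abstract does not specify it.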
