Abstract

Big data in molecular biology has grown rapidly since the introduction of Next-Generation Sequencing (NGS), a technology for sequencing DNA at high throughput. Identifying nucleotide polymorphisms is an upstream analysis that supports downstream work such as producing quality seed from molecular markers in plant breeding. This paper discusses the identification of Single Nucleotide Polymorphisms (SNPs) in NGS data of cultivated soybean (Glycine max L.) using CART (Classification and Regression Tree). The CART model identified 51% of the true SNPs (true positive rate, TPR) with a precision of 67%. To improve the model's performance, Bootstrap Aggregating (bagging) CART was developed with a varied number of bootstrap replicates (11, 21, 31, 41, 51, 61, 71, 81, 91). The evaluation indicated a trade-off between TPR and precision: when the model's TPR increased, its precision decreased. For that reason, the F-measure was used as the evaluation metric. Bagging CART with 51 bootstrap replicates was the best model, identifying 60% of the true SNPs with a precision of 66% and an F-measure of 0.63, whereas the F-measure of the plain CART model was 0.58. The results indicate that applying bagging to CART can improve the model's performance as measured by the F-measure.
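The abstract does not state which form of the F-measure was used. As a minimal sketch, assuming the balanced F-measure (F1, the harmonic mean of precision and recall, with TPR taken as recall), the reported scores can be reproduced from the reported precision and TPR values. The function name `f_measure` is illustrative and does not come from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the paper's F-measure is the balanced F1 form, and its
    TPR plays the role of recall.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and TPR values as reported in the abstract.
cart_f = f_measure(precision=0.67, recall=0.51)
bagging_f = f_measure(precision=0.66, recall=0.60)

print(f"CART: {cart_f:.2f}")                   # CART: 0.58
print(f"Bagging CART (51): {bagging_f:.2f}")   # Bagging CART (51): 0.63
```

Under this assumption, both computed values match the figures given in the abstract, which suggests that the balanced form is the one that was reported.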
