Abstract

Background: Single nucleotide polymorphisms (SNP) constitute more than 90% of the genetic variation and hence can account for most trait differences among individuals in a given species. The polymorphism detection programs PolyBayes and PolyPhred produce many false positive SNP predictions even with stringent parameter values. We developed a machine learning (ML) method to augment PolyBayes and improve its prediction accuracy. ML methods have also been applied successfully to other bioinformatics problems, including the prediction of genes, promoters, transcription factor binding sites and protein structures.

Results: The ML program C4.5 was applied to a set of features to build a SNP classifier from training data labeled by human experts (True/False). The training data were 27,275 candidate SNP generated by sequencing 1973 sequence-tagged sites (STS) (12 Mb) in both directions from 6 diverse homozygous soybean cultivars, followed by PolyBayes analysis. Test data of 18,390 candidate SNP were generated similarly from 1359 additional STS (8 Mb). SNP from both sets were classified by experts. After training, the ML classifier agreed with the experts on 97.3% of the test data, compared with 7.8% agreement between PolyBayes and the experts. The PolyBayes positive predictive values (PPV), i.e., the fraction of candidate SNP that are real, were 7.8% for all predictions and 16.7% for those with 100% posterior probability of being real. Using ML improved the PPV to 84.8%, a 5- to 10-fold increase. While ML and PolyBayes produced a similar number of true positives, the ML program generated only 249 false positives, compared with 16,955 for PolyBayes. The complexity of the soybean genome may have contributed to the high rate of false SNP predictions by PolyBayes, so results may differ for other genomes.

Conclusion: A machine learning (ML) method was developed as a supplement to polymorphism detection software to improve prediction accuracy. The results of this study indicate that a trained ML classifier can significantly reduce human intervention; in this case it achieved a 5- to 10-fold gain in productivity. The optimized feature set and ML framework can also be applied to all polymorphism discovery software. The ML support software is written in Perl and can be easily integrated into an existing SNP discovery pipeline.
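The PPV figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the authors' Perl software. It assumes that every test-set candidate counts as a positive PolyBayes call (which matches the reported 7.8% PPV "for all predictions"), and it backs out the ML true-positive count from the reported PPV and false-positive count, because the abstract does not state that number directly. All variable names are made up for this example.

```python
def ppv(true_pos: int, false_pos: int) -> float:
    """Positive predictive value: fraction of positive calls that are real."""
    return true_pos / (true_pos + false_pos)


# Figures quoted in the abstract (test set).
CANDIDATES = 18_390        # candidate SNP in the test set
POLYBAYES_FP = 16_955      # PolyBayes false positives
ML_FP = 249                # ML classifier false positives
ML_PPV_REPORTED = 0.848    # reported PPV of the ML classifier
POLYBAYES_PPV_P100 = 0.167 # reported PolyBayes PPV at 100% posterior probability

# Assumption: all test candidates are positive PolyBayes calls,
# so the remainder after removing false positives are true positives.
polybayes_tp = CANDIDATES - POLYBAYES_FP
polybayes_ppv = ppv(polybayes_tp, POLYBAYES_FP)

# The ML true-positive count is not given; derive it from
# PPV = TP / (TP + FP)  =>  TP = PPV * FP / (1 - PPV).
ml_tp_implied = round(ML_PPV_REPORTED * ML_FP / (1 - ML_PPV_REPORTED))

# Fold improvement of the ML PPV over the two PolyBayes baselines.
fold_vs_all = ML_PPV_REPORTED / polybayes_ppv
fold_vs_p100 = ML_PPV_REPORTED / POLYBAYES_PPV_P100

print(f"PolyBayes TP = {polybayes_tp}, PPV = {polybayes_ppv:.3f}")
print(f"Implied ML TP = {ml_tp_implied}, PPV = {ppv(ml_tp_implied, ML_FP):.3f}")
print(f"Fold increase: {fold_vs_p100:.1f}x to {fold_vs_all:.1f}x")
```

Under these assumptions the two true-positive counts come out close to each other, in line with the statement that both methods produced a similar number of true positives, and the two ratios bracket the quoted 5- to 10-fold range.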

Highlights

  • Single nucleotide polymorphisms (SNP) constitute more than 90% of the genetic variation, and can account for most trait differences among individuals in a given species

  • Training and test data: the training and test candidate polymorphism data used to implement the machine learning (ML) algorithms were extracted from a large-scale soybean STS amplification and sequencing project

  • A total of 3332 STS comprising 20 Mb were sequenced in both directions in 6 inbred individuals, each representing one of 6 diverse soybean genotypes previously identified by Zhu et al. [12]


Summary

Introduction

Single nucleotide polymorphisms (SNP) constitute more than 90% of the genetic variation, and can account for most trait differences among individuals in a given species. We developed a machine learning (ML) method to augment PolyBayes and improve its prediction accuracy. Machine learning programs are advantageous in the many cases where input/output pairs can be specified but the precise relationship between inputs and outputs is not known. They can help extract the complex relationships and correlations hidden in large data sets (a process sometimes known as data mining). The prediction accuracy of machine learning programs varies with the type of problem, the dataset and the algorithm used. Examples of application domains include protein classification [1], tissue classification for different types of cancer [2], protein secondary structure prediction [3], text mining [4], protein-protein interactions [5] and RNA binding proteins [6]. Several free software suites are available, including Weka [7], C4.5 [8] and GIST [9].
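To make the idea of learning from input/output pairs concrete, the sketch below computes the gain ratio, the split criterion that C4.5 uses to choose which feature to test at each node of a decision tree. It is a toy illustration only: the two feature names and the six labeled rows are hypothetical and are not the feature set or data used in this study, and the code is not the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of a categorical feature: information gain / split information."""
    total = len(labels)
    # Group the labels by the value the feature takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features that split into many small groups.
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical candidate SNP described by two made-up categorical features,
# with hypothetical expert decisions (True = real SNP).
rows = [
    {"both_strands": "yes", "quality": "high"},
    {"both_strands": "yes", "quality": "low"},
    {"both_strands": "no", "quality": "high"},
    {"both_strands": "no", "quality": "low"},
    {"both_strands": "yes", "quality": "high"},
    {"both_strands": "no", "quality": "low"},
]
labels = [True, True, False, False, True, False]

scores = {f: gain_ratio(rows, labels, f) for f in ("both_strands", "quality")}
best_feature = max(scores, key=scores.get)
print(scores, best_feature)
```

In this toy set the first feature separates the expert labels perfectly, so it receives the higher gain ratio and would be chosen as the first split. A full decision-tree learner applies the same criterion recursively to each resulting subset.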

