Abstract

BackgroundExisting tree and forest methods are powerful bioinformatics tools to explore high dimensional data including high throughput genomic data. However, they cannot deal with the data generated by recent genotyping platforms for single nucleotide polymorphisms due to the massive size of the data and its excessive memory demand.ResultsUsing the recursive partitioning technique, we developed a new software package, Willows, to maximize the utility of the computer memory and make it feasible to analyze massive genotype data. This package includes three tree-based methods – classification tree, random forest, and deterministic forest, and can efficiently handle the massive amount of SNP data. In addition, this package can easily set different options (e.g., algorithms and specifications) and predict the class of test samples.ConclusionWe developed Willows in a user friendly interface with the goal of maximizing the use of memory, which is critical for analysis of genomic data. The Willows package is well documented and publicly available at .

Highlights

  • Existing tree and forest methods are powerful bioinformatics tools to explore high dimensional data including high throughput genomic data

  • The genotype data from the Framingham Heart Study (FHS, 9,300 subjects and 550,000 single nucleotide polymorphisms (SNPs)) require more than 38.1 GB memory for input when each genotype at a SNP marker is stored in the double data type or 4.8 GB when stored in the byte type

  • It is noteworthy that PLINK [15] and Chen, et al [16] already utilize efficient memory use algorithms similar to what we propose to use in trees and forests, and the compressed data format designed by PLINK has been adopted by NCBI to distribute genomewide association (GWA) data

Read more

Summary

Results

Using the recursive partitioning technique, we developed a new software package, Willows, to maximize the utility of the computer memory and make it feasible to analyze massive genotype data. This package includes three tree-based methods – classification tree, random forest, and deterministic forest, and can efficiently handle the massive amount of SNP data. This package can set different options (e.g., algorithms and specifications) and predict the class of test samples

Background
Results and discussion
Conclusion
17. Breiman L
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call