Background
Current laboratory tests are less than 50% accurate in distinguishing between people who have food allergies (FA) and those who are merely sensitized to foods, resulting in the use of expensive and potentially dangerous Oral Food Challenges. This study presents a purely computational machine learning approach, conducted using DNA methylation (DNAm) data, to accurately diagnose food allergies and potentially find epigenetic targets for the disease.

Methods and results
An unbiased feature-selection pipeline was created that narrowed down 405,000+ potential CpG biomarkers to 18. Machine-learning models that utilized subsets of this 18-feature aggregate achieved perfect classification accuracy on completely hidden test cohorts (on an 8-fold hidden dataset). Ensemble classification was also shown to be effective for this High Dimension Low Sample Size (HDLSS) DNA methylation dataset. The efficacy of these machine learning classifiers and the 18 CpGs was further validated by their high accuracy on a large number of hidden data permutations, where the samples in the training, cross-validation, and hidden sets were repeatedly randomly allocated. The 18-CpG signature mapped to 13 genes, on which biological insights were collected. Notably, many of the FA-discriminating genes found in this study were strongly associated with the immune system, and seven of the 13 genes were previously associated with FA.

Conclusions
Previous studies have also created highly accurate classifiers for this dataset, using both data-driven and a priori biological insights to construct a 96-CpG signature. This research builds on that work by using a completely computational approach to obtain perfect classification accuracy with only 18 highly discriminating CpGs (less than 0.005% of the total available features). In machine learning, simpler models, as used in this study, are generally preferred over more complex ones (other things being equal).
Lastly, the completely data-driven methodology presented in this research eliminates the need for a priori biological information and allows for generalizability to other DNAm classification problems.
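
The validation idea described above (repeatedly re-allocating samples at random and selecting features without seeing the hidden samples) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the synthetic data, the univariate ranking in `rank_features`, the nearest-centroid classifier, and every function name and parameter below are assumptions chosen only to keep the example short and dependency-free. The point it shows is that feature selection is redone inside each random split using training samples only, so the held-out samples stay hidden.

```python
import random
import statistics


def make_synthetic(n_samples=40, n_features=200, n_informative=5, shift=3.0, seed=0):
    """Synthetic stand-in for a DNAm matrix (rows: samples, columns: CpGs).

    Assumption for illustration only: the first n_informative columns are
    shifted in class 1; all other columns are pure noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        if label == 1:
            for j in range(n_informative):
                row[j] += shift
        X.append(row)
        y.append(label)
    return X, y


def rank_features(X, y, k):
    """Indices of the k features with the largest standardized mean difference."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        spread = statistics.pstdev(a) + statistics.pstdev(b) + 1e-9
        scores.append((abs(statistics.fmean(a) - statistics.fmean(b)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_centroids(X, y, feats):
    """Per-class mean vector over the selected features."""
    centroids = {}
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [statistics.fmean([r[j] for r in rows]) for j in feats]
    return centroids


def predict(centroids, row, feats):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(lab):
        return sum((row[j] - c) ** 2 for j, c in zip(feats, centroids[lab]))
    return min(centroids, key=dist)


def repeated_holdout(X, y, k=5, n_repeats=20, test_fraction=0.25, seed=1):
    """Accuracy on a freshly drawn hidden set for each random re-allocation."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        n_test = int(len(idx) * test_fraction)
        test, train = idx[:n_test], idx[n_test:]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        # Selection and fitting see training samples only (no leakage).
        feats = rank_features(X_train, y_train, k)
        centroids = fit_centroids(X_train, y_train, feats)
        correct = sum(predict(centroids, X[i], feats) == y[i] for i in test)
        accuracies.append(correct / n_test)
    return accuracies


if __name__ == "__main__":
    X, y = make_synthetic()
    accs = repeated_holdout(X, y)
    print("mean hidden-set accuracy:", statistics.fmean(accs))
```

Selecting features on the full dataset before splitting would let information from the hidden samples leak into the signature, which tends to inflate accuracy in HDLSS settings; redoing the selection inside each split is what keeps the estimate honest in this sketch.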