Abstract

In recent years, cheaper and continually improving sequencing and genotyping techniques have resulted in an explosion of high-dimensional genomic data. While these information-rich datasets are welcome as potentially useful in areas such as studies of complex diseases and pharmacogenetics, computational difficulties in storage, retrieval and analysis mean that extracting the information relevant to specific research questions is proving to be more difficult than perhaps was originally anticipated. From the statistical modelling perspective, the primary challenge is how to deal with the small number of individuals studied relative to the number of observed explanatory factors; the so-called 'curse of dimensionality'. An example is cancer pharmacogenomics, which is concerned with the identification of genetic variants that influence drug response. Inference based on multiple hypothesis testing (multiple comparison) procedures was initially preferred in high-dimensional genetic studies, mainly because of the comparative simplicity of Bonferroni and related error correction methods, and in spite of their well-known limitations of conservativeness and inefficiency. Today, despite the wide availability of analysis methodology such as p-value aggregation, dimension reduction, variable selection, pooling and Bayesian MCMC methods, the problem of inefficient inference remains unresolved, as current methods tend to be either highly computationally demanding or complex to implement and interpret. This talk highlights issues in interpreting information in genomic data and presents both theoretical and simulation results to argue for a novel inferential solution based on nonparametric regression that can reliably identify true positives while simultaneously minimizing the number of spurious findings.
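The following short Python sketch (not part of the talk itself; sample size, effect size and the z-test setting are illustrative assumptions) shows why Bonferroni correction becomes conservative in high-dimensional settings: as the number of tests m grows, the per-test threshold alpha/m shrinks, so the power to detect a fixed modest effect falls even though the family-wise error rate stays controlled.

# Illustrative sketch: loss of power under Bonferroni correction as m grows.
# Assumes a two-sided one-sample z-test with a modest standardised effect.
import numpy as np
from scipy import stats

alpha = 0.05          # target family-wise error rate
n = 50                # individuals per test (small sample, as in the abstract)
effect = 0.5          # assumed standardised mean shift for a true signal

for m in (10, 1_000, 100_000):          # number of hypotheses tested
    threshold = alpha / m               # Bonferroni-adjusted per-test level
    z_crit = stats.norm.ppf(1 - threshold / 2)
    # Analytic power of the two-sided z-test at the adjusted threshold
    power = (1 - stats.norm.cdf(z_crit - effect * np.sqrt(n))
             + stats.norm.cdf(-z_crit - effect * np.sqrt(n)))
    print(f"m = {m:>7}: per-test alpha = {threshold:.2e}, power ~ {power:.3f}")

Under these assumptions the printed power drops from roughly 0.77 at m = 10 to below 0.1 at m = 100,000, which is the inefficiency the abstract refers to and which motivates alternatives such as the nonparametric regression approach proposed in the talk.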
