Abstract

The size of data sets produced in genetic experiments is steadily increasing. Very often there are many more variables than observations, leading to the so-called ``large $p$, small $n$'' problem. For such data, clustering and distance-based procedures are useful tools for identifying groups of variables associated with outcomes of interest. We develop a novel approach that uses mutual information as a measure of distance (here, dependency) between probability distributions; it is valid for comparisons between pairs of variables that are both continuous, both discrete, or one of each. This yields an overall information matrix that can be used as a distance matrix in clustering procedures and to define a weighted network of associations between variables. We present computational aspects of implementing our procedures in R.
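
For orientation, the mutual information referred to in the abstract can be written out for the three cases it mentions. These are the standard textbook definitions, not formulas taken from the authors' work; the symbols $p$ (probability mass function), $f$ (density), $X$ and $Y$ are notation assumed here for illustration.

\[
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p_X(x)\,p_Y(y)} \qquad \text{(both discrete)}
\]

\[
I(X;Y) = \int\!\!\int f(x,y)\,\log\frac{f(x,y)}{f_X(x)\,f_Y(y)}\,dx\,dy \qquad \text{(both continuous)}
\]

\[
I(X;Y) = \sum_{x} p_X(x)\int f(y \mid x)\,\log\frac{f(y \mid x)}{f_Y(y)}\,dy \qquad \text{($X$ discrete, $Y$ continuous)}
\]

In each case $I(X;Y) \ge 0$, with equality exactly when $X$ and $Y$ are independent, so larger values indicate stronger dependence. How the densities are estimated, and how the resulting information matrix is turned into a distance for clustering, is not stated in the abstract and is left open here.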

