We are most pleased that Statnikov et al. (2010) could utilize the gene-expression data from Zaas et al. (2009) to develop a similar gene-expression signature for viral infection. It is not surprising that microarray data can yield predictors that are not unique and signatures of comparable predictive performance. Ein-Dor et al. (2005) identified at least eight separate gene sets whose expression patterns stratify the same patients into comparable risk categories for breast cancer, suggesting that there are multiple forms of information in a single data set that can stratify patients equally well. The fundamental reason is the diversity, scale, and complexity of the genomic factors defining the disease state.

We appreciate Statnikov et al.'s suggestions for improving the analysis and respond below in the order in which they were raised. First, genes were selected based on coherent expression across the entire data set. Fitting the factor model was done without regard to phenotype. The appropriate factor from which to select genes was clear regardless of the samples used in cross-validation. Genes were selected and the model was built on the training data set only. Second, it is possible that the gene-expression patterns at baseline are predictive of symptomatic versus asymptomatic status at the time of maximum symptoms (T).
If this were the case, however, the effect would work against our ability to generate a model that distinguishes symptomatic subjects from those who are asymptomatic or in the baseline healthy state. Thus, we are able to distinguish symptomatic from asymptomatic/baseline despite, rather than because of, any effect linking samples from baseline and T. Third, our analysis was purely unsupervised: no use of the phenotype data was made in the process of fitting the factor models. The hierarchical structure of the Bayesian model includes the estimation of factor size, which is the proper technique for accounting for false discovery in this context. The choice of 30 genes was arbitrary, but this choice does not in any way lessen the value of the result. Our goal was to present a gene set in the factors that might be amenable to a clinical assay such as RT-PCR. Fourth, Statnikov et al. are correct that only one of the three viruses was in Ramilo et al. (2007). However, the goal of this aspect of our analysis was to show that the “acute respiratory viral signature” could identify one of the relevant viruses in the training set. Moreover, the Ramilo data set was generated on a microarray platform independent of ours. The fact that we could achieve such a good validation speaks to the robust nature of our acute respiratory signature. It should also be noted that we did treat each of the viral data sets as a validation of the others (closer to “apples to apples”), and we are not aware of the existence of any public data set that is more appropriate for validation than Ramilo.
Fifth, our work was not designed to demonstrate a complete analysis of the data, but rather to demonstrate the possibility of distinguishing infectious agents based on host response.

References

Ein-Dor, L., Kela, I., Getz, G., Givol, D., and Domany, E. Bioinformatics. 2005; 21: 171–178
Ramilo, O., Allman, W., Chung, W., Mejias, A., Ardura, M., Glaser, C., Wittkowski, K.M., Piqueras, B., Banchereau, J., Palucka, A.K. et al. Blood. 2007; 109: 2066–2077
Statnikov, A., McVoy, L., Lytkin, N., and Aliferis, C.F. Cell Host Microbe. 2010; 7: 100–101
Zaas, A., Chen, M., Varkey, J., Veldman, T., Hero, A.O. III, Lucas, J., Huang, Y., Turner, R., Gilbert, A., Lambkin-Williams, R. et al. Cell Host Microbe. 2009; 6: 207–217