Abstract

We summarize, review and comment upon three papers which discuss the use of discrete, noisy, incomplete, scattered pairwise dissimilarity data in statistical model building. Convex cone optimization codes are used to embed the objects into a Euclidean space which respects the dissimilarity information while controlling the dimension of the space. A “newbie” algorithm is provided for embedding new objects into this space. This allows the dissimilarity information to be incorporated into a smoothing spline ANOVA penalized likelihood model, a support vector machine, or any model that will admit reproducing kernel Hilbert space components, for nonparametric regression, supervised learning, or semisupervised learning. Future work and open questions are discussed. The papers are: (1) Lu, F., Keles, S., Wright, S., Wahba, G., 2005a. A framework for kernel regularization with application to protein clustering. Proc. Natl. Acad. Sci. 102, 12332–12337. (2) Corrada Bravo, G., Wahba, G., Lee, K., Klein, B., Klein, R., Iyengar, S., 2009. Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models. Proc. Natl. Acad. Sci. 106, 8128–8133. (3) Lu, F., Lin, Y., Wahba, G., 2005b. Robust manifold unfolding with kernel regularization. Technical Report 1008, Department of Statistics, University of Wisconsin-Madison.
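As a concrete illustration of the kind of optimization the abstract refers to, the sketch below fits a positive semidefinite kernel to observed pairwise dissimilarities using an absolute-deviation fit term plus a trace penalty, and then reads low-dimensional Euclidean coordinates off its leading eigenvectors; a simple least-squares stand-in for the "newbie" step places a new object in the fitted coordinates. This is only a minimal sketch of the general idea: the function names, the cvxpy/scipy tooling, and the newbie stand-in are illustrative assumptions, not the code or exact formulation used in the cited papers.

```python
# Hedged sketch of kernel embedding from noisy, incomplete pairwise
# dissimilarities (not the authors' code; formulation details are assumptions).
import numpy as np
import cvxpy as cp
from scipy.optimize import minimize


def embed_dissimilarities(dissim_pairs, n, lam=1.0, dim=3):
    """Fit a PSD kernel K so that K_ii + K_jj - 2*K_ij tracks the observed
    dissimilarities (treated here as squared-distance targets); the trace
    penalty lam controls the effective dimension of the embedding."""
    K = cp.Variable((n, n), PSD=True)
    # Robust l1 fit over the observed pairs only, so the data may be incomplete.
    fit = sum(cp.abs(K[i, i] + K[j, j] - 2 * K[i, j] - d)
              for i, j, d in dissim_pairs)
    prob = cp.Problem(cp.Minimize(fit + lam * cp.trace(K)))
    prob.solve()
    # Leading eigenvectors of the fitted kernel give Euclidean coordinates.
    w, V = np.linalg.eigh(K.value)
    top = np.argsort(w)[::-1][:dim]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))


def embed_newbie(X, d_new):
    """Least-squares stand-in for the 'newbie' step: place one new object in
    the existing coordinates X given its dissimilarities d_new (on the same
    squared-distance scale) to the already-embedded objects."""
    def loss(x):
        return np.sum((np.sum((X - x) ** 2, axis=1) - d_new) ** 2)
    return minimize(loss, X.mean(axis=0)).x
```

The fitted coordinates (or the kernel itself) can then be supplied as a reproducing kernel Hilbert space component in a smoothing spline ANOVA model, a support vector machine, or another penalized likelihood model, as described in the abstract.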
