Pattern Recognition Studies of Complex Chromatographic Data Sets.

P.C. Jurs, B.K. Lavine, T.R. Stouch

doi:10.6028/jres.090.059

Abstract

Chromatographic fingerprinting of complex biological samples is an active research area with a large and growing literature. Multivariate statistical and pattern recognition techniques can be effective methods for the analysis of such complex data. However, the classification of complex samples on the basis of their chromatographic profiles is complicated by two factors: 1) confounding of the desired group information by experimental variables or other systematic variations, and 2) random or chance classification effects with linear discriminants. We will treat several current projects involving these effects and methods for dealing with them. Complex chromatographic data sets often contain information dependent on experimental variables as well as information that differentiates between classes. The existence of these types of complicating relationships is an innate part of fingerprint-type data. ADAPT, an interactive computer software system, has the clustering, mapping, and statistical tools necessary to identify and study these effects in realistically large data sets. In one study, pattern recognition analysis of 144 pyrochromatograms (PyGCs) from cultured skin fibroblasts was used to differentiate cystic fibrosis carriers from presumed normal donors. Several experimental variables (donor gender, chromatographic column number, etc.) were involved in relationships that had to be separated from the sought relationships. Notwithstanding these effects, discriminants were developed from the chromatographic peaks that assigned a given PyGC to its respective class (CF carrier vs. normal) largely on the basis of the desired pathological difference. In another study, gas chromatographic profiles of cuticular hydrocarbon extracts obtained from 179 fire ants were analyzed using pattern recognition methods to seek relations with social caste and colony. Confounding relationships were studied by logistic regression.
The data analysis techniques used in these two example studies will be presented. Previously, Monte Carlo simulation studies were carried out to assess the probability of chance classification for nonparametric and parametric linear discriminants. The level of expected chance classification was examined as a function of the number of observations, the dimensionality, and the class membership distributions. These simulation studies established limits on the approaches that can be taken with real data sets so that chance classifications are improbable.
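
The kind of Monte Carlo experiment described above can be illustrated with a minimal sketch. It is not the authors' procedure or part of ADAPT: it assumes Gaussian random descriptors that carry no class information, a least-squares linear discriminant with an intercept as a stand-in for the discriminants studied, and hypothetical names (`chance_classification_rate`, `frac_class1`). It estimates how well purely random data can be "classified" for a given number of observations, dimensionality, and class split.

```python
import numpy as np


def chance_classification_rate(n_obs, n_dim, n_trials=200, frac_class1=0.5, seed=0):
    """Mean training accuracy of a least-squares linear discriminant
    fitted to random data that carries no class information.

    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Fixed class membership: the first n_class1 observations are class +1,
    # the rest are class -1. Labels are independent of the descriptors.
    n_class1 = int(round(frac_class1 * n_obs))
    labels = np.where(np.arange(n_obs) < n_class1, 1.0, -1.0)

    rates = []
    for _ in range(n_trials):
        # Random descriptors: any separation found is due to chance alone.
        X = rng.standard_normal((n_obs, n_dim))
        # Append a constant column so the discriminant has an intercept.
        Xa = np.hstack([X, np.ones((n_obs, 1))])
        # Least-squares fit of the weight vector to the +1/-1 labels.
        w, *_ = np.linalg.lstsq(Xa, labels, rcond=None)
        predicted = np.where(Xa @ w >= 0.0, 1.0, -1.0)
        rates.append(np.mean(predicted == labels))
    return float(np.mean(rates))


if __name__ == "__main__":
    # Chance classification grows as dimensionality approaches the
    # number of observations.
    for n_dim in (2, 10, 25, 50):
        rate = chance_classification_rate(n_obs=100, n_dim=n_dim)
        print(f"n_obs=100, n_dim={n_dim}: mean fraction correct = {rate:.3f}")
```

Under these assumptions, the fraction of training observations classified correctly rises with the ratio of dimensionality to observations, which is why such simulations can set limits on how many descriptors a discriminant may use for a data set of a given size.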

Full Text