Sesquiterpene Lactones-Based Classification of the Family Asteraceae Using Neural Networks and k-Nearest Neighbors

Dimitar Hristozov, Johann Gasteiger, Fernando B. Da Costa

doi:10.1021/ci060046x

Abstract

In a recent publication we described the application of an unsupervised learning method, self-organizing maps, to the separation of three tribes and seven subtribes of the plant family Asteraceae based on a set of sesquiterpene lactones (STLs) isolated from individual species. In the present work, two structure representations, atom counts (2D) and radial distribution function (RDF) codes (3D), and two supervised classification methods, counterpropagation neural networks and k-nearest neighbors (k-NN), were used to predict the tribe in which a given STL occurs. The data set was extended from 144 to 921 STLs, and the number of Asteraceae tribes was increased from three to seven. The k-NN classifier with k = 1 showed the best performance, and the RDF code outperformed the atom counts. The quality of the obtained model was assessed with two test sets, which exemplified two possible applications: (1) finding a plant source for a desired compound and (2) using the chemical profile (STLs) of a plant species to (a) study the relationship between the current taxonomic classification and the plant's chemistry and (b) assign the species to a tribe by majority vote. In addition, the problem of defining the applicability domain of the models was addressed by means of two different approaches: principal component analysis combined with the Hotelling T² statistic, and an a posteriori probability-based rule.
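
The two ideas named in the abstract, a k-NN classifier with k = 1 and the assignment of a species to a tribe by majority vote over its STLs, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The descriptor vectors are made-up three-component placeholders rather than real atom counts or RDF codes, the labels `tribe_A` and `tribe_B` are hypothetical, Euclidean distance is an assumed similarity measure, and the function names are introduced here for illustration only.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def predict_tribe_1nn(query, training_set):
    """Return the tribe label of the training compound nearest to `query`.

    `training_set` is a list of (descriptor, tribe) pairs. Euclidean
    distance is an assumption; the abstract does not name the metric.
    """
    _, nearest_tribe = min(training_set, key=lambda item: dist(query, item[0]))
    return nearest_tribe


def assign_species(stl_descriptors, training_set):
    """Assign a species to a tribe by majority vote over its STLs.

    Each STL of the species is classified with 1-NN, and the most
    frequent predicted tribe wins. Ties are not handled in any
    principled way in this sketch.
    """
    votes = Counter(predict_tribe_1nn(d, training_set) for d in stl_descriptors)
    return votes.most_common(1)[0][0], votes


# Toy training data: placeholder descriptors with hypothetical tribe labels.
TRAINING_SET = [
    ((0.0, 0.1, 0.9), "tribe_A"),
    ((0.1, 0.0, 1.0), "tribe_A"),
    ((0.9, 0.8, 0.1), "tribe_B"),
    ((1.0, 0.9, 0.0), "tribe_B"),
]

# Chemical profile of one hypothetical species: three STL descriptors.
species_profile = [(0.05, 0.05, 0.95), (0.2, 0.1, 0.8), (0.95, 0.85, 0.05)]

tribe, votes = assign_species(species_profile, TRAINING_SET)
print(tribe, dict(votes))
```

In this toy profile two of the three compounds fall nearest to `tribe_A` training compounds, so the vote assigns the species to `tribe_A`. A single-compound query, as in the first application listed in the abstract, would call `predict_tribe_1nn` directly.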

Full Text