Abstract

A data set of 297 diverse organic compounds that cause varying degrees of chromosomal aberrations in Chinese hamster lung cells is examined. Assay responses are categorized as clastogenic (>10% aberrant cells) or nonclastogenic (<5% aberrant cells). Each compound is represented by calculated structural descriptors that encode topological, geometric, electronic, and polar surface features. A genetic algorithm (GA) with a k-nearest neighbor (kNN) fitness evaluator iteratively searches a reduced descriptor space for small, information-rich subsets of descriptors that maximize the classification rates for clastogenic and nonclastogenic responses. To further improve modeling, a similarity measure based on atom-pair descriptors is used to create more homogeneous data subsets. Three data sets are examined. For the full set of 297 compounds, the GA-kNN method gives 86.5% and 80.0% correct classification for the training and prediction sets, respectively. For a subset of 279 compounds (model 2), the rates are 85.7% and 85.7% for the training and prediction sets, respectively. For a subset of 182 compounds (model 3), the rates are 91.5% and 94.4% for the training and prediction sets, respectively. Creating smaller, more topologically similar data sets results in improved classification rates.
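The GA-kNN search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the value of k, the GA operators, the population settings, or the validation scheme, so the leave-one-out scoring, the crossover and mutation rules, all parameter values, and the function names `knn_loo_accuracy` and `ga_select` are assumptions made for illustration. The data are synthetic, not the 297-compound set.

```python
import random

def knn_loo_accuracy(X, y, subset, k=3):
    """Leave-one-out kNN accuracy using only the descriptor columns in `subset`.
    (Assumed fitness function; the abstract does not specify the scoring scheme.)"""
    correct = 0
    for i, xi in enumerate(X):
        dists = []
        for j, xj in enumerate(X):
            if j == i:
                continue
            d = sum((xi[c] - xj[c]) ** 2 for c in subset)  # squared Euclidean distance
            dists.append((d, y[j]))
        dists.sort(key=lambda t: t[0])
        votes = sum(label for _, label in dists[:k])  # labels are 0 or 1
        pred = 1 if 2 * votes > k else 0              # majority vote
        correct += (pred == y[i])
    return correct / len(X)

def ga_select(X, y, subset_size=2, pop_size=20, generations=15, k=3, seed=0):
    """Toy genetic algorithm over fixed-size descriptor subsets (all settings hypothetical)."""
    rng = random.Random(seed)
    n_desc = len(X[0])

    def fitness(subset):
        return knn_loo_accuracy(X, y, subset, k)

    pop = [tuple(sorted(rng.sample(range(n_desc), subset_size)))
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]              # keep the fitter half (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = list(set(a) | set(b))              # crossover: draw from both parents
            child = rng.sample(pool, subset_size)
            if rng.random() < 0.2:                    # mutation: swap in a new descriptor
                unused = [c for c in range(n_desc) if c not in child]
                if unused:
                    child[rng.randrange(subset_size)] = rng.choice(unused)
            children.append(tuple(sorted(child)))
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

# Synthetic example: 40 "compounds", 6 descriptors; only columns 1 and 4 carry class signal.
data_rng = random.Random(42)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        row = [data_rng.random() for _ in range(6)]   # uninformative descriptors
        row[1] = data_rng.gauss(5.0 * label, 0.5)     # informative descriptor
        row[4] = data_rng.gauss(5.0 * label, 0.5)     # informative descriptor
        X.append(row)
        y.append(label)

best, acc = ga_select(X, y)
print("selected descriptor columns:", best, "leave-one-out accuracy:", acc)
```

Scoring each candidate subset by kNN accuracy keeps the fitness evaluation free of model fitting, which is one reason this pairing suits a search that must evaluate many subsets. In a real study the fitness would be computed on the training set only, with the prediction set held out for the final estimate.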

