Abstract

A data set of 297 diverse organic compounds that cause varying degrees of chromosomal aberrations in Chinese hamster lung cells is examined. Assay responses are categorized as clastogenic (>10% aberrant cells) or nonclastogenic (<5% aberrant cells). Each compound is represented by calculated structural descriptors that encode topological, geometric, electronic, and polar surface features. A genetic algorithm (GA) employing a k-nearest neighbor (kNN) fitness evaluator is used to iteratively search a reduced descriptor space for small, information-rich subsets of descriptors that maximize the classification rates for clastogenic and nonclastogenic responses. To further improve modeling, a similarity measure based on atom-pair descriptors is employed to create more homogeneous data subsets. Three different data sets are examined. For the full set of 297 compounds, the GA-kNN method gives 86.5% and 80.0% correct classification in the training and prediction sets, respectively. For a subset of 279 compounds (model 2), the rates are 85.7% and 85.7% for the training and prediction sets, respectively. For a subset of 182 compounds (model 3), the rates are 91.5% and 94.4% for the training and prediction sets, respectively. Creating smaller, more topologically similar data sets results in improved classification rates.
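
The GA-kNN search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the population size, mutation rate, subset size, leave-one-out scoring, and the function names `knn_loo_accuracy` and `ga_select` are all assumptions made for illustration, and the demo data are synthetic rather than the clastogenicity data set.

```python
import random
from collections import Counter


def knn_loo_accuracy(X, y, features, k=3):
    """Leave-one-out kNN classification rate using only the chosen descriptor columns."""
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other compound in the reduced space.
        dists = [
            (sum((xi[f] - xj[f]) ** 2 for f in features), y[j])
            for j, xj in enumerate(X)
            if j != i
        ]
        dists.sort(key=lambda t: t[0])
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


def ga_select(X, y, subset_size=2, pop_size=20, generations=15, k=3, seed=0):
    """Hypothetical GA: evolve fixed-size descriptor subsets scored by kNN accuracy."""
    rng = random.Random(seed)
    n_desc = len(X[0])

    def fitness(subset):
        return knn_loo_accuracy(X, y, subset, k)

    pop = [tuple(sorted(rng.sample(range(n_desc), subset_size)))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half unchanged (elitism) and breed the rest from it.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw the child's descriptors from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), subset_size)
            if rng.random() < 0.2:  # assumed mutation rate
                child[rng.randrange(subset_size)] = rng.randrange(n_desc)
            child = set(child)
            while len(child) < subset_size:  # repair duplicates introduced by mutation
                child.add(rng.randrange(n_desc))
            children.append(tuple(sorted(child)))
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    # Synthetic demo: descriptor 0 separates the classes, the others are noise.
    rng = random.Random(1)
    y = [i % 2 for i in range(40)]
    X = [[label * 3 + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(5)]
         for label in y]
    subset, rate = ga_select(X, y)
    print("selected descriptors:", subset, "classification rate:", rate)
```

In this sketch the fitness is a single overall leave-one-out rate; a closer analogue of the abstract's setup would score a held-out prediction set separately and could weight the clastogenic and nonclastogenic rates individually.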
