Abstract
Drug discovery relies heavily on data processing. Virtual screening (VS) is a common drug discovery method that examines chemical structures (molecules) to identify those likely to bind to a particular drug target. VS can be framed as either a matching problem or a classification problem, and in both cases the quality of the data matters greatly. The number of features (and their properties) and data imbalance are common problems in the chemical datasets used for VS. This paper investigates how to deal with these two problems in order to improve the accuracy of VS and, specifically, to reduce the false positive rate. On the one hand, we use the synthetic minority oversampling technique (SMOTE) to balance the data; on the other hand, we investigate different molecular descriptors and fingerprints as features. A classification approach is used to assess the performance of four chosen classifiers, first individually and then in combination. As an alternative, an instance-based approach is employed to observe its effect on accuracy. Results from the classification approach show that higher accuracy and a lower false positive rate can be achieved by first balancing the datasets with SMOTE and then classifying them. The effects of descriptors and fingerprints on accuracy and false positive rate can only be discussed for each dataset separately. Combining distance matrices of different structural fingerprints does not cause active and similar compounds to appear at the top of the dissimilarity ranking.
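To make the balancing step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. It is an illustration only, not the implementation used in the paper. The function name `smote`, its parameters, and the toy two-dimensional "actives" and "inactives" are all assumptions made for this example; real VS features would be descriptor or fingerprint vectors.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic new points interpolated between minority samples.

    Hypothetical helper for illustration: `minority` is a list of
    equal-length numeric feature vectors from the under-represented class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        # Interpolate at a random position between the two points.
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (assumed): 4 "active" compounds versus 10 "inactive" ones.
actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inactives = [[5.0 + i, 5.0] for i in range(10)]

# Oversample the actives until both classes have the same size.
new_actives = smote(actives, len(inactives) - len(actives), k=3)
balanced_actives = actives + new_actives
```

Because every synthetic point is an interpolation between two existing minority samples, it stays within the region already occupied by that class. In practice only the training split would be oversampled, so that evaluation is carried out on real compounds only.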