Abstract

Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high-quality dataset is a vital step in realizing this goal, and this work aims to compile such a comprehensive chemogenomics dataset. The dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL), including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data, not only for building predictive models of in silico polypharmacology and off-target effects but also for the validation of cheminformatics approaches in general.

Highlights

  • There has been a remarkable increase in the amount of available compound structure-activity relationship (SAR) data, driven mainly by the development of high throughput screening (HTS) technologies and combinatorial chemistry for compound synthesis [3].

  • These SAR data points represent an important resource for chemogenomics modelling, a computational strategy in drug discovery that investigates the interactions of a large set of compounds with families of functionally related proteins [4].

Summary

Introduction

There has been a remarkable increase in the amount of available compound structure-activity relationship (SAR) data, driven mainly by the development of high throughput screening (HTS) technologies and combinatorial chemistry for compound synthesis [3]. These SAR data points represent an important resource for chemogenomics modelling, a computational strategy in drug discovery that investigates the interactions of a large set of compounds (one or more libraries) with families of functionally related proteins [4]. ChEMBL contains data that were manually extracted from numerous peer-reviewed journal articles, as do WOMBAT [9], BindingDB [6], and CARLSBAD [10]. Commercial databases, such as SciFinder [11], GOSTAR [12] and Reaxys [13], have accumulated a large amount of data from publications as well as patents. Large pharmaceutical companies maintain their own data collections originating from in-house HTS campaigns and drug discovery projects.
