Future Medicinal Chemistry, Vol. 2, No. 6
Editorial

Role of open chemical data in aiding drug discovery and design

Anna Gaulton† and John P Overington
EMBL – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
† Author for correspondence: anna.gaulton@ebi.ac.uk

Published online: 14 June 2010
https://doi.org/10.4155/fmc.10.191

Drug-discovery data

Researchers in large pharmaceutical companies typically draw on a wide range of data resources and tools to enable decisions regarding target selection, lead identification, optimization and candidate selection. Much of this information is either generated internally or licensed from commercial vendors. For example, access to large sets of screening results, patent databases and databases of clinical candidates can be used to identify chemical tools or leads for a target of interest or to assess competitive position. Additional data classes add incrementally to this view. For example, internally generated crystal structures, complexed with drug-like ligands, provide valuable information for structure-based drug design and lead optimization. Large numbers of absorption, distribution, metabolism and excretion (ADME) and toxicity measurements also allow the building of predictive models to prioritize compounds, select the best candidates for further development and attempt to minimize the risks of potential adverse effects.
By contrast, academic researchers have typically had to rely on a far smaller number of public-domain resources, together with information scattered across the literature. Access to large chemical and pharmacological datasets has previously been limited, in part due to concerns about the potential loss of intellectual property associated with disclosing compound structures.

Public data & precompetitive initiatives

In recent years, however, there has been a significant increase in the availability of large-scale open data for drug discovery. In particular, the number and size of screening databases have expanded significantly. Initiatives such as the NIH Molecular Libraries Program [1] and the Broad Institute’s Chemical Biology Platform are making high-throughput screening (HTS) capabilities and the resulting primary data more widely available to academic groups. These data are typically fed into public databases such as PubChem [2] and ChemBank [3]. Plans are also underway for a similar European infrastructure project (EU-openscreen), which aims to connect a network of screening centres across Europe and provide access to the results via a common European chemical biology database [101]. In addition, the recent transfer of the ChEMBL database from the private sector into the public domain [102] will supplement existing activity databases such as BindingDB [4], IUPHAR-DB [5] and PDSP Ki [103]. Publishers could also play a part in making data accessible by setting policies for deposition of screening data into public repositories (as is currently the case for sequence and protein structure data) and by helping to standardize the way such data are reported. Nature Chemical Biology, for example, has already produced guidelines for the submission of screening data [6]. In the area of toxicity, several public screening initiatives are also underway, including the EPA ToxCast [7,8] and Tox21 [104] projects.
While these efforts are primarily focused on environmental chemicals, the resulting data may still be informative in a drug-discovery context.

In addition to screening and bioactivity information, there is now an increasing number of large chemical structure repositories, providing access to tens of millions of compounds for applications such as virtual screening [9] (e.g., PubChem, ZINC [10] and GDB-13 [11]). Several other public-domain databases containing drug discovery-relevant information are also being developed – for example, DrugBank [12] and DailyMed [105] provide information regarding approved drugs, ClinicalTrials.gov [106] provides data on clinical-stage experimental drugs, and DSSTox [13,14] and TOXNET [15] collate toxicity information from a wide range of public sources.

The increasing availability of public data coincides with initiatives in the pharmaceutical industry aimed at reducing costs, for example via increased outsourcing and engagement in precompetitive activities. The Pistoia Alliance (a not-for-profit consortium of pharmaceutical companies, institutes and technology vendors, established for the purpose of brokering common precompetitive needs [16]) and the European Innovative Medicines Initiative [17] are both helping to drive further development and integration of tools and databases within the public domain. Public–private partnerships, such as the Structural Genomics Consortium-led chemical probes initiative [18], are becoming increasingly common and, further to this, pharmaceutical companies are starting to release some of their own formerly proprietary data. GlaxoSmithKline, for instance, has recently announced that it will make a large dataset of 13,500 compounds with antimalarial activity publicly available [107].
It is expected that other companies will follow this lead.

The impact of open data

The availability of public large-scale datasets is likely to have a significant impact on academic, not-for-profit and industrial drug discovery. First, groups will gain access to the data they need for individual projects, for example for the rapid identification of high-quality tool compounds to help validate targets or profile disease models. Second, and perhaps more important, the datasets will encourage the development of new tools and predictive algorithms within the public domain, benefiting the widest possible community. A parallel can perhaps be seen in the vast array of bioinformatics tools and methods developed for functional annotation of proteins following the exponential growth in deposition of sequence and structure data since the early 1990s. A similar explosion of research in chemoinformatics and computational chemical biology, with matching investment of funding, may help address many of the unmet needs in drug discovery and design. For example, databases of launched drugs and medicinal chemistry compounds could be mined to discover key properties and rules related to successful drugs or to identify possible lead-optimization strategies and tactics. Large bioactivity datasets can be used to derive panels of quantitative structure–activity relationship (QSAR) or classification models, allowing prediction of compound activity from structure. Such predictions can contribute to the elucidation of the molecular targets of phenotypic assays, prediction or explanation of drug side effects and identification of potential drug repurposing opportunities through optimization of alternative activities.
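As a rough illustration of what a classification model that predicts activity from structure can look like in its simplest form, the sketch below implements a nearest-neighbour vote based on Tanimoto similarity between fingerprint bit sets. It is only a minimal sketch under stated assumptions: the fingerprints and activity labels are invented toy data, the function names `tanimoto` and `knn_predict` are hypothetical, and none of it is taken from the resources or methods cited in this article.

```python
from collections import Counter


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def knn_predict(query: frozenset, training: list, k: int = 3) -> str:
    """Majority vote among the k training compounds most similar to the query."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each integer stands for one structural feature bit.
# Real fingerprints would come from a cheminformatics toolkit, not from here.
TRAINING = [
    (frozenset({1, 2, 3, 4}), "active"),
    (frozenset({1, 2, 3, 5}), "active"),
    (frozenset({2, 3, 4, 6}), "active"),
    (frozenset({7, 8, 9}), "inactive"),
    (frozenset({7, 8, 10}), "inactive"),
    (frozenset({8, 9, 11}), "inactive"),
]

if __name__ == "__main__":
    print(knn_predict(frozenset({1, 2, 3, 6}), TRAINING))
    print(knn_predict(frozenset({7, 9, 10}), TRAINING))
```

With an odd `k` and two classes the vote cannot tie; for other settings a tie-breaking rule would need to be chosen. As with any such model, the predictions are only as reliable as the training data behind them.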
Identification of new leads may also be accomplished through the application of structure-based virtual-screening methods such as docking, or through pharmacophore- or molecular similarity-based methods.

However, with all predictive methods, the quality and relevance of the training data are paramount in determining the accuracy and applicability domain of the resulting models. HTS results, for example, are often uncurated and typically have a relatively high false-positive rate. Dose–response studies in the published literature do not always adequately report negative results. Chemical structures are often depicted or named incorrectly. As datasets become more readily available, we will see the emphasis move away from raw quantity and towards the quality, indexing and organization of data. Indeed, many analyses are already being published that assess the quality of public screening libraries and identify promiscuous or reactive compounds that could be responsible for many of the false-positive results [19,20], or that investigate the accuracy of compound structures in various repositories [21]. Progress within just this one area will have a profound impact on the discovery rate of genuinely useful chemical probes as starting points for the development of novel and safe therapeutics. With the increasingly rapid growth of these public-domain sources, ensuring quality and interoperability is going to pose significant challenges.

Accompanying the growth of open data and associated research activities, we are also starting to see growth in the availability of open-source tools for chemical data processing and analysis.
For example, toolkits and workflow tools such as CDK/Taverna [22], Bioclipse [23], RDKit [108], KNIME [24] and OpenBabel [109] are gaining in popularity, allowing scientists to tap into the increasing number of available resources and facilitating data-mining efforts without investment in expensive commercial software – this mirrors projects such as BioPerl for the bioinformatics research community. Similarly, efforts are underway to better integrate disparate chemical and drug-discovery data sources [25,26] and to improve interoperability through the development of standards (e.g., the use of the InChI representation for chemical structures [27]). Further emphasis in this area will be essential to promote maximal utility of the data.

The changing face of drug discovery

Perhaps a logical extension of many of the developments discussed above is their role as a catalyst for collaboration between different groups and organizations on the actual process of drug discovery. While in most areas this poses questions around retention of intellectual property, several collaborative efforts are already underway in the area of neglected disease research. Not-for-profit organizations such as the Medicines for Malaria Venture and the Drugs for Neglected Diseases initiative have already been established for this purpose, and a growing number of public collaborative drug-discovery resources are being established (e.g., the TDR Targets database [28] and The Synaptic Leap [110]).

In order for collaborative and academic drug-discovery efforts to really succeed, however, researchers will need access to the full range of tools and data available to those in industry. While this is becoming increasingly possible, datasets in some areas are still lacking. There is still only a limited amount of public information regarding the ADME properties of compounds, for example [29].
Without such data and the development of good-quality ADME models, potential lead compounds may lack the properties required for good bioavailability in vivo and may subsequently fail in early development. The pharmaceutical industry has also invested much time and money into identifying and eliminating causes of toxicity but, again, much of this information is not publicly available, meaning mistakes of the past risk being repeated. Finally, a large body of chemical structure, synthesis and pharmacology information is contained only within patent documents. Though these documents are readily available online, they are not in a suitably structured form for large-scale searching and analysis. Some efforts are underway to facilitate indexing of these documents. For example, OSRA is an open-source tool for conversion of graphical representations of compounds in documents into computer-readable formats, allowing images in patents to be extracted and searched by structure [30]. However, the extraction of other valuable data from patent texts remains a nontrivial task. Arguably, tackling this data-accessibility gap within the public domain could result in huge benefits in productivity and efficiency.

Future perspective

Formerly, the billions of dollars spent annually on research within the pharmaceutical industry provided industrial researchers with unparalleled access to critical tools and resources that were largely beyond the reach of academics, not-for-profits and SMEs. However, it is now becoming clear that this business model of drug-discovery research and development is not sustainable or cost effective [31], and we are seeing the drug-discovery industry, together with data publishers and funding agencies, adopt new business models based on increased outsourcing, collaborative skills transfer and precompetitive activities [32,33].
Ultimately, as the volume and quality of open data increase, we are likely to see a growth in academic and collaborative drug discovery. There is also likely to be an increase in the number of small biotechnology/pharmaceutical companies, accompanied by a decrease in the amount of research carried out within the closed walls of large pharmaceutical companies; this trend will depend crucially on facile access to enabling data. Hopefully, a benefit of this change in model will be greater levels of innovation and a boost to the dwindling productivity of the drug-discovery industry as a whole.

Acknowledgements

The authors wish to thank the Wellcome Trust for a Strategic Award and the EMBL-EBI for additional support. We are grateful to the referees of this paper for their suggestions and improvements.

Financial & competing interests disclosure

The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

No writing assistance was utilized in the production of this manuscript.

Bibliography

Papers of special note have been highlighted as: ▪ of interest; ▪▪ of considerable interest

1 Austin CP, Brady LS, Insel TR, Collins FS. NIH Molecular Libraries Initiative. Science 306(5699), 1138–1139 (2004).
2 Wang Y, Bolton E, Dracheva S et al. An overview of the PubChem BioAssay resource. Nucleic Acids Res. 38(Database issue), D255–266 (2010).
3 Seiler KP, George GA, Happ MP et al. ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic Acids Res. 36(Database issue), D351–D359 (2008).
4 Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK.
BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 35(Database issue), D198–201 (2007).
5 Harmar AJ, Hills RA, Rosser EM et al. IUPHAR-DB: the IUPHAR database of G protein-coupled receptors and ion channels. Nucleic Acids Res. 37(Database issue), D680–685 (2009).
6 Inglese J, Shamu C, Gu R. Reporting data from high-throughput screening of small-molecule libraries. Nat. Chem. Biol. 3(8), 438–441 (2007).
  ▪▪ Important article calling for journals to enforce standards for the reporting of screening data.
7 Judson RS, Houck KA, Kavlock RJ et al. In vitro screening of environmental chemicals for targeted testing prioritization: the ToxCast project. Environ. Health Perspect. 118(4), 485–492 (2010).
8 Collins FS, Gray GM, Bucher JR. Toxicology. Transforming environmental health protection. Science 319(5865), 906–907 (2008).
9 Villoutreix BO, Renault N, Lagorce D, Sperandio O, Montes M, Miteva MA. Free resources to assist structure-based virtual ligand screening experiments. Curr. Protein Pept. Sci. 8(4), 381–411 (2007).
10 Irwin JJ, Shoichet BK. ZINC – a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45(1), 177–182 (2005).
11 Blum LC, Reymond JL. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131(25), 8732–8733 (2009).
12 Wishart DS, Knox C, Guo AC et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36(Database issue), D901–D906 (2008).
13 Richard AM.
DSSTox website launch: improving public access to databases for building structure-toxicity prediction models. Preclinica 2, 103–108 (2004).
14 Richard AM, Williams CR. Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat. Res. 499(1), 27–52 (2002).
15 Hochstein C, Arnesen S, Goshorn J. Environmental health and toxicology resources of the United States National Library of Medicine. Med. Ref. Serv. Q. 26(3), 21–45 (2007).
16 Barnes MR, Harland L, Foord SM et al. Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nat. Rev. Drug Discov. 8(9), 701–708 (2009).
  ▪ Important paper describing the aims of pharmaceutical companies in setting up precompetitive initiatives.
17 Hunter AJ. The Innovative Medicines Initiative: a pre-competitive initiative to enhance the biomedical science base of Europe to expedite the development of new medicines for patients. Drug Discov. Today 13(9–10), 371–373 (2008).
18 Edwards AM, Bountra C, Kerr DJ, Willson TM. Open access chemical and clinical probes to support drug discovery. Nat. Chem. Biol. 5(7), 436–440 (2009).
  ▪ Details an important public–private partnership to develop freely available chemical probes for key targets.
19 Feng BY, Simeonov A, Jadhav A et al. A high-throughput screen for aggregation-based inhibition in a large compound library. J. Med. Chem. 50(10), 2385–2390 (2007).
20 Soares KM, Blackmon N, Shun TY et al. Profiling the NIH Small Molecule Repository for compounds that generate H2O2 by redox cycling in reducing environments. Assay Drug Dev. Technol. (2010) in press.
21 Young D, Martin T, Venkatapathy R, Harten P. Are the chemical structures in your QSAR correct? QSAR Comb.
Sci. 27(11–12), 1337–1345 (2008).
  ▪ Informative article highlighting issues with data quality when building quantitative structure–activity relationship models.
22 Kuhn T, Willighagen EL, Zielesny A, Steinbeck C. CDK-Taverna: an open workflow environment for cheminformatics. BMC Bioinformatics 11(1), 159 (2010).
23 Spjuth O, Helmus T, Willighagen EL et al. Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 8, 59 (2007).
24 Berthold MR, Cebron N, Dill F et al. KNIME: The Konstanz Information Miner. In: Data Analysis, Machine Learning and Applications. Preisach C, Schmidt-Thieme L (Eds). Springer-Verlag, Berlin, 319–326 (2008).
25 Jentzsch A, Hassanzadeh O, Bizer C, Andersson B, Stephens S. Enabling tailored therapeutics with linked data. Presented at: The 2nd Workshop about Linked Data on the Web. Madrid, Spain, 20 April 2009.
26 Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inform. 41(5), 706–716 (2008).
27 Heller SR, McNaught AD. The IUPAC international chemical identifier (InChI). Chem. Int. 31(1), 7 (2009).
28 Agüero F, Al-Lazikani B, Aslett M et al. Genomic-scale prioritization of drug targets: the TDR targets database. Nat. Rev. Drug Discov. 7(11), 900–907 (2008).
29 Ekins S, Williams AJ. Precompetitive preclinical ADME/Tox data: set it free on the web to facilitate computational model building and assist drug development. Lab Chip 10(1), 13–22 (2010).
  ▪ Thorough discussion of issues with the availability of absorption, distribution, metabolism, excretion and toxicity data in the public domain and the potential advantages of releasing such data.
30 Filippov IV, Nicklaus MC.
Optical structure recognition software to recover chemical information: OSRA, an open source solution. J. Chem. Inf. Model. 49(3), 740–743 (2009).
31 Munos B. Lessons from 60 years of pharmaceutical innovation. Nat. Rev. Drug Discov. 8(12), 959–968 (2009).
  ▪▪ Interesting and detailed analysis of trends in the productivity of the pharmaceutical industry throughout its history.
32 Melese T, Lin SM, Chang JL, Cohen NH. Open innovation networks between academia and industry: an imperative for breakthrough therapies. Nat. Med. 15(5), 502–507 (2009).
33 Munos BH, Chin WW. A call for sharing: adapting pharmaceutical research to new realities. Sci. Transl. Med. 1(9), 9 (2009).

101 EU OpenScreen. www.eu-openscreen.de
102 Wellcome Trust press release. www.wellcome.ac.uk/News/Media-office/Press-releases/2010/WTX058219.htm
103 PDSP Database. http://pdsp.med.unc.edu/pdsp.php
104 Tox21: Putting a lens on the vision of toxicity testing in the 21st Century. www.alttox.org/ttrc/overarching-challenges/way-forward/austin-kavlock-tice
105 DailyMed. http://dailymed.nlm.nih.gov/dailymed/about.cfm
106 Clinical Trials homepage. www.clinicaltrials.gov
107 GSK announces ‘open innovation’ strategy to help deliver new and better medicines for people living in the world’s poorest countries – press release. www.gsk.com/media/pressreleases/2010/2010_pressrelease_10009.htm
108 RDKit: cheminformatics and machine learning software. www.rdkit.org/
109 Open Babel: the open source toolbox. http://openbabel.org
110 The Synaptic Leap Homepage. www.thesynapticleap.org

Published online 14 June 2010. Published in print June 2010. © Future Science Ltd