Abstract
Digital access to chemical journals has made a vast array of molecular information available in supplementary material files in PDF format. However, extracting this molecular information from a PDF document is a daunting task. Here we present an approach to harvest 3D molecular data from the supporting information of scientific research articles, which is normally available from publishers' resources. To demonstrate the feasibility of extracting truly computable molecules from PDF files in a fast and efficient manner, we have developed a Java based application named ChemEngine. The program recognizes textual patterns in the supplementary data and generates standard molecular structure data (bond matrix, atomic coordinates) that can be subjected to a multitude of computational processes automatically. The methodology has been demonstrated via several case studies on different formats of coordinate data stored in supplementary information files, wherein ChemEngine selectively harvested the atomic coordinates and interpreted them as molecules with high accuracy. The reusability of the extracted molecular coordinate data was demonstrated by computing single point energies that were in close agreement with the original computed data provided with the articles. It is envisaged that the methodology will enable large scale conversion of molecular information from supplementary files in PDF format into a collection of ready-to-compute molecular data, creating an automated workflow for advanced computational processes. The software, along with source code and instructions, is available at https://sourceforge.net/projects/chemengine/files/?source=navbar. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-016-0175-x) contains supplementary material, which is available to authorized users.
Highlights
Harvesting chemical data from the web is a challenging task requiring several convoluted steps
A specific set of publisher-defined guidelines for submitting molecular data, even in Portable Document Format (PDF), would accelerate the automatic processing and recognition of chemical data for further computational studies related to reaction modeling [1,2,3], drug discovery [4,5,6,7] and molecular inventory management [8, 9]
There is a need for tools that can bridge the gap in molecular data translation, converting PDF content automatically and accurately into a truly computable, reusable format without manual intervention
Summary
Harvesting chemical data from the web is a challenging task requiring several convoluted steps. The supporting information for computational research articles describing the transition states of organic reactions is available from journal publishers' websites as PDF files that contain descriptions of the computations performed, tables of results, and images of molecules in 3D conformations alongside the 3D molecular coordinates. Combining all of this in a single file complicates harvesting and the development of pattern recognition techniques that selectively exclude non-coordinate information from the large pool of text presented as supporting material. The coordinate data and bond matrix information are used to create molecules in standard interoperable formats such as .sdf or .mol, giving the user ready-to-compute molecules. This process avoids unnecessary generation of molecular data and laborious recomputation of already published work.
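The two steps described here, separating coordinate lines from surrounding text and deriving a bond matrix from the coordinates, can be illustrated with a minimal Java sketch. It is not ChemEngine's actual implementation. It assumes the text has already been extracted from the PDF, that each atom sits on its own line as an element symbol followed by three decimal numbers, and that two atoms are bonded when their distance is within the sum of their covalent radii plus a tolerance of 0.4 Å. The class name `CoordinateHarvester`, the small radii table, the tolerance, and the sample water geometry are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ChemEngine.
final class CoordinateHarvester {

    /** One atom: element symbol plus Cartesian coordinates in angstroms. */
    static final class Atom {
        final String element;
        final double x, y, z;

        Atom(String element, double x, double y, double z) {
            this.element = element;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    // Assumed line layout: element symbol, then three decimal numbers.
    private static final Pattern ATOM_LINE = Pattern.compile(
        "^\\s*([A-Z][a-z]?)\\s+(-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\s*$");

    // Covalent radii in angstroms; a deliberately small, illustrative subset.
    private static final Map<String, Double> RADII =
        Map.of("H", 0.31, "C", 0.76, "N", 0.71, "O", 0.66);

    // Assumed slack added to the sum of radii, in angstroms.
    private static final double TOLERANCE = 0.4;

    /** Keeps only lines that look like atom records; all other text is skipped. */
    public static List<Atom> extractAtoms(String text) {
        List<Atom> atoms = new ArrayList<>();
        for (String line : text.split("\\R")) {
            Matcher m = ATOM_LINE.matcher(line);
            // Elements missing from the radii table are ignored in this sketch.
            if (m.matches() && RADII.containsKey(m.group(1))) {
                atoms.add(new Atom(m.group(1),
                    Double.parseDouble(m.group(2)),
                    Double.parseDouble(m.group(3)),
                    Double.parseDouble(m.group(4))));
            }
        }
        return atoms;
    }

    /** Symmetric bond matrix from a simple distance criterion. */
    public static boolean[][] bondMatrix(List<Atom> atoms) {
        int n = atoms.size();
        boolean[][] bonded = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                Atom a = atoms.get(i);
                Atom b = atoms.get(j);
                double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
                double distance = Math.sqrt(dx * dx + dy * dy + dz * dz);
                double limit = RADII.get(a.element) + RADII.get(b.element) + TOLERANCE;
                bonded[i][j] = bonded[j][i] = distance <= limit;
            }
        }
        return bonded;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for one page of supporting information.
        String page = String.join("\n",
            "Table S1. Optimized geometry of water",
            "Cartesian coordinates (angstrom)",
            "O   0.000000   0.000000   0.117300",
            "H   0.000000   0.757200  -0.469200",
            "H   0.000000  -0.757200  -0.469200");

        List<Atom> atoms = extractAtoms(page);
        boolean[][] bonds = bondMatrix(atoms);
        for (int i = 0; i < atoms.size(); i++) {
            for (int j = i + 1; j < atoms.size(); j++) {
                if (bonds[i][j]) {
                    System.out.println(atoms.get(i).element + (i + 1)
                        + " - " + atoms.get(j).element + (j + 1));
                }
            }
        }
    }
}
```

A distance criterion of this kind yields connectivity only, not bond orders, and a single fixed line pattern will miss the other coordinate layouts found in real supporting information. A practical tool would need one pattern per layout and a complete radii table.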