ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files.

Muthukumarasamy Karthikeyan, Renu Vyas

doi:10.1186/s13321-016-0175-x

Abstract

Digital access to chemical journals has produced a vast array of molecular information that is now available in supplementary material files in PDF format. However, extracting this molecular information from a PDF document is a daunting task. Here we present an approach to harvest 3D molecular data from the supporting information of scientific research articles that are normally available from publishers’ resources. To demonstrate the feasibility of extracting truly computable molecules from PDF files in a fast and efficient manner, we have developed a Java-based application, namely ChemEngine. This program recognizes textual patterns in the supplementary data and generates standard molecular structure data (bond matrix, atomic coordinates) that can be subjected to a multitude of computational processes automatically. The methodology has been demonstrated via several case studies on different formats of coordinate data stored in supplementary information files, wherein ChemEngine selectively harvested the atomic coordinates and interpreted them as molecules with high accuracy. The reusability of the extracted molecular coordinate data was demonstrated by computing single point energies that were in close agreement with the original computed data provided with the articles. It is envisaged that the methodology will enable large-scale conversion of molecular information from supplementary files available in PDF format into a collection of ready-to-compute molecular data, creating an automated workflow for advanced computational processes. Software, along with source code and instructions, is available at https://sourceforge.net/projects/chemengine/files/?source=navbar.

Electronic supplementary material: The online version of this article (doi:10.1186/s13321-016-0175-x) contains supplementary material, which is available to authorized users.

Highlights

Harvesting chemical data from the web is a challenging task requiring several convoluted steps.
A specific set of publisher-defined guidelines for submitting molecular data, even in portable document format (PDF), would accelerate the automatic processing and recognition of chemical data for further computational studies related to reaction modeling [1,2,3], drug discovery [4,5,6,7] and molecular inventory management [8, 9].
There is a need for tools that can bridge the gap in molecular data translation by converting PDF content automatically and accurately into a truly computable, reusable format without manual intervention.

Summary

Background

Harvesting chemical data from the web is a challenging task requiring several convoluted steps. The supporting information for research articles based on computational methods that describe the transition states of organic reactions is available from journal publishers’ websites in PDF format. It contains descriptions of the computations performed, tables of results, and molecular images in 3D conformations, along with 3D molecular coordinates. Combining these data in a single file complicates the harvesting process and the development of pattern recognition techniques for selectively excluding non-coordinate information from the large pool of textual data presented as supporting material. The coordinate data and bond matrix information are used to create molecules in standard interoperable formats such as .sdf or .mol, as ready-to-compute molecules for the convenience of the user. This process avoids unnecessary regeneration of molecular data and laborious recomputation of already published work.
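The idea of separating coordinate lines from the surrounding text and then deriving a bond matrix can be illustrated with a minimal sketch. This is not ChemEngine's implementation (which is written in Java); it is a Python illustration under stated assumptions: the text has already been extracted from the PDF, each coordinate line has the form "element x y z", and two atoms are taken as bonded when their distance is within the sum of their covalent radii plus a tolerance. The names `COORD_LINE`, `extract_atoms`, `bond_matrix`, the 0.4 Å tolerance and the sample text are all hypothetical choices made for illustration.

```python
import math
import re

# Assumed line layout: element symbol followed by three decimal numbers.
# Real supplementary files use many layouts; this pattern covers only one.
COORD_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)

# Covalent radii in angstroms for a few common elements (illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def extract_atoms(text):
    """Keep only lines that look like atomic coordinates; skip everything else."""
    atoms = []
    for line in text.splitlines():
        match = COORD_LINE.match(line)
        if match and match.group(1) in COVALENT_RADII:
            symbol = match.group(1)
            x, y, z = (float(match.group(i)) for i in (2, 3, 4))
            atoms.append((symbol, x, y, z))
    return atoms


def bond_matrix(atoms, tolerance=0.4):
    """Build a symmetric 0/1 matrix from a simple distance criterion (assumed rule)."""
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(atoms[i][1:], atoms[j][1:])
            limit = COVALENT_RADII[atoms[i][0]] + COVALENT_RADII[atoms[j][0]] + tolerance
            if distance <= limit:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


# Made-up sample mimicking text extracted from a supporting-information page.
SAMPLE = """Table S1. Optimized geometry of water
Cartesian coordinates (angstrom)
O   0.000000   0.000000   0.117300
H   0.000000   0.757200  -0.469200
H   0.000000  -0.757200  -0.469200
Page 12
"""

atoms = extract_atoms(SAMPLE)
bonds = bond_matrix(atoms)
print(atoms)
print(bonds)
```

In this sketch the caption, the header line and the page footer are discarded because they do not match the pattern, and the two O-H contacts are marked as bonds while the H-H pair is not. Writing the result to .mol or .sdf would be a further step that is not shown here.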

Model Reaction

Results and discussion

Conclusion

Journal: Journal of cheminformatics	Publication Date: Dec 1, 2016
Citations: 3	License type: CC BY 4.0
