Abstract

Finding information about annotated chemical reactions for drugs and small compounds is a crucial step for the pharmaceutical industry. This data is often presented in the form of unstructured documents (especially patents), and extracting it manually is time-consuming and costly. In our project UIMA-HPC [1], we describe the combined usage of the Unstructured Information Management Architecture (UIMA) and the Uniform Interface to Computing Resources (UNICORE) for large-scale chemical patent mining. Our approach will incorporate existing software such as chemoCR for image processing (image-to-structure) and OCR for text reconstruction. All components are wrapped inside the UIMA framework pipeline. Using the UIMA framework ensures compatibility between the different components of the pipeline and makes it possible to plug arbitrary annotation modules into the system. Scale-out for large document collections is achieved by the UNICORE framework on high-performance clusters, which enables parallelization of all UIMA nodes. The aim is a fully annotated PDF collection in which all biomedical entities (compound names, reaction schemes, etc.) are connected by references and can thus be easily browsed and searched by the user. The planned workflow is shown in Figure 1.

Figure 1: Planned workflow of our UIMA framework. 'Recognition' and 'annotation' are CPU-intensive parts that are parallelized on demand using the UNICORE framework. 'Merging' checks for cross-annotations (entity in text and image). Finally, an annotated PDF is ...
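The workflow described above (recognition, annotation, merging) can be illustrated with a minimal Python sketch. It is not the UIMA-HPC implementation: the function names, the dictionary layout of a document, and the use of a local thread pool in place of UNICORE job distribution are all assumptions made for illustration, and no UIMA, UNICORE, chemoCR, or OCR API is called.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    doc_id: str
    source: str  # "text" or "image"
    entity: str  # normalized entity name


def recognize(doc):
    """Hypothetical stand-in for OCR and image-to-structure recognition.

    Returns (source, raw_string) pairs taken from pre-filled fields.
    """
    text_hits = [("text", t) for t in doc["text_entities"]]
    image_hits = [("image", i) for i in doc["image_entities"]]
    return text_hits + image_hits


def annotate(doc):
    """Hypothetical annotation step: normalize each recognized string."""
    return [
        Annotation(doc["id"], source, raw.strip().lower())
        for source, raw in recognize(doc)
    ]


def merge(annotations):
    """Keep entities that occur in both text and image of one document."""
    seen = {}
    for a in annotations:
        seen.setdefault((a.doc_id, a.entity), set()).add(a.source)
    return sorted(key for key, sources in seen.items()
                  if sources == {"text", "image"})


def run_pipeline(docs, workers=4):
    """Run recognition and annotation per document in parallel, then merge.

    The thread pool only mimics the idea of distributing documents;
    the abstract assigns this role to UNICORE on a cluster.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_doc = list(pool.map(annotate, docs))
    return {
        doc["id"]: [entity for _, entity in merge(anns)]
        for doc, anns in zip(docs, per_doc)
    }


if __name__ == "__main__":
    docs = [
        {"id": "p1", "text_entities": ["Aspirin", "Benzene"],
         "image_entities": ["aspirin "]},
        {"id": "p2", "text_entities": ["Toluene"], "image_entities": []},
    ]
    print(run_pipeline(docs))
```

Documents are processed independently in the recognition and annotation steps, which is what makes them candidates for on-demand parallelization; merging runs per document afterwards because it needs both the text and the image annotations.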


Summary

Introduction

Finding information about annotated chemical reactions for drugs and small compounds is a crucial step for the pharmaceutical industry.

Large scale chemical patent mining with UIMA and UNICORE
Alexander Klenner1*, Sandra Bergmann2, Marc Zimmermann1, Mathilde Romberg2

