ChemTables: a dataset for semantic classification on tables in chemical patents

Zenan Zhai, Dat Quoc Nguyen, Christian Druckenbrodt, Karin Verspoor, Trevor Cohn, Saber A Akhondi, Camilo Thorne

doi:10.1186/s13321-021-00568-2

Open Access

https://doi.org/10.1186/s13321-021-00568-2

Abstract

Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables with labels of their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT, to ChemTables. The best performing model, Table-BERT, achieves a micro-averaged F1 score of 88.66 on the table classification task. The ChemTables dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. The code and models evaluated in this work are available in a GitHub repository at https://github.com/zenanz/ChemTables.
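The abstract reports results as a micro-averaged F1 score. As a minimal sketch of what that metric measures, the snippet below pools true positives, false positives and false negatives over all classes before computing F1. It is an illustration only, not the evaluation code from the authors' repository; the function name micro_f1 and the label names are hypothetical and are not the ChemTables label set. The sketch assumes each table receives exactly one label, in which case micro-averaged F1 coincides with accuracy.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label classification.

    Counts are pooled over all classes before F1 is computed.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[g] += 1   # missed the true class g
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denom = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denom if denom else 0.0


# Hypothetical labels for five tables (not the real ChemTables categories).
gold = ["spectroscopic", "physical", "pharmacological", "spectroscopic", "other"]
pred = ["spectroscopic", "physical", "spectroscopic", "spectroscopic", "other"]

score = micro_f1(gold, pred)
print(f"micro-F1 = {100 * score:.2f}")  # four of five tables correct
```

Because every misclassified table adds one false positive and one false negative, the pooled score here is 4/5, the same as the fraction of correctly labelled tables.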

Highlights

A large number of chemical compounds are first published in patents
In addition to introducing the ChemTables data set, we provide here an empirical comparison of several strong baseline approaches to table classification using this corpus, including conventional machine learning models based on the Naïve Bayes (NB) and Support Vector Machine (SVM) algorithms, as well as neural models TabNet [17], ResNet [18] and Table-BERT [19]
We show that the tables in the ChemTables dataset are sufficient to train state-of-the-art machine learning methods

Summary

Introduction

A large number of chemical compounds are first published in patents. On average, it takes one to three years for compounds disclosed in patents to appear in the scientific literature [1], and only a small fraction of these compounds ever appear in publications at all. Chemical patents are an important resource for the development of information management tools. Chemical patents typically present novel compounds by specifying their chemical structure either as an image or through a systematic chemical name in the text, from which state-of-the-art name-to-structure tools such as OPSIN [3] and MarvinSketch [4] can reliably generate the structure. To back up the invention’s claims, patents contain additional information related to these compounds, either characterising them further, for example through physical or spectroscopic data (Fig. 1a) or information related to their preparation (Fig. 1b), or exemplifying their claimed use through further information or numerical data. Numerical data of very high interest to researchers, such as novel pharmacological results, are typically presented in structured form, such as tables [5].

Journal: Journal of Cheminformatics	Publication Date: Dec 1, 2021
Citations: 3	License type: open-access
