Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic

Johan Krause, Igor Shapiro, Michael Färber, Tarek Saier

doi:10.18653/v1/2021.sdp-1.8

Abstract

Applications based on scholarly data are of ever-increasing importance. This results in disadvantages for areas where high-quality data and compatible systems are not available, such as non-English publications. To advance the mitigation of this imbalance, we use Cyrillic script publications from the CORE collection to create a high-quality data set for metadata extraction. We utilize our data for training and evaluating sequence labeling models to extract title and author information. Retraining GROBID on our data, we observe significant improvements in terms of precision and recall and achieve even better results with a self-developed model. We make our data set covering over 15,000 publications as well as our source code freely available.

Highlights

Retraining GROBID on our data, we observe significant improvements in terms of precision and recall and achieve even better results with a self-developed model
We make our data set covering over 15,000 publications as well as our source code freely available
Limitations of scholarly data and approaches based thereon directly translate into disadvantages for the affected publications, in terms of, for example, discoverability and impact

Summary

Data Selection

Although many large scholarly data sets exist nowadays, most are restricted in terms of language coverage, language-related metadata, or availability of full-text documents. To obtain Cyrillic script publications, we first filter the whole collection for the language labels [...]. Examination of our data at this point reveals that it contains documents other than scientific papers, such as lecture notes, lecture schedules, and [...]. [...] leaves us with 15,553 papers, which form the basis for our work and the provided Cyrillic data set.

To prevent having to remove large portions of the identified Cyrillic papers due to missing metadata (see previous section), we decide to focus on publications’ title and list of authors. In order to create training data for sequence labeling tasks, we obtain the JSON metadata and PDF of each of the selected publications from CORE. From the PDF, we extract the plain text contained in the first page using PDFMiner, identify the title and authors from the JSON metadata, and insert labels (see Section 3.2.1 for details).
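The labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names `<title>` and `<author>`, the helper names, and the exact-match strategy are assumptions made for illustration, and the actual procedure is the one described in Section 3.2.1. The only library call used is `extract_text` from pdfminer.six, imported lazily so the labeling logic can be tried without a PDF.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse whitespace so PDF line breaks do not block string matching."""
    return re.sub(r"\s+", " ", text).strip()


def first_page_text(pdf_path: str) -> str:
    """Extract the plain text of the first page using pdfminer.six."""
    from pdfminer.high_level import extract_text  # lazy import
    return extract_text(pdf_path, page_numbers=[0])


def insert_labels(page_text: str, title: str, authors: List[str]) -> str:
    """Wrap the first exact occurrence of the title and of each author in tags.

    Hypothetical labeling scheme: values that do not occur verbatim in the
    page text are skipped and left unlabeled.
    """
    labeled = normalize(page_text)
    spans = [("title", normalize(title))]
    spans += [("author", normalize(a)) for a in authors]
    for tag, value in spans:
        start = labeled.find(value)
        if value and start != -1:
            end = start + len(value)
            labeled = f"{labeled[:start]}<{tag}>{value}</{tag}>{labeled[end:]}"
    return labeled


# Usage with an in-memory string standing in for the first-page text
page = "УДК 004.9\nМетоды извлечения\nметаданных\nИ. И. Иванов, П. П. Петров\nАннотация"
result = insert_labels(
    page,
    title="Методы извлечения метаданных",
    authors=["И. И. Иванов", "П. П. Петров"],
)
print(result)
```

Exact matching is the simplest choice and will miss titles or names whose spelling in the PDF differs from the JSON metadata; a fuzzy matcher could be substituted where that matters.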

Application

GROBID Training

Data Preprocessing

Evaluation

Conclusion