Abstract

A large amount of scientific knowledge is expressed in a mix of natural language text and mathematical formulae. Therefore, combining natural language processing with formula analysis, so-called mathematical language processing, is necessary to enable computers to understand and retrieve information from such documents. However, as we show in this project, a mathematical notation can change its meaning even within the scope of a single paragraph. This flexibility makes it difficult to extract the exact meaning of a mathematical formula. In this project, we propose a new task direction for grounding mathematical formulae. In particular, we address a widespread misconception in mathematical information retrieval research, namely the presumption that mathematical notations have a fixed meaning within a single document. We manually annotated a long scientific paper to illustrate the task concept. Our high inter-annotator agreement shows that the task is well understood by humans. Our results indicate that it is worthwhile to develop techniques for the proposed task in order to contribute to the further progress of mathematical language processing.

Highlights

  • In modern research, scientific progress is often solely shared in digital form

  • By applying Math Information Retrieval (MathIR) techniques based on natural language processing (NLP), we are able to utilize this extra knowledge of mathematical formulae to build scientific knowledge bases (KBs) (Koprucki and Tabelow, 2016), improve mathematical search engines (Aizawa et al, 2013; Davila and Zanibbi, 2017; Ohashi et al, 2016), or even convert entire scientific papers into executable formats (Kohlhase and Iancu, 2014)

  • Mathematical notation needs to be disambiguated because a letter or symbol in a formula does not keep a single, constant meaning throughout a document (Greiner-Petter et al, 2020a,b)

Summary

Introduction

Scientific progress is often shared solely in digital form. Especially in technical research fields, such as Science, Technology, Engineering, and Mathematics (STEM), it is crucial to access data and new results in a quick and uniform way. Formulae in documents are not independent content that can be understood separately from the surrounding text. For this reason, some initiative projects, e.g., the mathematical language processing (MLP) project (Pagel and Schubotz, 2014), the Mathcat project (Kristianto et al, 2014), and the Part-of-Math (POM) tagger (Youssef, 2017), have been undertaken to integrate NLP techniques into formula analysis. Grounding is the procedure of identifying the smallest groups of letters and symbols in formulae, i.e., math words, that independently refer to a mathematical concept, and associating those math words with a corresponding text description or an entry in an external KB. We checked the feasibility of the proposed task direction for grounding. For this purpose, we annotated a long scientific paper so that all formulae carry math word spans and text descriptions of the corresponding mathematical concepts. Multiple human annotators performed the annotation, and we calculated the inter-annotator agreement to confirm that our task design can be well understood, at least by human beings, and can be performed without individual differences.
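To make the grounding idea more concrete, the following is a minimal sketch of how such annotations might be stored and compared. It is an illustration under stated assumptions, not the annotation format or the agreement measure used in the paper: the `MathWord` schema, the helper `concepts_for`, the toy formulae, and the choice of Cohen's kappa are all hypothetical, since this summary does not say which agreement statistic was computed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MathWord:
    """One occurrence of a math word inside a formula (hypothetical schema)."""
    formula_id: str  # which formula the occurrence belongs to
    start: int       # character offset where the span begins (inclusive)
    end: int         # character offset where the span ends (exclusive)
    surface: str     # the letters/symbols as written, e.g. "x"
    concept: str     # text description or identifier of a KB entry


def concepts_for(surface, annotations):
    """Return the distinct concepts a surface form is grounded to, in order of appearance."""
    seen = []
    for word in annotations:
        if word.surface == surface and word.concept not in seen:
            seen.append(word.concept)
    return seen


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items (assumed measure)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Invented toy data: "x" denotes a scalar in one formula and a vector in the next.
annotations = [
    MathWord("f1", 0, 1, "x", "input value"),
    MathWord("f1", 4, 5, "n", "number of samples"),
    MathWord("f2", 0, 1, "x", "feature vector"),
]

print(concepts_for("x", annotations))
```

Keeping the concept on each occurrence, rather than on the symbol for the whole document, is what allows the same surface form to be grounded differently from one formula to the next.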

Related Work
Grounding of Formulae
Manual Annotation for the Grounding
Targets
Annotation Procedure
Agreements and Mismatch Analyses
Analyses on the Annotation and Notable Phenomena in the Target Document
Findings
Future Work