Abstract

In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios depend heavily on the data set used. However, existing scholarly data sets are limited in several regards. In this paper, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. Apart from providing the papers’ plain text, in-text citations were annotated via global identifiers. Furthermore, citing and cited publications were linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, can not only enhance the future evaluation of research paper-based and citation context-based approaches, but also serve as a basis for new ways to analyze in-text citations, as we show prototypically in this article.

Highlights

  • A variety of tasks use scientific paper collections to help researchers in their work

  • In addition to the paper plain text files and the references database, we provide the citation contexts of all successfully resolved references extracted to a CSV file as well as a script to create custom exports

  • The high percentages of citation links contained within the data set can be explained by the fact that in physics and mathematics, which make up a large part of the data set, it is common to self-archive papers on arXiv

Summary

Introduction

A variety of tasks use scientific paper collections to help researchers in their work. The evaluation of approaches developed for these tasks, as well as the applicability and usefulness of the resulting systems in real-world scenarios, depends heavily on the data set used. Such a data set is typically a collection of papers provided in full text, or a set of already extracted citation contexts consisting of, for instance, 1–3 sentences each. Note that these data sets only contain the publications themselves, typically in PDF format. Using such data sets for paper-based or citation-based approaches is troublesome, since one must preprocess the data, i.e., (1) extract the content without introducing too much noise, (2) specify global identifiers for cited papers, and (3) annotate citations with those identifiers.

Existing data sets and their listed sizes:

  • Dataset 2 (Sugiyama and Kan 2015): 100 k
  • arXiv CS (Färber et al. 2018): 90 k
  • ACL-ARC (Bird et al. 2008): 11 k
  • ACL-AAN (Radev et al. 2013): 18 k
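
Preprocessing steps (2) and (3) named in the introduction, followed by the extraction of one-sentence citation contexts, might look roughly like the sketch below. It is only an illustration under stated assumptions, not the procedure used to build the data set: the `[n]` marker style, the `{{cite:...}}` tag syntax, the identifier strings, and the function names `annotate` and `citation_contexts` are all hypothetical, and step (1), extracting clean plain text, is assumed to have been done already.

```python
import re

# Hypothetical mapping from local reference markers to global identifiers.
# The identifier strings are invented for illustration only.
LOCAL_TO_GLOBAL = {"[1]": "id-0001", "[2]": "id-0002"}

# Assumed marker style: numeric references such as [1], [23].
MARKER = re.compile(r"\[\d+\]")

# Assumed annotation syntax for a resolved citation: {{cite:<global id>}}.
TAG = re.compile(r"\{\{cite:([^}]+)\}\}")


def annotate(text, mapping):
    """Replace each resolvable local marker with a global-identifier tag.

    Markers without an entry in `mapping` are left unchanged.
    """
    def replace(match):
        global_id = mapping.get(match.group(0))
        if global_id is None:
            return match.group(0)
        return "{{cite:" + global_id + "}}"

    return MARKER.sub(replace, text)


def citation_contexts(annotated_text):
    """Return (global_id, sentence) pairs, one sentence per context.

    Sentence splitting here is a naive split on ., ! or ? followed by
    whitespace; real papers need a more robust splitter.
    """
    sentences = re.split(r"(?<=[.!?])\s+", annotated_text.strip())
    return [(gid, s) for s in sentences for gid in TAG.findall(s)]


if __name__ == "__main__":
    raw = "Graphs help [1]. Other work differs [2]. Unresolved here [3]."
    for global_id, sentence in citation_contexts(annotate(raw, LOCAL_TO_GLOBAL)):
        print(global_id, "|", sentence)
```

Widening the context to the 1–3 sentences mentioned above would mean returning neighbouring sentences alongside the citing one; unresolved markers such as `[3]` are deliberately skipped, mirroring the idea that only successfully resolved references yield contexts.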