Abstract

We publish a dataset containing more than 40’000 manually annotated references from a broad corpus of books and journal articles on the history of Venice. References were taken from both reference lists and footnotes, and include primary and secondary sources in full or abbreviated form. The dataset comprises references from publications from the 19th to the 21st century. References were collected from a newly digitized corpus and manually annotated in all their constituent parts. The dataset is stored in a GitHub repository and persisted in Zenodo, and it is accompanied by code to train parsers that extract references from other publications. Two trained Conditional Random Fields models are provided along with their evaluation, to act as a baseline for a parsing shared task. No comparable public dataset exists to support the task of reference parsing in the humanities. The dataset is of interest to all working in the domain of reference parsing and citation extraction in the humanities.

Funding Statement: The project is supported by the Swiss National Fund, with grants 205121_159961 and P1ELP2_168489.

Highlights

  • Citation indexes, such as Google Scholar, the Web of Science and Scopus, are among the main literature retrieval tools available to modern scholars

  • The disciplines traditionally part of the humanities are still poorly covered by citation indexes of any sort [8], which hinders both the work of humanists and the understanding of the humanities as scholarly disciplines [1], not to mention their evaluation [4]

  • A key aspect of the problem is the lack of citation data, especially for local publications not in English, and for non-article publications such as scholarly monographs


Summary

Giovanni Colavizza and Matteo Romanello

We publish a dataset containing more than 40’000 manually annotated references from a broad corpus of books and journal articles on the history of Venice. The availability of citation data depends on the technical challenge of reference parsing and extraction from literature in the humanities. Annotated data with sufficient coverage are lacking in two critical areas: locality (of language and scholarly practice) and time (going back at least to the 19th century, when modern academic scholarship starts). These two challenges make reference parsing in the humanities, while not intrinsically different from reference parsing in the sciences, more involved. The manually annotated dataset of references released here is part of the Linked Books project, whose goal is to develop an in-depth approach to the problem of indexing humanities publications via citations. It is meant to contribute to and encourage a better integration of datasets and technical tools in this domain.
