A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics.

Martin Gerlach, Francesc Font-Clos

doi:10.3390/e22010126

Open Access

https://doi.org/10.3390/e22010126

Journal: Entropy
Publication Date: Jan 20, 2020
Citations: 32
License type: CC BY 4.0

Affiliation: Northwestern University, University of Milan

Abstract

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, and the resulting corpus itself at three levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
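
The three levels of granularity named in the abstract can be illustrated with a minimal sketch. This is not the SPGC pipeline: the function names `tokenize` and `count_words`, the sample sentence, and the simple lowercase alphabetic tokenizer are all assumptions made for illustration, and the actual corpus may tokenize and filter text differently.

```python
import re
from collections import Counter


def tokenize(raw_text):
    """Return lowercase alphabetic word tokens in reading order.

    Assumption: a deliberately simple tokenizer, used only to show
    how a token timeseries is derived from raw text.
    """
    return re.findall(r"[a-z]+", raw_text.lower())


def count_words(tokens):
    """Collapse the ordered token sequence into word counts."""
    return Counter(tokens)


# Level 1: raw text (hypothetical sample, not taken from the corpus).
raw_text = "The whale, the whale! A sail, a sail!"

# Level 2: timeseries of word tokens (order is preserved).
tokens = tokenize(raw_text)

# Level 3: counts of words (order is discarded).
counts = count_words(tokens)

print(tokens)
print(counts.most_common())
```

Each level discards information relative to the previous one: the token sequence drops punctuation and casing, and the counts drop word order. Which level is appropriate depends on whether an analysis needs sequence information or only frequencies.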
