150 years of written Dutch

Jozefien Piersoul, Robbert De Troij, Freek Van De Velde

doi:10.5117/nedtaa2021.3.002.pier

Journal: Nederlandse Taalkunde
Publication Date: Dec 1, 2021
Citations: 2

Affiliation: KU Leuven, Radboud University Nijmegen

Abstract

In this article, we present a new corpus spanning 163 years of written Dutch. This Dutch Corpus of Contemporary and late Modern Periodicals (Dutch C-CLAMP) comprises 47,738 part-of-speech tagged articles published in Dutch periodicals from 1837 until 1999, totaling approximately 200 million tokens. We explain the measures we took to overcome the shortcomings of existing corpora of historical Dutch covering the same period. We provide a detailed description of how the corpus has been compiled and enriched. Several aspects are covered: text markup; preprocessing of the data, including foreign language recognition and spelling normalization; and the enrichment of both the textual data and the metadata on the authors of the corpus files. We also carry out two case studies to illustrate the reliability of the corpus.
