Abstract

In this study, we address the task of classifying historical texts by their assumed period of writing. This task is useful in digital humanities studies, where many texts have unidentified publication dates. For years, the typical approach to temporal text classification was supervised, using machine-learning algorithms. These algorithms require careful feature engineering and considerable domain expertise to design a feature extractor that transforms the raw text into a feature vector from which the classifier can learn to classify any unseen valid input. Recently, deep learning has produced extremely promising results for various tasks in natural language processing (NLP). The primary advantage of deep learning is that the feature layers are not designed by human engineers; instead, the features are learned from data with a general-purpose learning procedure. We investigated deep learning models for period classification of historical texts. We compared three common models, namely paragraph vectors, convolutional neural networks (CNN), and recurrent neural networks (RNN), with conventional machine-learning methods. We demonstrate that the CNN and RNN models outperformed the paragraph vector model and the conventional supervised machine-learning algorithms. In addition, we constructed word embeddings for each time period and analyzed semantic changes of word meanings over time.

Highlights

  • The aim of preserving cultural heritage and rendering it more accessible has motivated the digitization of historical texts over the last decade

  • We focus on neural language models for the period classification of historical texts

  • Our research focuses on the period classification of historical texts from the Responsa project


Summary

INTRODUCTION

The aim of preserving cultural heritage and rendering it more accessible has motivated the digitization of historical texts over the last decade. In recent years, considerable research has been devoted to diachronic lexical resources, which comprise terms from different language periods [Borin and Forsberg, 2011, Liebeskind et al., 2013, Riedl et al., 2014]. These resources are primarily used for studying language changes and supporting searches in historical domains, bridging the lexical gap between modern and ancient languages. Supervised machine-learning algorithms use training data, consisting of input examples paired with their desired outputs, to learn a function. Most conventional supervised machine-learning algorithms for the period classification of historical texts are either rule-based or corpus-based. Their efficiency depends on the prior feature engineering.
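To make the conventional supervised setting concrete, the following is a minimal sketch of period classification with a hand-designed feature extractor. It is not the method evaluated in the study: the bag-of-words features, the Naive Bayes classifier, the toy sentences, and the "early"/"late" period labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Hand-designed feature extractor: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def train(examples):
    """Fit a multinomial Naive Bayes model from (text, period) pairs."""
    word_counts = defaultdict(Counter)  # period -> word -> count
    doc_counts = Counter()              # period -> number of training texts
    vocab = set()
    for text, period in examples:
        feats = extract_features(text)
        word_counts[period].update(feats)
        doc_counts[period] += 1
        vocab.update(feats)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the period with the highest add-one-smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    feats = extract_features(text)
    best_period, best_score = None, float("-inf")
    for period in sorted(doc_counts):
        # Log prior: fraction of training texts from this period.
        score = math.log(doc_counts[period] / total_docs)
        denom = sum(word_counts[period].values()) + len(vocab)
        for word, count in feats.items():
            if word in vocab:  # ignore words never seen in training
                score += count * math.log((word_counts[period][word] + 1) / denom)
        if score > best_score:
            best_period, best_score = period, score
    return best_period


# Toy, invented training data; texts and period labels are placeholders.
TRAIN = [
    ("thou shalt keep the sabbath", "early"),
    ("thou art bound by the ruling", "early"),
    ("the ruling applies to electricity", "late"),
    ("you may use the telephone", "late"),
]

model = train(TRAIN)
print(predict(model, "thou shalt keep the ruling"))
```

In this sketch, the quality of the prediction rests entirely on `extract_features`, which is the dependence on prior feature engineering described above; a deep-learning model would instead learn its representation from the raw text.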

Diachronic data and tasks
The Responsa corpus and diachronic tasks
SUPERVISED MACHINE LEARNING FRAMEWORK
Conventional Machine-Learning models
Deep-Learning models
Word Embeddings
Convolutional Neural Networks
Recurrent Neural Networks
EVALUATION
Period Classification
Evaluation measures
Neural Networks Architectures
Conventional Machine-learning methods
Deep-learning methods
SEMANTIC CHANGES OF WORDS MEANING OVER TIME
Word Comparisons
Periods of Change
CONCLUSIONS AND FUTURE WORK