Abstract

Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on markers such as spaces and punctuation. However, when dealing with older sources such as manuscripts written in scripta continua, ancient epigraphy or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification of each character as word boundary or in-word content, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.

Highlights

  • Tokenization of spaceless strings is a task that is difficult for computers, while humans easily read a string like "whathumanscando"

  • In the context of text mining of HTR or OCR output, lemmatization and tokenization of medieval Western languages are quite often a pre-processing step for further research, sustaining analyses such as authorship attribution and corpus linguistics, or allowing full-text search [3]. It must be stressed in this study that the difficulty inherent to segmentation is different for scripta continua than that for languages such as Chinese, for which an already impressive amount of work has been done

  • Output of the model is a mask to be applied to the input: in the mask, each character is classified either as a word boundary or as word content
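The mask-based output described above can be sketched as follows. The label characters "B" (word boundary) and "I" (in-word content) are illustrative assumptions, not the model's actual output alphabet:

```python
# Minimal sketch of applying a character-level boundary mask to a
# spaceless input string. Labels are assumed: "B" = word boundary
# (a space is inserted after the character), "I" = in-word content.

def apply_mask(text: str, mask: str) -> str:
    """Rebuild a spaced string from a spaceless one and its mask."""
    assert len(text) == len(mask), "mask must align with input characters"
    out = []
    for char, label in zip(text, mask):
        out.append(char)
        if label == "B":
            out.append(" ")
    return "".join(out).strip()

print(apply_mask("whathumanscando", "IIIBIIIIIBIIBII"))  # → what humans can do
```

The mask is the same length as the input, so tokenization reduces to a per-character classification problem rather than a dictionary lookup.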

Summary

INTRODUCTION

Tokenization of spaceless strings is a task that is difficult for computers, while humans easily read a string like "whathumanscando". In the context of text mining of HTR or OCR output, lemmatization and tokenization of medieval Western languages are quite often a pre-processing step for further research, sustaining analyses such as authorship attribution and corpus linguistics, or allowing full-text search [3]. It must be stressed in this study that the difficulty inherent to segmentation is different for scripta continua than that for languages such as Chinese, for which an already impressive amount of work has been done. Spelling variation and rich morphology make a dictionary-based approach rather difficult, as it would have to account for a high number of different spellings, making the computation highly complex
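The combinatorial cost behind this complexity can be made concrete: without spaces or punctuation, an n-character string can be split at any subset of its n-1 inter-character positions, so a naive dictionary approach would have to consider up to 2^(n-1) candidate segmentations. A minimal illustration (not part of the released software):

```python
# Illustrative sketch: the number of candidate segmentations of a
# spaceless n-character string grows exponentially, which is why
# exhaustive dictionary-based segmentation quickly becomes intractable.

def count_segmentations(n: int) -> int:
    """Ways to segment an n-character string: one choice (split or not)
    at each of the n-1 inter-character positions."""
    return 2 ** (n - 1) if n > 0 else 0

for n in (5, 10, 20):
    print(n, count_segmentations(n))  # 16, 512, 524288
```

This exponential growth motivates the per-character classification approach, which scores each position once instead of enumerating segmentations.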

Architecture
Main Dataset
Results
Example of Outputs
Latin Prose and Poetic Corpora
Medieval Latin corpora
Latin epigraphic corpora
