Character-Based Machine Learning vs. Language Modeling for Diacritics Restoration

Jurgita Kapočiūtė-Dzikienė, Aušra Vidugirienė, Andrius Davidsonas

doi:10.5755/j01.itc.46.4.18066

Open Access

https://doi.org/10.5755/j01.itc.46.4.18066

Abstract

In this research we compare two approaches to diacritics restoration, character-based machine learning and language modeling, and identify the better solution to the diacritization problem. The parameters of the tested approaches (i.e., a wide variety of feature types for the character-based method and the value of n for the n-gram language-modeling method) were tuned to achieve the highest possible accuracy. Although the main focus is on the Lithuanian language, we posit that the findings can also be applied to other, similar (Latvian or Slavic) languages. In the experiments we measured the performance of the approaches on 10 domains (including normative texts and non-normative Internet comments). The best results, reaching ~99.5% and ~98.4% accuracy on characters and words, respectively, were achieved with the tri-gram language-modeling method. It outperformed the character-based machine learning approach with an optimal composed feature set by ~1.4% and ~3.8% in accuracy on characters and words, respectively.
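To make the language-modeling idea concrete, the following is a minimal Python sketch of word-level n-gram diacritics restoration together with the two metrics the abstract reports (accuracy on characters and on words). It is not the authors' implementation: the greedy left-to-right decoding, the count-based backoff, the function names (`strip_diacritics`, `train`, `restore`, `accuracy`) and the two-sentence toy corpus are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'karštas' -> 'karstas'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences, n=3):
    """Collect candidate forms and n-gram counts (orders 1..n) from diacritized text."""
    candidates = defaultdict(set)  # stripped word -> diacritized forms seen in training
    counts = Counter()             # word n-gram (as a tuple) -> frequency
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            candidates[strip_diacritics(word)].add(word)
            for order in range(1, n + 1):
                if i - order + 1 >= 0:
                    counts[tuple(words[i - order + 1:i + 1])] += 1
    return candidates, counts


def restore(text, candidates, counts, n=3):
    """Greedy restoration: prefer the candidate seen with the longest matching history."""
    restored = []
    for word in text.lower().split():
        options = candidates.get(word, {word})  # unknown words are left unchanged

        def score(option):
            # Back off from the longest usable history down to the unigram.
            for order in range(min(n, len(restored) + 1), 0, -1):
                history = tuple(restored[len(restored) - order + 1:])
                count = counts[history + (option,)]
                if count:
                    return (order, count)
            return (0, 0)

        restored.append(max(sorted(options), key=score))
    return " ".join(restored)


def accuracy(predicted, reference):
    """Return (character accuracy, word accuracy) for two aligned strings."""
    pred_words, ref_words = predicted.split(), reference.split()
    word_acc = sum(p == r for p, r in zip(pred_words, ref_words)) / len(ref_words)
    char_acc = sum(p == r for p, r in zip(predicted, reference)) / len(reference)
    return char_acc, word_acc


# Toy corpus (assumption): 'karštas' (hot) and 'karstas' (coffin) both strip to 'karstas'.
corpus = ["vanduo yra karštas", "medinis karstas buvo sunkus"]
candidates, counts = train(corpus, n=3)

reference = "vanduo yra karštas"
prediction = restore(strip_diacritics(reference), candidates, counts, n=3)
print(prediction, accuracy(prediction, reference))
```

In this toy setting the surrounding words decide between the two readings of "karstas". A realistic system would need a large training corpus, smoothing, and a search over whole sentences rather than greedy choices.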

Journal: Information Technology And Control
Publication Date: Dec 14, 2017
Citations: 5
