Diacritic Restoration Research Articles

Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data for countless fields. This has attracted the interest of researchers, and many studies in machine learning and natural language processing have been conducted on social media data. However, the language used on social media contains far more noise than formal written language. In this article, we present a study on diacritic restoration, one of the main difficulties in social media text normalization, in order to reduce the noise problem. Diacritics are marks used to change the sound values of letters, and they are used in many languages besides Turkish. We propose a 3-step model to overcome the diacritic restoration problem. In the first step, a candidate word generator produces possible word forms; in the second step, a language validator chooses the correct word forms; and in the final step, Word2vec is used to create vector representations of the words and to make the most appropriate word choice using cosine similarity. The proposed method was tested on two ad-hoc created datasets and a real dataset. Experiments on the small ad-hoc created dataset and the real dataset yielded a relative error reduction of 37.8% with an average performance of 94.5%. In addition, tests on more than 6 M words of the large ad-hoc created dataset yielded an error rate of 3.9%. Furthermore, the proposed method was tested on a binary classification problem consisting of highway traffic data in order to evaluate its effect on classification performance, and a 3.1% increase in classification performance was achieved.
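
The three steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ambiguity table, the toy `LEXICON` standing in for the language validator, and the two-dimensional `VECTORS` standing in for trained Word2vec embeddings are all assumptions made for illustration, as are the function names.

```python
from itertools import product
import math

# ASCII letters that may stand for a diacritized Turkish letter (assumed table).
AMBIGUOUS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

# Toy stand-ins (hypothetical): a real system would use a morphological
# validator and embeddings trained with Word2vec on a large corpus.
LEXICON = {"oldu", "öldü"}  # "became" / "died"
VECTORS = {
    "oldu": [0.1, 0.9],
    "öldü": [0.8, 0.2],
    "kazada": [0.9, 0.1],  # "in the accident"
    "mutlu": [0.1, 0.9],   # "happy"
}

def generate_candidates(word):
    """Step 1: enumerate every diacritized variant of an ASCII word."""
    options = [AMBIGUOUS.get(ch, ch) for ch in word]
    return ["".join(chars) for chars in product(*options)]

def validate(candidates, lexicon):
    """Step 2: keep only the candidates accepted by the lexicon."""
    return [c for c in candidates if c in lexicon]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def choose(valid, context, vectors):
    """Step 3: pick the candidate closest to the mean context vector."""
    if not valid:
        return None
    ctx = [vectors[w] for w in context if w in vectors]
    if not ctx:
        return valid[0]  # no usable context: fall back to the first candidate
    mean = [sum(col) / len(ctx) for col in zip(*ctx)]
    zero = [0.0] * len(mean)
    return max(valid, key=lambda w: cosine(vectors.get(w, zero), mean))

def restore(word, context, lexicon=LEXICON, vectors=VECTORS):
    """Run the three steps; return the input unchanged if nothing validates."""
    valid = validate(generate_candidates(word), lexicon)
    return choose(valid, context, vectors) or word
```

With these toy vectors, `restore("oldu", ["kazada"])` selects "öldü" while `restore("oldu", ["mutlu"])` keeps "oldu", which shows how context resolves a word that has more than one valid diacritized form.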

Diacritic restoration is a necessity in the processing of languages with Latin-based scripts that use letters outside the basic Latin alphabet used by English. Yoruba is one such language, marking an underdot (dot-below) on three characters and tone marks on all seven vowels and two syllabic nasals. The problem of restoring underdotted characters has been fairly well addressed using the character as the linguistic unit for restoration. However, the existing character-based and word-based approaches have not been able to sufficiently address the restoration of tone marks in Yoruba. In this study, we address tone mark restoration as a subset of diacritic restoration. We propose using the syllable (derived from the word) as the linguistic token for tone mark restoration. In our experimental setup, we used Yoruba text collected from various sources, with a total count of 250,336 words. These words, on syllabification, yielded 464,274 syllables. The syllables were divided into training and testing data in different proportions, ranging from 99% for training and 1% for testing to 70% for training and 30% for testing. The aim of evaluating different proportions was to determine how the ratio of training to test data affects the variation in the results. We applied memory-based learning to train the models. We also set up a similar experiment using character tokens in order to compare performance. The results showed that using the syllable increased accuracy at the word level to 96.23%, on average almost 15% higher than that obtained using characters. We also found that using 75% of the data for training and the remaining 25% for testing gives the results with the least variation in a ten-fold cross-validation test. Hybridizing the syllable-based approach with other methods such as lexicon lookup might lead to improvement over the current result.
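
The idea of restoring tone marks per syllable with memory-based learning can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the feature set or the distance metric, so the sketch assumes a three-syllable window (previous, focus, next) and a simple overlap count in the style of nearest-neighbour classifiers. It also assumes that only the combining grave and acute accents are tone marks, and that input arrives already syllabified. The class and function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Assumption: tone marks are the combining grave (low) and acute (high) accents.
TONE_MARKS = {"\u0300", "\u0301"}

def strip_tones(text):
    """Remove tone marks but keep other diacritics such as the underdot."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

def instances(toned_syllables):
    """Build (features, label) pairs from one toned syllable sequence.

    Features are the previous, focus and next syllable without tone marks;
    the label is the focus syllable with its tone mark.
    """
    plain = ["_"] + [strip_tones(s) for s in toned_syllables] + ["_"]
    return [
        (tuple(plain[i:i + 3]), unicodedata.normalize("NFC", syl))
        for i, syl in enumerate(toned_syllables)
    ]

class MemoryBasedTagger:
    """Stores every training instance and classifies by feature overlap."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, sequences):
        for seq in sequences:
            self.memory.extend(instances(seq))
        return self

    def predict(self, plain_syllables):
        padded = ["_"] + list(plain_syllables) + ["_"]
        restored = []
        for i in range(len(plain_syllables)):
            feats = tuple(padded[i:i + 3])
            # Only stored instances with the same focus syllable are eligible.
            scored = [
                (sum(a == b for a, b in zip(feats, stored)), label)
                for stored, label in self.memory
                if stored[1] == feats[1]
            ]
            if not scored:
                restored.append(feats[1])  # unseen syllable: leave unmarked
                continue
            scored.sort(key=lambda pair: -pair[0])
            votes = Counter(label for _, label in scored[: self.k])
            restored.append(votes.most_common(1)[0][0])
        return restored
```

Trained on the toy sequences `[["bà", "bá"], ["ì", "yá"]]`, the tagger maps the unmarked syllables `["ba", "ba"]` back to `["bà", "bá"]`, because the neighbouring syllables distinguish the two occurrences of "ba". A realistic setup would need a syllabifier and a far larger corpus, as in the experiment described above.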

Articles published on Diacritic Restoration

Diacritic Restoration for Yoruba Text with under dot and Diacritic Mark Based on LSTM

Arabic Syntactic Diacritics Restoration Using BERT Models

Diacritics correction in Turkish with context-aware sequence to sequence modeling

Correcting Diacritics and Typos with a ByT5 Transformer Model

Light Diacritic Restoration to Disambiguate Homographs in Modern Arabic Texts

Diacritics Restoration using BERT with Analysis on Czech language

Diacritics restoration based on word n-grams for Slovak texts

Open Vocabulary Arabic Diacritics Restoration

Automatic Diacritics Restoration for Tunisian Dialect

Integrating Diacritics Restoration and Question Classification into Vietnamese Question Answering System

Diacritic restoration of Turkish tweets with word2vec

Character-Based Machine Learning vs. Language Modeling for Diacritics Restoration

A survey of diacritic restoration in abjad and alphabet writing systems

Instant Diacritics Restoration System for Sindhi Accent Prediction using N-Gram and Memory-Based Learning Approaches

RESTORING TONE-MARKS IN STANDARD YORÙBÁ ELECTRONIC TEXT: IMPROVED MODEL

DeASCIIfication approach to handle diacritics in Turkish information retrieval

Deep Learning Framework with Confused Sub-Set Resolution Architecture for Automatic Arabic Diacritization

A survey of automatic Arabic diacritization techniques

SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

Arabic diacritic restoration approach based on maximum entropy models
