Light Diacritic Restoration to Disambiguate Homographs in Modern Arabic Texts

Abstract

Diacritic restoration (also known as diacritization or vowelization) is the process of inserting the correct diacritical marks into a text. Modern Arabic is typically written without diacritics, e.g., in newspapers. The absence of diacritical marks often causes ambiguity, and although native speakers are adept at resolving it, they occasionally fail. Diacritic restoration is a classical problem in computer science, but whereas most work tackles the full (heavy) diacritization of text, we are interested in diacritizing text with as few diacritics as possible. Studies have shown that fully diacritized text is visually displeasing and slows down reading. This article proposes a system that diacritizes homographs using the least number of diacritics, hence the name "light." Homographs form a large class of words; we deal with those that share the spelling but not the meaning. With fewer diacritics, we expect no effect on reading speed, while eye strain is reduced. The system comprises a morphological analyzer and a context-similarity component. The morphological analyzer generates all diacritization candidates for a word; a statistical approach combined with context similarity then resolves the homographs. Experimentally, the system shows very promising results, with a best accuracy of 85.6%.
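The pipeline described above (analyzer-generated candidates, then statistical context matching) can be sketched as follows. This is an illustrative toy, not the authors' system: the candidate set and the context words for the homograph علم are invented for the example.

```python
# Toy sketch of the disambiguation step: a morphological analyzer is assumed
# to have produced the diacritized candidates; context overlap picks one.
from collections import Counter

# Hypothetical analyzer output for the undiacritized form "علم": each
# diacritized candidate is paired with words it typically co-occurs with
# in a (toy) training corpus.
CANDIDATES = {
    "عِلْم": ["طلب", "نافع", "دراسة"],   # 'ilm: knowledge
    "عَلَم": ["رفرف", "سارية", "دولة"],  # 'alam: flag
    "عَلِمَ": ["أنه", "بالخبر", "أمس"],  # 'alima: he knew
}

def context_score(candidate_contexts, sentence_words):
    """Count how many sentence words appear among the candidate's
    typical context words (a stand-in for context similarity)."""
    ctx = Counter(candidate_contexts)
    return sum(ctx[w] for w in sentence_words)

def disambiguate(sentence_words):
    """Return the diacritized candidate whose contexts best match the
    sentence; ties fall back to the first-listed candidate."""
    return max(CANDIDATES, key=lambda c: context_score(CANDIDATES[c], sentence_words))

# A sentence about something fluttering on a mast selects the 'flag' reading.
print(disambiguate(["رفرف", "فوق", "سارية"]))  # -> عَلَم
```

In the full system, only the minimal subset of the winning candidate's diacritics needed to separate it from its competitors would actually be emitted, which is what makes the scheme "light."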

Similar Papers
  • Conference Article
  • Cited by: 16
  • 10.18653/v1/2020.acl-main.732
A Multitask Learning Approach for Diacritic Restoration
  • Jan 1, 2020
  • Sawsan Alqahtani + 2 more

In many languages, such as Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in more ambiguous text, making computational processing of such text more difficult. Diacritic restoration is the task of restoring missing diacritics in written text. Most state-of-the-art diacritic restoration models are built on character-level information, which helps generalize the model to unseen data but presumably loses useful information at the word level. To compensate for this loss, we investigate the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for the tasks we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to state-of-the-art models that are more complex, relying on morphological analyzers and/or much more data (e.g., dialectal data).

  • Research Article
  • Cited by: 87
  • 10.1017/s1351324913000284
A survey of automatic Arabic diacritization techniques
  • Oct 10, 2013
  • Natural Language Engineering
  • Aqil M Azmi + 1 more

In Modern Standard Arabic, texts are typically written without diacritical markings. Diacritics are important for clarifying the sense and meaning of words, and their absence may lead to ambiguity even for native speakers. Natives often disambiguate the meaning successfully through context; however, many Arabic applications, such as machine translation, text-to-speech, and information retrieval, are vulnerable to the lack of diacritics. The process of automatically restoring diacritical marks is called diacritization or diacritic restoration. In this paper we discuss the properties of the Arabic language and the issues related to the lack of diacritical marking, followed by a survey of recent algorithms developed to solve the diacritization problem. We also look into future trends for researchers working in this area.

  • Book Chapter
  • Cited by: 3
  • 10.1093/acrefore/9780199384655.013.606
Parts of Speech, Lexical Categories, and Word Classes in Morphology
  • Jan 30, 2020
  • Oxford Research Encyclopedia of Linguistics
  • Jaklin Kornfilt

The term “part of speech” is a traditional one that has been in use since grammars of Classical Greek (e.g., Dionysius Thrax) and Latin were compiled; for all practical purposes, it is synonymous with the term “word class.” The term refers to a system of word classes, whereby class membership depends on similar syntactic distribution and morphological similarity (as well as, in a limited fashion, on similarity in meaning—a point to which we shall return). By “morphological similarity,” reference is made to functional morphemes that are part of words belonging to the same word class. Some examples for both criteria follow: The fact that in English, nouns can be preceded by a determiner such as an article (e.g., a book, the apple) illustrates syntactic distribution. Morphological similarity among members of a given word class can be illustrated by the many adverbs in English that are derived by attaching the suffix –ly, that is, a functional morpheme, to an adjective (quick, quick-ly). A morphological test for nouns in English and many other languages is whether they can bear plural morphemes. Verbs can bear morphology for tense, aspect, and mood, as well as voice morphemes such as passive, causative, or reflexive, that is, morphemes that alter the argument structure of the verbal root. Adjectives typically co-occur with either bound or free morphemes that function as comparative and superlative markers. Syntactically, they modify nouns, while adverbs modify word classes that are not nouns—for example, verbs and adjectives. Most traditional and descriptive approaches to parts of speech draw a distinction between major and minor word classes. The four parts of speech just mentioned—nouns, verbs, adjectives, and adverbs—constitute the major word classes, while a number of others, for example, adpositions, pronouns, conjunctions, determiners, and interjections, make up the minor word classes. 
Under some approaches, pronouns are included in the class of nouns, as a subclass. While the minor classes are probably not universal, (most of) the major classes are. It is largely assumed that nouns, verbs, and probably also adjectives are universal parts of speech. Adverbs might not constitute a universal word class. There are technical terms that are equivalents to the terms of major versus minor word class, such as content versus function words, lexical versus functional categories, and open versus closed classes, respectively. However, these correspondences might not always be one-to-one. More recent approaches to word classes don’t recognize adverbs as belonging to the major classes; instead, adpositions are candidates for this status under some of these accounts, for example, as in Jackendoff (1977). Under some other theoretical accounts, such as Chomsky (1981) and Baker (2003), only the three word classes noun, verb, and adjective are major or lexical categories. All of the accounts just mentioned are based on binary distinctive features; however, the features used differ from each other. While Chomsky uses the two category features [N] and [V], Jackendoff uses the features [Subj] and [Obj], among others, focusing on the ability of nouns, verbs, adjectives, and adpositions to take (directly, without the help of other elements) subjects (thus characterizing verbs and nouns) or objects (thus characterizing verbs and adpositions). Baker (2003), too, uses the property of taking subjects, but attributes it only to verbs. In his approach, the distinctive feature of bearing a referential index characterizes nouns, and only those. Adjectives are characterized by the absence of both of these distinctive features. Another important issue addressed by theoretical studies on lexical categories is whether those categories are formed pre-syntactically, in a morphological component of the lexicon, or whether they are constructed in the syntax or post-syntactically. 
Jackendoff (1977) is an example of a lexicalist approach to lexical categories, while Marantz (1997), and Borer (2003, 2005a, 2005b, 2013) represent an account where the roots of words are category-neutral, and where their membership to a particular lexical category is determined by their local syntactic context. Baker (2003) offers an account that combines properties of both approaches: words are built in the syntax and not pre-syntactically; however, roots do have category features that are inherent to them. There are empirical phenomena, such as phrasal affixation, phrasal compounding, and suspended affixation, that strongly suggest that a post-syntactic morphological component should be allowed, whereby “syntax feeds morphology.”

  • Conference Article
  • Cited by: 14
  • 10.1109/ialp.2012.18
A Pointwise Approach for Vietnamese Diacritics Restoration
  • Nov 1, 2012
  • Tuan Anh Luu + 1 more

The automatic insertion of diacritics into electronic texts is necessary for a number of languages, including French, Romanian, Croatian, Sindhi, and Vietnamese. When diacritics are removed from a word and the resulting string of characters is not a word, it is easy to recover the diacritics. Sometimes, however, the resulting string is also a word, possibly with different grammatical properties or a different meaning, and this makes recovery of the missing diacritics difficult for software as well as for human readers. This paper is the first to study automatic diacritic restoration in Vietnamese texts. Modern Vietnamese is a complex language with many diacritical marks, and white space does not always function as a word separator. This paper proposes a pointwise approach for automatically recovering missing diacritics, using three features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. Our experiments show that the proposed method can recover diacritics with a 94.7% accuracy rate.
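The pointwise formulation classifies each syllable independently from features of its neighbourhood. A minimal sketch of the syllable n-gram feature extraction, with feature template names invented for illustration:

```python
# Sketch (not the paper's code) of pointwise feature extraction: for each
# diacritic-stripped syllable, collect surrounding syllable n-grams that a
# per-syllable classifier would consume.

def ngram_features(syllables, i, n=2):
    """Unigram and bigram context features around position i, with
    boundary padding, in the style of pointwise taggers."""
    padded = ["<s>"] * n + syllables + ["</s>"] * n
    j = i + n  # index of the target syllable in the padded sequence
    feats = []
    for off in (-2, -1, 1, 2):            # neighbouring unigrams
        feats.append(f"u[{off}]={padded[j + off]}")
    for off in (-2, -1, 0, 1):            # neighbouring bigrams
        feats.append(f"b[{off}]={padded[j + off]}_{padded[j + off + 1]}")
    return feats

# Features for the middle syllable of a stripped Vietnamese phrase.
feats = ngram_features("toi di hoc".split(), 1)
print(feats[0])  # u[-2]=<s>
```

A classifier trained per syllable on such features then picks the diacritized form independently at each position, which is what makes the approach pointwise.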

  • Conference Article
  • Cited by: 9
  • 10.1109/ialp.2013.30
Machine Translation Approach for Vietnamese Diacritic Restoration
  • Aug 1, 2013
  • Thi Ngoc Diep Do + 3 more

Diacritic marks exist in many languages, such as French, German, Slovak, and Vietnamese. However, for various reasons they are sometimes omitted in writing, which may cause ambiguity for readers of non-diacritic text. The automatic diacritic restoration problem has been addressed in several languages using character-based, word-based, and pointwise approaches, among others. However, these approaches lean heavily on linguistic information and on the size of the training corpus, and they are sometimes language dependent. In this paper, a simple and effective restoration method is presented: machine translation is used as a new solution to the problem. The restoration method has been applied to Vietnamese and integrated into an Android application named VIVA (Vietnamese Voice Assistant) that reads out the content of incoming text messages on a mobile phone. Our experiments show that the proposed restoration method can recover diacritic marks with a 99.0% accuracy rate.
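The translation framing can be illustrated with a toy phrase table: undiacritized Vietnamese is treated as the "source language" and diacritized text as the "target." The table entries and the greedy decoder below are invented for the example; a real system would learn phrase probabilities from a parallel corpus.

```python
# Toy sketch of diacritic restoration as phrase-based translation
# (hypothetical phrase table, not VIVA's actual model or data).
PHRASE_TABLE = {
    ("co", "gang"): ("cố", "gắng"),
    ("hoc", "sinh"): ("học", "sinh"),
    ("di",): ("đi",),
}

def restore(words):
    """Greedy longest-match decoding over the toy phrase table;
    unknown words pass through unchanged."""
    out, i = [], 0
    while i < len(words):
        for span in (2, 1):
            key = tuple(words[i:i + span])
            if key in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[key])
                i += span
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(restore("hoc sinh co gang".split()))  # học sinh cố gắng
```

Multi-syllable phrases are matched before single syllables, which is how the translation view captures context that a per-character model would miss.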

  • Research Article
  • Cited by: 21
  • 10.1016/j.procs.2017.10.106
Automatic minimal diacritization of Arabic texts
  • Jan 1, 2017
  • Procedia Computer Science
  • Rehab Alnefaie + 1 more


  • Research Article
  • Cited by: 1
  • 10.37384/va.2020.16.102
Jaunu burtu veidošana ar diakritiskajām zīmēm latviešu valodas kā svešvalodas apguvēju tekstos [Creating new letters with diacritical marks in the texts of learners of Latvian as a foreign language]
  • May 6, 2020
  • Valodu apguve: problēmas un perspektīva : zinātnisko rakstu krājums = Language Acquisition: Problems and Perspective : conference proceedings
  • Inga Kaija

A Latvian learner corpus, "LaVA," is being built at the Institute of Mathematics and Computer Science, University of Latvia. The corpus includes texts written by beginner learners in their first two semesters of learning Latvian as a foreign language. The texts are written by hand and digitized afterwards, to reduce the issues that could be caused by having to learn not only writing itself but also the use of a foreign keyboard. One feature that cannot be digitized is the new letters created by adding diacritical marks in ways not used in the standard Latvian alphabet. Since learning the letters and diacritical marks of a language is an essential step in learning to write it, this study aims to find instances of such newly made letters and to discuss basic quantitative measures, in order to define hypotheses and areas of interest for further research into such usage. Altogether 322 texts were searched, and 175 examples were found. The number of examples found in 2nd-semester texts was less than half the number found in 1st-semester texts, but the percentage of texts containing examples was higher than expected: more than 33% in the 1st semester and almost 20% in the 2nd. This leads to the conclusion that the phenomenon is quite common but tends to diminish in the second semester. The corpus does not provide data on later semesters, so it cannot be predicted when such instances become a rare, individual feature rather than a common one. The average number of examples per text is not high, however. Counting only texts in which at least one example was found, the average number of examples per text is 2.136 in the 1st semester and 1.690 in the 2nd. Considering that the lowest possible value here is 1, this should not be considered high.
Therefore, using diacritical marks to make new letters, while a common feature of the Latvian interlanguage, could be characterized as casual rather than systemic. However, that does not exclude the possibility of certain patterns of usage. The data collected so far already show that there are some words (such as garšo, viņš, ļoti, četri) for which examples were found in more than one author's texts. Examples of unsuitable diacritical marks are also sometimes found next to letters for which those marks would be suitable. This should be explored more thoroughly using qualitative methods. The corpus keeps growing; its expected size upon completion is 1000 texts. When that size is reached, it would be useful to repeat the study and check whether the larger amount of data confirms the same assumptions. The larger sample would also allow a more detailed quantitative analysis of each letter, each diacritical mark, and the placement of the diacritical mark, as well as of the metadata collected for the corpus, such as the authors' gender, native language, and other spoken languages.

  • Research Article
  • Cited by: 4
  • 10.1038/s41539-024-00237-7
Investigating lexical categorization in reading based on joint diagnostic and training approaches for language learners
  • Apr 10, 2024
  • npj Science of Learning
  • Benjamin Gagl + 1 more

Efficient reading is essential for societal participation, so reading proficiency is a central educational goal. Here, we use an individualized diagnostics and training framework to investigate processes in visual word recognition and evaluate its usefulness for detecting training responders. We (i) motivated a training procedure based on the Lexical Categorization Model (LCM) to introduce the framework. The LCM describes pre-lexical orthographic processing implemented in the left-ventral occipital cortex and is vital to reading. German language learners trained their lexical categorization abilities while we monitored reading speed change. In three studies, most language learners increased their reading skills. Next, we (ii) estimated, for each word, the LCM-based features and assessed each reader’s lexical categorization capabilities. Finally, we (iii) explored machine learning procedures to find the optimal feature selection and regression model to predict the benefit of the lexical categorization training for each individual. The best-performing pipeline increased reading speed from 23% in the unselected group to 43% in the machine-selected group. This selection process strongly depended on parameters associated with the LCM. Thus, training in lexical categorization can increase reading skills, and accurate computational descriptions of brain functions that allow the motivation of a training procedure combined with machine learning can be powerful for individualized reading training procedures.

  • Research Article
  • 10.47119/ijrp1001021620223303
How reading speed is affected by prism correction in exophoric patients
  • May 16, 2022
  • International Journal of Research Publications
  • Avigail Hazut + 1 more

Reading is a crucial part of life, and good reading ability is necessary for daily tasks. People who have difficulty reading (for any reason) can find it very frustrating throughout the day and can suffer symptoms such as headaches and eye strain. Reading speed is one factor that can indicate reading ability. Among the many factors that affect reading speed is the condition of exophoria, in which the eyes tend to diverge and which usually presents with difficulty converging. When reading at a near distance the eyes must converge, making reading more difficult for people with exophoria. The eyes' deviation can be measured in different ways, yielding different amounts of prism needed to correct the exophoria and provide more comfort. In this study, two methods (Fixation Disparity and Maddox Rod) were used to determine how much prism should be prescribed, and reading speed was then tested with each prism value. The results showed no significant difference in reading speed between the two methods, although subjectively there appears to be a trend toward faster reading with prisms measured according to Fixation Disparity.

  • Research Article
  • Cited by: 15
  • 10.18051/univmed.2010.v29.78-83
Accommodative insufficiency as cause of asthenopia in computer-using students
  • Aug 26, 2010
  • Husnun Amalia + 2 more

To date, computer use is widespread throughout the world, and the associated ocular complaints are found in 75-90% of computer users. Symptoms frequently reported by computer users are eyestrain, tired eyes, irritation, redness, blurred vision, diplopia, burning of the eyes, and asthenopia (visual fatigue of the eyes). A cross-sectional study was conducted to determine the etiology of asthenopia in computer-using students. A questionnaire consisting of 15 items was used to assess symptoms experienced by the computer users. The ophthalmological examination comprised visual acuity, the Hirschberg test, near point of accommodation, amplitude of accommodation, near point of convergence, the cover test, and the alternate cover test. A total of 99 computer science students, of whom 69.7% had asthenopia, participated in the study. The symptoms significantly associated with asthenopia were visual fatigue (p=0.031), heaviness in the eye (p=0.002), blurred vision (p=0.001), and headache at the temples or the back of the head (p=0.000). Refractive asthenopia was found in 95.7% of all asthenopia patients, with accommodative insufficiency (AI) constituting the most frequent cause at 50.7%. The duration of computer use per day was not significantly associated with the prevalence of asthenopia (p=0.700). There was a high prevalence of asthenopia among computer science students, mostly caused by refractive asthenopia. Accommodation measurements should be performed more routinely and regularly, possibly as a screening measure, especially in computer users.

  • Research Article
  • Cited by: 3
  • 10.1145/3592603
The Impact of Arabic Diacritization on Word Embeddings
  • Jun 16, 2023
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Mohamed Abbache + 4 more

Word embeddings are used to represent words for text analysis. They play an essential role in many Natural Language Processing (NLP) studies and have contributed hugely to the extraordinary developments in the field in the last few years. In Arabic, diacritic marks are a vital feature for the readability and understandability of the language, yet current Arabic word embeddings are non-diacritized. In this article, we develop and compare word embedding models based on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We evaluate the models in four ways: clustering of the nearest words, morphological semantic analysis, part-of-speech tagging, and semantic analysis. For a better evaluation, we took on the challenge of creating three new datasets from scratch for the three downstream tasks. We conducted the downstream tasks with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model is better able to capture syntactic and semantic relations and to cluster words of similar categories. Overall, the diacritized model outperforms the non-diacritized model. We also obtained further interesting findings; for example, the morphological semantic analysis shows that as the number of target words increases, the advantages of the diacritized model become more pronounced, and that diacritic marks matter more in POS tagging than in the other tasks.
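The intuition for why diacritization helps can be sketched with toy vectors: in a non-diacritized space, the two readings of a homograph collapse into a single averaged vector, diluting its similarity to the neighbours of either reading. The words and vectors below are invented for illustration, not taken from the article's models.

```python
# Toy cosine-similarity comparison between a diacritized and a
# non-diacritized embedding space (hypothetical 2-d vectors).
import math

def cosine(u, v):
    """Standard cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Diacritized space: the two readings of علم keep distinct vectors.
diacritized = {"عِلْم": [0.9, 0.1], "عَلَم": [0.1, 0.9], "راية": [0.2, 0.8]}
# Non-diacritized space: both readings collapse into one averaged vector.
plain = {"علم": [0.5, 0.5], "راية": [0.2, 0.8]}

# "راية" (banner) stays close to the 'flag' reading only when
# diacritics are preserved; the collapsed vector sits in between.
print(cosine(diacritized["عَلَم"], diacritized["راية"]))
print(cosine(plain["علم"], plain["راية"]))
```

The first similarity comes out higher than the second, mirroring the article's finding that the diacritized model clusters words of similar categories more cleanly.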

  • Research Article
  • Cited by: 15
  • 10.1016/j.jestch.2018.09.002
Diacritic restoration of Turkish tweets with word2vec
  • Sep 18, 2018
  • Engineering Science and Technology, an International Journal
  • Zeynep Ozer + 2 more


  • Research Article
  • Cited by: 169
  • 10.1006/brln.1996.0043
Microstates in Language-Related Brain Potential Maps Show Noun–Verb Differences
  • May 1, 1996
  • Brain and Language
  • Thomas Koenig + 1 more


  • Abstract
  • Cited by: 1
  • 10.1016/j.bpj.2012.11.2942
Instances: Incorporating Computational Scientific Thinking Advances into Education & Science Courses
  • Jan 1, 2013
  • Biophysical Journal
  • Sofya Borinskaya + 7 more


  • Conference Article
  • 10.5339/qfarc.2018.ssahpd880
Building a Rich Lexical Resource for Standard Arabic
  • Jan 1, 2018
  • Wajdi Zaghouani + 2 more

