A Morphology-Driven Approach to NLP for a Low-Resource, Highly Complex Language
This paper presents a study investigating the optimization of well-known NLP algorithms and approaches for the Georgian language, known for its unique linguistic features. Standard methods effective for well-resourced languages, including pretrained models like mBERT and embedding methods such as FastText, may lack flexibility and efficiency when applied to Georgian, often resulting in increased complexity and effort. To address these challenges, we propose a novel approach that leverages Georgian’s rich morphology, including case inflections, extensive suffixation, verb agreement, and conjugation patterns. This method refines algorithms such as Minimum Editing Distance, Text Classification, Language Modeling, and word-level semantic similarity by incorporating language-specific characteristics. Our approach reduces data sparsity and model complexity while preserving accuracy. Although developed for Georgian, it is also relevant for other fusional and agglutinative languages and contributes to reducing dependence on large corpora, supporting the creation of more human-like text.
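The abstract does not spell out how Minimum Editing Distance is refined, so the following is only a minimal sketch under stated assumptions: a classic unit-cost Levenshtein distance, plus a hypothetical variant that compares words after stripping a suffix from a toy list. The suffix list, the transliterated example forms, and the helper names `strip_suffix` and `stem_distance` are illustrative assumptions, not the authors' method.

```python
from typing import Sequence

def levenshtein(a: str, b: str) -> int:
    """Classic minimum edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical toy suffix list (transliterated); a real system would use
# a proper morphological analyzer instead.
TOY_SUFFIXES = ("ebis", "eb", "is", "ma", "s", "i")

def strip_suffix(word: str, suffixes: Sequence[str] = TOY_SUFFIXES) -> str:
    """Remove the longest matching suffix, keeping at least two characters."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[:-len(suf)]
    return word

def stem_distance(a: str, b: str) -> int:
    """Edit distance computed on suffix-stripped forms (illustrative only)."""
    return levenshtein(strip_suffix(a), strip_suffix(b))

print(levenshtein("kali", "kalebis"))    # surface forms differ by inflection
print(stem_distance("kali", "kalebis"))  # compared after suffix stripping
```

Comparing stripped forms makes two inflections of the same stem look close even when their surface strings differ, which is one plausible way morphology could reduce sparsity in distance-based matching.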
- Research Article
25
- 10.1109/tasl.2011.2162323
- Jan 1, 2011
- IEEE Transactions on Audio, Speech, and Language Processing
This paper focuses on integrating linguistically motivated and statistically derived information into language modeling. We use discriminative language models (DLMs) as a complementary approach to the conventional $n$-gram language models to benefit from discriminatively trained parameter estimates for overlapping features. In our DLM approach, relevant information is encoded as features. Feature weights are discriminatively trained using training examples and used to re-rank the $N$-best hypotheses of the baseline automatic speech recognition (ASR) system. In addition to presenting a more complete picture of previously proposed feature sets that extract implicit information available at lexical and sub-lexical levels using both linguistic and statistical approaches, this paper attempts to incorporate semantic information in the form of topic sensitive features. We explore linguistic features to incorporate complex morphological and syntactic language characteristics of Turkish, an agglutinative language with rich morphology, into language modeling. We also apply DLMs to our sub-lexical-based ASR system where the vocabulary is composed of sub-lexical units. Obtaining implicit linguistic information from sub-lexical hypotheses is not as straightforward as word hypotheses, so we use statistical methods to derive useful information from sub-lexical units. DLMs with linguistic and statistical features yield significant, 0.8%–1.1% absolute, improvements over our baseline word-based and sub-word-based ASR systems. The explored features can be easily extended to DLM for other languages.
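The re-ranking step described above can be sketched as a linear model over sparse features. This is a minimal illustration, assuming each hypothesis comes with a baseline score and that features are simple word bigram counts; the function names, the weights, and the example hypotheses are hypothetical, and the paper's actual feature sets (morphological, syntactic, topic-sensitive) and training procedure are not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (text, baseline ASR score, higher is better)

def bigram_features(text: str) -> Dict[str, float]:
    """Count word bigrams, with sentence boundary markers."""
    toks = ["<s>"] + text.split() + ["</s>"]
    feats: Dict[str, float] = {}
    for a, b in zip(toks, toks[1:]):
        key = f"{a}_{b}"
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def rerank(nbest: List[Hypothesis],
           weights: Dict[str, float],
           featurize: Callable[[str], Dict[str, float]] = bigram_features,
           baseline_weight: float = 1.0) -> List[Hypothesis]:
    """Sort hypotheses by baseline score plus the weighted feature sum."""
    def score(hyp: Hypothesis) -> float:
        text, base = hyp
        feats = featurize(text)
        return baseline_weight * base + sum(
            weights.get(name, 0.0) * value for name, value in feats.items())
    return sorted(nbest, key=score, reverse=True)

# Toy example: the weights would normally be learned discriminatively.
nbest = [("i scream", -1.0), ("ice cream", -1.2)]
weights = {"ice_cream": 0.5}
print(rerank(nbest, weights)[0][0])
```

Because features may overlap (for instance a word bigram and a morpheme bigram covering the same span), a discriminatively trained weight vector can balance them in a way that generative n-gram estimates cannot.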
- Research Article
- 10.5755/j01.itc.46.4.18367
- Dec 14, 2017
- Information Technology And Control
Statistical language modeling involves techniques and procedures that assign probabilities to word sequences or, in other words, estimate the regularity of the language. This paper presents the basic characteristics of statistical language models, reviews their use in a large set of speech and language applications, explains their formal definition, and shows different types of language models. A detailed overview of n-gram and class-based models (as well as their combinations) is given chronologically, by type and complexity of the models, and with respect to their use in different NLP applications for different natural languages. The proposed experimental procedure compares three different types of statistical language models: n-gram models based on words, categorical models based on automatically determined categories, and categorical models based on POS tags. In the paper, we propose a language model for contemporary Croatian texts and a procedure for determining the best n-gram and the optimal number of categories, which leads to a significant decrease in language model perplexity, estimated on the Croatian News Agency (HINA) article corpus. Using different language models estimated from the HINA corpus, we show experimentally that models based on categories describe the natural language better than those based on words. These findings are applicable not only to Croatian but also to similar highly inflectional languages with rich morphology and non-mandatory sentence word order.
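Perplexity, the measure used above to compare models, can be illustrated with a small word bigram model. The sketch below assumes add-one smoothing and a closed vocabulary taken from the training data; both are simplifying assumptions chosen for brevity, not the smoothing or category models used in the paper.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Sentence = List[str]

def train_bigram(sentences: List[Sentence]) -> Tuple[Counter, Counter, Set[str]]:
    """Collect context counts, bigram counts, and the prediction vocabulary."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        contexts.update(toks[:-1])           # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, vocab

def perplexity(sentences: List[Sentence], contexts: Counter,
               bigrams: Counter, vocab: Set[str]) -> float:
    """Perplexity with add-one smoothing; assumes test words are in vocab."""
    v = len(vocab)
    log_prob, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(toks, toks[1:]):
            p = (bigrams[(prev, word)] + 1) / (contexts[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

model = train_bigram([["a", "b"], ["a", "b"]])
print(perplexity([["a", "b"]], *model))  # word order seen in training
print(perplexity([["b", "a"]], *model))  # unseen word order, higher perplexity
```

A lower value means the model finds the test text less surprising; replacing words with categories shrinks the event space, which is why category-based models can reduce perplexity on sparse, highly inflected data.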
- Research Article
4
- 10.1145/3663568
- Jun 21, 2024
- ACM Transactions on Asian and Low-Resource Language Information Processing
The relevance of the problem of automatic speech recognition lies in the lack of research for low-resource languages, stemming from limited training data and the necessity for new technologies to enhance efficiency and performance. The purpose of this work was to study the main aspects of integrated end-to-end speech recognition and the use of modern technologies in the natural language processing of agglutinative languages, including Kazakh. In this article, the study of language models was carried out using a combination of comparative, graphic, statistical, and analytical-synthetic methods. This article addresses automatic speech recognition (ASR) in agglutinative languages, particularly Kazakh, through a unified neural network model that integrates both acoustic and language modeling. Employing advanced techniques like connectionist temporal classification and attention mechanisms, the study focuses on effective speech-to-text transcription for languages with complex morphologies. Transfer learning from high-resource languages helps mitigate data scarcity in languages such as Kazakh, Kyrgyz, Uzbek, Turkish, and Azerbaijani. The research assesses model performance, underscores ASR challenges, and proposes advancements for these languages. It includes a comparative analysis of phonetic and word-formation features in agglutinative Turkic languages, using statistical data. The findings aid further research in linguistics and technology for enhancing speech recognition and synthesis, contributing to voice identification and automation processes.
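The abstract names connectionist temporal classification (CTC) without detailing the decoder. As a hedged illustration, the sketch below shows the standard CTC best-path collapse: take the most likely label per frame, merge consecutive repeats, then drop the blank symbol. The blank id of 0 and the input format (a list of per-frame label ids) are assumptions for this example; the article's actual model and decoding strategy may differ.

```python
from typing import List

def ctc_greedy_collapse(frame_labels: List[int], blank: int = 0) -> List[int]:
    """Merge repeated labels, then remove blanks (CTC best-path decoding)."""
    output: List[int] = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output

# Frames: blank, 1, 1, blank, 1, 2, 2, blank
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
```

The blank between the two runs of label 1 is what allows a genuinely doubled symbol to survive the merge step.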
- Conference Article
6
- 10.1109/seeda-cecnsm57760.2022.9932996
- Sep 23, 2022
Text categorization is a significant task in the research field of text mining, which has recently benefited from deep neural network algorithms and advanced learning techniques that extract language models from large textual corpora. These pre-trained language models are the main components of state-of-the-art solutions in many natural language processing and text-mining tasks. They can be very generic, trained on generic text corpora, or domain-specific when they employ large corpora from specific application domains (e.g. social media, news, sciences, etc.). When only generic language models are available, the overall performance in the task can be improved by adapting or fine-tuning the model used for the task, e.g. the classifier. Although multilingual language models are reported in the literature, such models are usually language-specific. This work presents a news article classifier, which has been trained on a small corpus and employs a Greek version of the BERT language model. Comparison with existing machine learning-based classifiers shows that the proposed method outperforms well-known methods in text classification. In addition, the proposed approach allows the continuous training of the classifier through user-provided feedback on falsely classified articles.
- Book Chapter
9
- 10.5772/6380
- Nov 1, 2008
Automatic Speech Recognition (ASR) systems utilize statistical acoustic and language models to find the most probable word sequence when the speech signal is given. Hidden Markov Models (HMMs) are used as acoustic models and language model probabilities are approximated using n-grams where the probability of a word is conditioned on n-1 previous words. The n-gram probabilities are estimated by Maximum Likelihood Estimation. One of the problems in n-gram language modeling is the data sparseness that results in non-robust probability estimates especially for rare and unseen n-grams. Therefore, smoothing is applied to produce better estimates for these n-grams. The traditional n-gram word language models are commonly used in state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. These systems result in reasonable recognition performances for languages such as English and French. For instance, broadcast news (BN) in English can now be recognized with about ten percent word error rate (WER) (NIST, 2000) which results in mostly quite understandable text. Some rare and new words may be missing in the vocabulary but the result has proven to be sufficient for many important applications, such as browsing and retrieval of recorded speech and information retrieval from the speech (Garofolo et al., 2000). However, LVCSR attempts with similar systems in agglutinative languages, such as Finnish, Estonian, Hungarian and Turkish so far have not resulted in comparable performance to the English systems. The main reason for this performance deterioration in those languages is their rich morphological structure. In agglutinative languages, words are formed mainly by concatenation of several suffixes to the roots and together with compounding and inflections this leads to millions of different, but still frequent word forms.
Therefore, it is practically impossible to build a word-based vocabulary for speech recognition in agglutinative languages that would cover all the relevant words. If words are used as language modeling units, there will be many out-of-vocabulary (OOV) words due to using limited vocabulary sizes in ASR systems. It was shown that with an optimized 60K lexicon
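The n-gram approximation and its maximum likelihood estimate mentioned in this chapter excerpt can be written in standard textbook notation. Here $C(\cdot)$ denotes a count in the training corpus and $w_i^j$ the word sequence $w_i,\dots,w_j$; this notation is an assumption of this sketch, not taken from the chapter.

```latex
P(w_1,\dots,w_T) \approx \prod_{i=1}^{T} P\left(w_i \mid w_{i-n+1}^{\,i-1}\right),
\qquad
P_{\mathrm{ML}}\left(w_i \mid w_{i-n+1}^{\,i-1}\right)
  = \frac{C\left(w_{i-n+1}^{\,i}\right)}{C\left(w_{i-n+1}^{\,i-1}\right)}
```

When an n-gram never occurs in training, the numerator is zero and the estimate assigns it zero probability, which is the sparseness problem that smoothing is meant to address and that becomes more severe as the number of distinct word forms grows.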
- Book Chapter
5
- 10.1007/978-3-030-86340-1_43
- Jan 1, 2021
Recently, pre-trained language models have achieved extraordinary performance on numerous benchmarks. By learning general language knowledge from a large pre-training corpus, a language model can be fitted to a specific downstream task with a relatively small amount of labeled training data in the fine-tuning stage. More remarkably, GPT-3, with 175 B parameters, performs well on specific tasks by leveraging natural-language prompts and a few demonstrations of the task. Inspired by the success of GPT-3, we want to know whether smaller language models still have a similar few-shot learning ability. Unlike the various delicately designed tasks in previous few-shot learning research, we take a more practical approach. We present a question-answering-based method that helps the language model better understand the text classification task by concatenating a label-related question to each candidate sentence. By leveraging the label-related language knowledge that the language model learned during pre-training, our QA model can outperform the traditional binary and multi-class classification approaches on both English and Chinese datasets. Afterward, we test our QA model in few-shot learning experiments on multiple pre-trained language models of different sizes, ranging from DistilBERT to RoBERTa-large. We are surprised to find that even DistilBERT, the smallest language model we tested, with only 66 M parameters, still shows clear few-shot learning ability. Moreover, RoBERTa-large, with 355 M parameters, achieves a remarkably high accuracy of 92.18% with only 100 labeled training examples. This result offers a practical guideline: when a new category of labeled data is needed, as few as 100 examples need to be labeled. Combined with an appropriate pre-trained model and classification algorithm, reliable classification results can then be obtained.
Even without any labeled training data, that is, under the zero-shot learning setup, the RoBERTa-large still achieves a solid accuracy rate of 84.84%. Our code is available at https://github.com/ZhangYunchenY/BetterFs.
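The idea of concatenating a label-related question to each candidate sentence can be sketched without any model at all. In the sketch below, the question template, the function names, and the keyword-overlap scorer are hypothetical stand-ins; in the paper a fine-tuned pre-trained model would play the role of `scorer`, and its exact prompt wording is not given in the abstract.

```python
from typing import Callable, List, Tuple

def build_qa_inputs(sentence: str, labels: List[str],
                    template: str = "Is this text about {label}?"
                    ) -> List[Tuple[str, str, str]]:
    """Pair the sentence with one label-related question per candidate label."""
    return [(template.format(label=label), sentence, label) for label in labels]

def predict(sentence: str, labels: List[str],
            scorer: Callable[[str, str], float]) -> str:
    """Pick the label whose (question, sentence) pair scores highest."""
    pairs = build_qa_inputs(sentence, labels)
    return max(pairs, key=lambda pair: scorer(pair[0], pair[1]))[2]

def toy_scorer(question: str, sentence: str) -> float:
    """Stand-in for a model: counts words shared by question and sentence."""
    def words(text: str) -> set:
        return {w.strip("?.,!").lower() for w in text.split()}
    return float(len(words(question) & words(sentence)))

print(predict("The sports team won the match", ["politics", "sports"], toy_scorer))
```

Framing each label as a yes/no question turns multi-class classification into repeated binary judgments, which lets the model reuse what it already knows about the label words themselves.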
- Research Article
19
- 10.1016/j.heliyon.2023.e15670
- May 1, 2023
- Heliyon
Comparison of pre-trained language models in terms of carbon emissions, time and accuracy in multi-label text classification using AutoML
- Research Article
7
- 10.1561/1500000107
- Apr 17, 2025
- Foundations and Trends® in Information Retrieval
Text classification stands as a cornerstone within the realm of Natural Language Processing (NLP), particularly when viewed through computer science and engineering. The past decade has seen deep learning revolutionize text classification, propelling advancements in text retrieval, categorization, information extraction, and summarization. The scholarly literature includes datasets, models, and evaluation criteria, with English being the predominant language of focus, despite studies involving Arabic, Chinese, Hindi, and others. The efficacy of text classification models relies heavily on their ability to capture intricate textual relationships and non-linear correlations, necessitating a comprehensive examination of the entire text classification pipeline. In the NLP domain, a plethora of text representation techniques and model architectures have emerged, with Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) at the forefront. These models are adept at transforming extensive textual data into meaningful vector representations encapsulating semantic information. The multidisciplinary nature of text classification, encompassing data mining, linguistics, and information retrieval, highlights the importance of collaborative research to advance the field. This work integrates traditional and contemporary text mining methodologies, fostering a holistic understanding of text classification. This monograph provides an in-depth exploration of the text classification pipeline, with a particular emphasis on evaluating the impact of each component on the overall performance of text classification models. The pipeline includes state-of-the-art datasets, text preprocessing techniques, text representation methods, classification models, evaluation metrics, and future trends. Each section examines these stages, presenting technical innovations and recent findings. 
The work assesses various classification strategies, offering comparative analyses, examples and case studies. These contributions extend beyond a typical survey, providing a detailed and insightful exploration of the field. In several Natural Language Processing (NLP) applications like news categorization, sentiment analysis, and subject labelling, text classification is a crucial and relevant task. The goal is to tag or label textual components like sentences, questions, paragraphs, and documents. In this era of massive information dissemination, manually processing and categorizing huge amounts of text data takes a relevant amount of time and effort.
- Book Chapter
- 10.4018/978-1-59904-849-9.ch015
- Jan 1, 2009
Accdrnig to rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer be at the rghit pclae. Tihs is bcuseae the human mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Unfortunately, computing systems are not yet as smart as the human mind. Over the last couple of years a significant number of researchers have been focussing on noisy text analytics. Noisy text data is found in informal settings (online chat, SMS, e-mails, message boards, among others) and in text produced through automated speech recognition or optical character recognition systems. Noise can degrade the performance of other information processing algorithms such as classification, clustering, summarization and information extraction. We will identify some of the key research areas for noisy text and give a brief overview of the state of the art. These areas are (i) classification of noisy text, (ii) correcting noisy text, and (iii) information extraction from noisy text. We will cover the first one in this chapter and the latter two in the next chapter. We define noise in text as any kind of difference in the surface form of an electronic text from the intended, correct or original text. We see such noisy text every day in various forms. Each of them has unique characteristics and hence requires special handling. We introduce some such forms of noisy textual data in this section. Online Noisy Documents: E-mails, chat logs, scrapbook entries, newsgroup postings, threads in discussion fora, blogs, etc., fall under this category. People are typically less careful about the sanity of written content in such informal modes of communication. These are characterized by frequent misspellings, commonly and not so commonly used abbreviations, incomplete sentences, missing punctuation and so on.
Noisy documents are almost always human interpretable, if not by everyone, at least by the intended readers. SMS: Short Message Services are becoming more and more common. Language usage over SMS text significantly differs from the standard form of the language. An urge towards shorter message length, facilitating faster typing, and the need for semantic clarity shape the structure of this non-standard form known as the texting language (Choudhury et al., 2007). Text Generated by ASR Devices: ASR is the process of converting a speech signal to a sequence of words. An ASR system takes a speech signal, such as monologs, discussions between people, telephonic conversations, etc., as input and produces a string of words, typically not demarcated by punctuation, as the transcript. An ASR system consists of an acoustic model, a language model and a decoding algorithm. The acoustic model is trained on speech data and their corresponding manual transcripts. The language model is trained on a large monolingual corpus. ASR systems convert audio into text by searching the acoustic model and language model space using the decoding algorithm. Most conversations at contact centers today between agents and customers are recorded. To do any processing of this data to obtain customer intelligence it is necessary to convert the audio into text. Text Generated by OCR Devices: Optical character recognition, or ‘OCR’, is a technology that allows digital images of typed or handwritten text to be transferred into an editable text document. It takes the picture of text and translates the text into Unicode or ASCII. For handwritten optical character recognition, the rate of recognition is 80% to 90% with clean handwriting. Call Logs in Contact Centers: Today’s contact centers (also known as call centers, BPOs, KPOs) produce huge amounts of unstructured data in the form of call logs apart from emails, call transcriptions, SMS, chat transcripts etc.
Agents are expected to summarize an interaction as soon as they are done with it and before picking up the next one. Because the agents work under immense time pressure, the summary logs are very poorly written and sometimes difficult even for humans to interpret. Analysis of such call logs is important to identify problem areas, agent performance, evolving problems etc. In this chapter we will be focussing on automatic classification of noisy text. Automatic text classification refers to segregating documents into different topics depending on content. For example, categorizing customer emails according to topics such as billing problem, address change, product enquiry etc. It has important applications in the field of email categorization, building and maintaining web directories e.g. DMoz, spam filter, automatic call and email routing in contact center, pornographic material filter and so on.
- Research Article
- 10.54097/hset.v7i.1094
- Aug 3, 2022
- Highlights in Science, Engineering and Technology
Considering the important role text classification plays in natural language processing tasks, improving the accuracy and efficiency of text classification has been a priority in recent work. In this paper, we focus on the latest text classification methods and sort them into three categories: embedding methods, language models, and various neural networks. We summarize the state of current research and the insufficiencies which may be directions for future study.
- Conference Article
2
- 10.1109/icsda.2009.5278368
- Aug 1, 2009
In this paper, we discuss a new language model that considers the characteristics of the agglutinative languages. We used Mongolian (a Cyrillic language system used in Mongolia) as an example from which to build the language model. We developed a Multi-class N-gram language model based on similar word clustering that focuses on the variable suffixes of a word in Mongolian. By applying our proposed language model, the resulting recognition system can improve performance by 6.85% compared with a conventional word N-gram when applying the ATRASR engine. We also confirmed that our new model will be convenient for rapid development of an ASR system for resource-deficient languages, especially for agglutinative languages such as Mongolian.
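For orientation, the standard class-based bigram factorization is shown below, where $c(w)$ is the class assigned to word $w$. This is the textbook single-class form, given only as an assumed point of reference; the multi-class N-gram proposed in the paper, which clusters words by their variable suffixes, may assign classes differently.

```latex
P(w_i \mid w_{i-1}) \approx P\big(c(w_i) \mid c(w_{i-1})\big)\, P\big(w_i \mid c(w_i)\big)
```

Because many inflected forms share a class, class-to-class transitions are observed far more often than individual word bigrams, which is why such models suit resource-deficient agglutinative languages.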
- Research Article
39
- 10.1016/j.sigpro.2005.12.002
- Jan 4, 2006
- Signal Processing
A unified language model for large vocabulary continuous speech recognition of Turkish
- Conference Article
149
- 10.3115/1075096.1075147
- Jan 1, 2003
We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance and that the algorithm can be used for many highly inflected languages, provided that one can create a small manually segmented corpus of the language of interest.
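The prefix*-stem-suffix* pattern can be illustrated by enumerating every split of a word that fits it and picking the highest-scoring one. In this minimal sketch, the affix lists, the Buckwalter-style transliterated example, and the morpheme log-probabilities are toy assumptions, and scoring uses independent per-morpheme scores as a simplification of the trigram morpheme language model the paper describes.

```python
from typing import Dict, Iterator, List, Set, Tuple

Segmentation = Tuple[List[str], str, List[str]]  # (prefixes, stem, suffixes)

def segmentations(word: str, prefixes: Set[str],
                  suffixes: Set[str]) -> List[Segmentation]:
    """Enumerate all prefix*-stem-suffix* splits with a non-empty stem."""
    def peel_prefixes(s: str) -> Iterator[Tuple[List[str], str]]:
        yield [], s
        for p in prefixes:
            if s.startswith(p) and len(s) > len(p):
                for rest, remainder in peel_prefixes(s[len(p):]):
                    yield [p] + rest, remainder

    def peel_suffixes(s: str) -> Iterator[Tuple[str, List[str]]]:
        yield s, []
        for suf in suffixes:
            if s.endswith(suf) and len(s) > len(suf):
                for remainder, rest in peel_suffixes(s[:-len(suf)]):
                    yield remainder, rest + [suf]

    return [(pre, stem, suf)
            for pre, remainder in peel_prefixes(word)
            for stem, suf in peel_suffixes(remainder)]

def best_segmentation(word: str, prefixes: Set[str], suffixes: Set[str],
                      logprob: Dict[str, float],
                      unknown: float = -20.0) -> Segmentation:
    """Choose the split with the highest summed morpheme log-probability."""
    def score(seg: Segmentation) -> float:
        pre, stem, suf = seg
        return sum(logprob.get(m, unknown) for m in pre + [stem] + suf)
    return max(segmentations(word, prefixes, suffixes), key=score)

# Toy data: "wAlktAb" split as w + Al + ktAb.
toy_logprob = {"w": -2.0, "Al": -1.5, "ktAb": -5.0}
print(best_segmentation("wAlktAb", {"w", "Al"}, {"At", "hm"}, toy_logprob))
```

Unknown stems receive a low default score, so a split is preferred only when it exposes a stem the model already knows, which mirrors why acquiring new stems from unsegmented text improves accuracy.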
- Conference Article
137
- 10.3115/1220575.1220601
- Jan 1, 2005
In recent years there has been growing interest in using neural networks for language modeling. In contrast to the well-known back-off n-gram language models, the neural network approach attempts to overcome the data sparseness problem by performing the estimation in a continuous space. This type of language model was mostly used for tasks for which only a very limited amount of in-domain training data is available. In this paper we present new algorithms to train a neural network language model on very large text corpora. This makes possible the use of the approach in domains where several hundred million words of text are available. The neural network language model is evaluated in a state-of-the-art real-time continuous speech recognizer for French Broadcast News. Word error reductions of 0.5% absolute are reported using only a very limited amount of additional processing time.
- Research Article
13
- 10.1080/18756891.2010.9727729
- Oct 1, 2010
- International Journal of Computational Intelligence Systems
In this paper, we investigate the document categorization task with statistical language models. Our study mainly focuses on categorization of documents in agglutinative languages. Due to the productive morphology of agglutinative languages, the number of word forms encountered in naturally occurring text is very large. From the language modeling perspective, a large vocabulary results in serious data sparseness problems. In order to cope with this drawback, previous studies in various application areas suggest modified language models based on different morphological units. It is reported that performance improvements can be achieved with these modified language models. In our document categorization experiments, we use standard word form based language models as well as other modified language models based on root words, root words and part-of-speech information, truncated word forms and character sequences. Additionally, to find an optimum parameter set, multiple tests are carried out with different language model orders and smoothing methods. Similar to previous studies on other tasks, our experimental results on categorization of Turkish documents reveal that applying linguistic preprocessing steps for language modeling provides improvements over standard language models to some extent. However, it is also observed that a similar level of improvement can be obtained with simpler character-level or truncated word form models, which are language independent.
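The two language-independent units mentioned above, truncated word forms and character sequences, are simple to produce. The sketch below is illustrative only: the truncation length of 5, the n-gram size of 3, and the Turkish example words are assumptions chosen for the demonstration, not settings reported in the paper.

```python
from typing import List

def truncate_tokens(tokens: List[str], k: int = 5) -> List[str]:
    """Keep only the first k characters of every token."""
    return [token[:k] for token in tokens]

def char_ngrams(text: str, n: int = 3) -> List[str]:
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Three inflected forms built on the same stem plus plural suffix.
forms = ["evlerimiz", "evlerimizde", "evlerimizden"]
print(len(set(forms)), "->", len(set(truncate_tokens(forms))))
print(char_ngrams("evler"))
```

Truncation collapses several surface forms into one unit without any morphological analyzer, which shrinks the vocabulary and eases the sparseness problem in much the same way root-based models do.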