Less-resourced Languages Research Articles

It is widely accepted that processing very large amounts of data with state-of-the-art Natural Language Processing (NLP) practices (i.e., Machine Learning (ML) and language-agnostic approaches) has resulted in a dramatic improvement in the speed and efficiency of systems and applications. However, these developments are accompanied by several challenges and difficulties that have been voiced in recent years. Specifically, with regard to NLP, the evident improvement in the speed and efficiency of systems and applications with GenAI also entails some aspects that may be problematic, especially where particular text types, languages and/or user groups are concerned. State-of-the-art NLP approaches involving the automated processing of vast amounts of data in GenAI are related to seven observed problematic aspects. The first two are (1) Underrepresentation and (2) Standardization. These result in (3) Barriers in Text Understanding, (4) Discouragement of HCI Usage for Special Text Types and/or User Groups, (5) Barriers in Accessing Information, (6) Likelihood of Errors and False Assumptions and (7) Difficulties in Error Detection and Recovery. An additional problem is posed by typical cases such as less-resourced languages (A), less experienced users (B) and less agile users (C).
A hybrid approach involving the re-introduction and integration of traditional concepts in state-of-the-art processing approaches, whether automatic or interactive, pursues three targets: (i) making more types of information accessible to more types of recipients and user groups, (ii) making more types of services accessible and user-friendly to more types of user groups, and (iii) making more types of feelings, opinions, voices and reactions visible from more types of user groups. Specifically, in the cases presented above, traditional and classical theories, principles and models are re-introduced and can be integrated into state-of-the-art data-driven approaches involving Machine Learning and neural networks, functioning as training data and seed data in NLP applications where user requirements and customization are of particular interest and importance. A hybrid approach may be considered a compromise between speed and correctness/user-friendliness in (types of) NLP applications where achieving this balance plays a crucial role. In other words, a hybrid approach and the examples presented here aim to prevent mechanisms from adopting human biases, ensuring fairness, socially responsible outcomes and responsible Social Media. They also aim to customize content for different linguistic and cultural groups, ensuring equitable information distribution. Here, we present characteristic examples of cases employing the re-introduction of four typical types of traditional concepts concerning classical theories, principles and models. These four classical theories, principles and models are not considered flawless either; however, they can be transformed into practical strategies that can be integrated into evaluation modules, neural networks and training data (including knowledge graphs) and dialogue design.
The proposed re-introduction of traditional concepts is not limited to the particular models, principles and theories presented here. The first example concerns the application of a classic principle from Theoretical Linguistics. The second example employs a model from the field of Linguistics and Translation. The third and fourth examples demonstrate the interdisciplinary application of models and theoretical frameworks from the fields of Linguistics-Cognitive Science and Linguistics-Psychology, respectively.

One of the most interesting aspects of natural language is how texts cohere, which involves the pragmatic or semantic relations that hold between clauses (addition, cause-effect, conditional, similarity), referred to as discourse relations. Identifying and classifying discourse relations is a pressing challenge for tasks such as text summarization, dialogue systems, and machine translation, which need information above the clause level. Despite the recent interest in discourse relations in well-known languages such as English, data and experiments are still needed for typologically different and less-resourced languages. We report the most comprehensive investigation of shallow discourse parsing in Turkish, focusing on two main sub-tasks: identification of discourse relation realization types and sense classification of explicit and implicit relations. The work is based on fine-tuning a pre-trained language model (BERT) as an encoder and classifying the encoded data with neural network-based classifiers. We first identify the discourse relation realization type that holds in a given text, if any. Then, we move on to the sense classification of the identified explicit and implicit relations. In addition to in-domain experiments on a held-out test set from the Turkish Discourse Bank (TDB 1.2), we also report the out-of-domain performance of our models in order to evaluate their generalization abilities, using the Turkish part of the TED Multilingual Discourse Bank. Finally, we explore the effect of multilingual data aggregation on the classification of relation realization type through a cross-lingual experiment. The results suggest that our models perform relatively well despite the limited size of the TDB 1.2 and that there are language-specific aspects of detecting the types of discourse relation realization.
We believe that the findings are important both in providing insights into the performance of modern language models in a typologically different language and in the low-resource scenario, given that the TDB 1.2 is 1/20th the size of the Penn Discourse TreeBank in terms of the total number of relations.
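The two-step procedure described in this abstract (first identify the realization type, then classify the sense of explicit and implicit relations) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `RelationPrediction`, `parse_pair`, and the `toy_*` functions are hypothetical, the English connective lexicon is an invented stand-in for the fine-tuned BERT encoder and neural classifiers, and the label strings are placeholders rather than the actual TDB 1.2 tag set.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A classifier maps a pair of text spans to a label string.
# In the described work this role is played by a fine-tuned BERT encoder
# plus a neural classifier; here it is only a hypothetical callable.
Classifier = Callable[[str, str], str]


@dataclass
class RelationPrediction:
    realization: str        # placeholder labels: "Explicit", "Implicit", "NoRel"
    sense: Optional[str]    # set only for explicit and implicit relations


def parse_pair(arg1: str, arg2: str,
               realization_clf: Classifier,
               explicit_sense_clf: Classifier,
               implicit_sense_clf: Classifier) -> RelationPrediction:
    """Step 1: realization type. Step 2: sense, routed by realization type."""
    realization = realization_clf(arg1, arg2)
    if realization == "Explicit":
        return RelationPrediction(realization, explicit_sense_clf(arg1, arg2))
    if realization == "Implicit":
        return RelationPrediction(realization, implicit_sense_clf(arg1, arg2))
    return RelationPrediction(realization, None)


# --- Toy stand-ins (assumptions for illustration only) ---------------------
TOY_CONNECTIVES = {"because": "cause-effect", "if": "conditional", "and": "addition"}


def _first_token(text: str) -> str:
    tokens = text.lower().split()
    return tokens[0] if tokens else ""


def toy_realization(arg1: str, arg2: str) -> str:
    # "Explicit" if the second span opens with a known connective.
    if _first_token(arg2) in TOY_CONNECTIVES:
        return "Explicit"
    return "Implicit" if arg1 and arg2 else "NoRel"


def toy_explicit_sense(arg1: str, arg2: str) -> str:
    return TOY_CONNECTIVES[_first_token(arg2)]


def toy_implicit_sense(arg1: str, arg2: str) -> str:
    # A real model would predict this; the constant is a placeholder.
    return "addition"


if __name__ == "__main__":
    print(parse_pair("The road was closed", "because it snowed",
                     toy_realization, toy_explicit_sense, toy_implicit_sense))
```

Keeping the sense classifiers separate for explicit and implicit relations mirrors the abstract's split into two sense-classification settings; whether the original system shares parameters between them is not stated here.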

Articles published on Less-resourced Languages

A Hybrid Ensemble Approach for Greek Text Classification Based on Multilingual Models

HindiSumm: A Hindi Abstractive Summarization benchmark dataset

GenAI and Socially Responsible AI in Natural Language Processing Applications: A Linguistic Perspective

DNN-based Multilingual Acoustic Modeling for Four Ethiopian Languages

Authorship Attribution in Less-Resourced Languages: A Hybrid Transformer Approach for Romanian

Natural Language Processing Tools for Romanian – Going Beyond a Low-Resource Language.

Advancing Italian biomedical information extraction with transformers-based models: Methodological insights and multicenter practical application

Adapting an English Corpus and a Question Answering System for Slovene

Toward a shallow discourse parser for Turkish

Localizing in-domain adaptation of transformer-based biomedical language models

NEW TRENDS IN LESS-RESOURCED LANGUAGE PROCESSING: CASE OF AMAZIGH LANGUAGE

Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform

Sequence-to-sequence pretraining for a less-resourced Slovenian language.

Quinductor: A multilingual data-driven method for generating reading-comprehension questions using Universal Dependencies

Using Multilingual Bidirectional Encoder Representations from Transformers on Medical Corpus for Kurdish Text Classification

Operation LiLi: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages

A Corpus for Evaluation of Cross Language Text Re-use Detection Systems

A new corpus annotation framework for Latin diachronic lexical semantics

Cross-lingual alignments of ELMo contextual embeddings

Enhancing deep neural networks with morphological information
