Low-resource Domains Research Articles

<p>With the rise of machine translation (MT) systems, it has become essential to evaluate the quality of the translations they produce. However, existing evaluation metrics designed for English and other European languages may not be suitable for Indic languages because of their complex morphology and syntax. Machine translation evaluation (MTE) is the process of assessing the quality and accuracy of machine-translated text by comparing the MT output with a reference translation to measure similarity and correctness. This study therefore evaluates and compares several lexical automatic MT evaluation metrics, namely BLEU, METEOR, and TER, to identify the most suitable metric for Indic languages, and contributes insights into metric choice for these languages. For the analysis, we selected parallel corpora from the low-resource domain for five language pairs: English-Hindi, English-Punjabi, English-Gujarati, English-Marathi, and English-Bengali. The five Indic languages all belong to the Indo-Aryan language family and are resource-poor. A comparison of state-of-the-art MT systems is presented and shows which translator performs better on these language pairs. Natural Language Toolkit (NLTK) tokenizers are used in the analysis of the experimental results, which were obtained with fully automatic MT evaluation metrics on two different datasets for each language pair. The study explores how effectively these metrics assess the quality of machine translation for various Indic languages. Additionally, the dataset and analysis should make future research in Indian MT evaluation easier.</p>
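
To make the idea of a lexical metric concrete, the following is a minimal sketch of a sentence-level BLEU-style score: clipped n-gram precision combined with a brevity penalty. It is an illustration only, not the setup used in the study. The function names are hypothetical, sentences are assumed to be pre-tokenized lists of strings (the abstract mentions NLTK tokenizers; plain whitespace splitting is used here to stay self-contained), and no smoothing is applied, so any zero n-gram precision yields a score of 0.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU-style score against a single reference (assumption:
    uniform weights over n-gram orders 1..max_n)."""
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)


reference = "the cat sat on the mat".split()
print(sentence_bleu("the cat sat on the mat".split(), reference))
print(sentence_bleu("the the the the".split(), reference, max_n=1))
```

Because the score relies on exact surface matches, an inflected form that differs from the reference by a single suffix counts as a complete miss, which is one reason such metrics may behave differently on morphologically rich languages.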

In low-resource domains, it is challenging to achieve good performance using existing machine learning methods due to a lack of training data and mixed data types (numeric and categorical). In particular, categorical variables with high cardinality pose a challenge to machine learning tasks such as classification and regression because training requires sufficiently many data points for the possible values of each variable. Since interpolation is not possible, nothing can be learned for values not seen in the training set. This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data. We propose to address this challenge by using embeddings for categorical variables that are based on an explicit representation of domain knowledge (KR), namely a hierarchy of concepts. Our approach is to 1. define a semantic similarity measure between categories, based on the hierarchy—we propose a purely hierarchy-based measure, but other similarity measures from the literature can be used—and 2. use that similarity measure to define a modified one-hot encoding. We propose two embedding schemes for single-valued and multi-valued categorical data. We perform experiments on three different use cases. We first compare existing similarity approaches with our approach on a word pair similarity use case. This is followed by creating word embeddings using different similarity approaches. A comparison with existing methods such as Google, Word2Vec and GloVe embeddings on several benchmarks shows better performance on concept categorisation tasks when using knowledge-based embeddings. The third use case uses a medical dataset to compare the performance of semantic-based embeddings and standard binary encodings. Significant improvement in performance of the downstream classification tasks is achieved by using semantic information.
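
The two steps described above can be sketched as follows. This is a hedged illustration, not the paper's method: the abstract does not specify its hierarchy-based measure, so a Wu-Palmer-style depth ratio is swapped in here, and the toy hierarchy, category names, and function names are all hypothetical. The modified one-hot encoding is assumed to replace the zeros of a standard one-hot vector with similarities to the other categories, and the multi-valued variant is assumed to take the element-wise maximum over the values present.

```python
# Hypothetical toy concept hierarchy: child -> parent (None marks the root).
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "flu": "infection",
    "pneumonia": "infection",
    "melanoma": "cancer",
}

# Categories that actually occur as values of the categorical variable.
CATEGORIES = ["flu", "pneumonia", "melanoma"]


def path_to_root(concept):
    """Return the list [concept, parent, ..., root]."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def similarity(a, b):
    """Wu-Palmer-style similarity (an assumed stand-in measure):
    2 * depth(lowest common ancestor) / (depth(a) + depth(b)),
    where the root has depth 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # First node on a's path that is also an ancestor of b (or b itself).
    lca = next(c for c in path_a if c in ancestors_b)
    return 2 * len(path_to_root(lca)) / (len(path_a) + len(path_b))


def encode_single(value, categories=CATEGORIES):
    """Modified one-hot: 1.0 at the value's own position,
    similarity scores elsewhere instead of zeros."""
    return [similarity(value, c) for c in categories]


def encode_multi(values, categories=CATEGORIES):
    """Multi-valued variant (assumption): element-wise maximum
    over the encodings of all values present."""
    return [max(similarity(v, c) for v in values) for c in categories]


print(encode_single("flu"))
print(encode_multi(["flu", "melanoma"]))
```

With an encoding of this kind, a category that never appears in the training set still receives a non-trivial vector through its position in the hierarchy, which is the property the abstract relies on when data is scarce.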

Articles published on Low-resource Domains

Curriculum meta-learning for zero-shot cross-lingual transfer

Cross-Domain Aspect-Based Sentiment Classification with a Pre-Training and Fine-Tuning Strategy for Low-Resource Domains

Natural Language Understanding for Navigation of Service Robots in Low-Resource Domains and Languages: Scenarios in Spanish and Nahuatl

Robust Few-Shot Named Entity Recognition with Boundary Discrimination and Correlation Purification

A comparative analysis of lexical-based automatic evaluation metrics for different Indic language pairs

Metadial: A Meta-learning Approach for Arabic Dialogue Generation

High-throughput and area-efficient architectures for image encryption using PRINCE cipher

Uncertainty Estimation and Reduction of Pre-trained Models for Text Regression

MTL-DAS: Automatic Text Summarization for Domain Adaptation.

Hierarchy-based semantic embeddings for single-valued & multi-valued categorical variables

MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification

Meta-Curriculum Learning for Domain Adaptation in Neural Machine Translation

Multilingual Automatic Term Extraction in Low-Resource Domains

Adaptation in Statistical Machine Translation for Low-resource Domains in English-Vietnamese Language

MetaMT, a Meta Learning Method Leveraging Multiple Domain Data for Low Resource Machine Translation.

Efficient hardware implementations of QTL cipher for RFID applications

A Graph Attention Model for Dictionary-Guided Named Entity Recognition

Multi-Domain Adaptation for Phrase-Based Statistical Machine Translation Based on the Feature Space Augmentation Method

