Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text

Abdullah M Alshanqiti, Ahmad B Alkhodre, Emad Nabil, Sami Albouq, Abdallah Namoun

doi:10.3390/app122010559


Open Access

https://doi.org/10.3390/app122010559


Journal: Applied Sciences
Publication Date: Oct 19, 2022
Citations: 2
License type: CC BY 4.0

Affiliation: Islamic University of Madinah, Cairo University

Abstract

Long unpunctuated texts containing complex sentences are a stumbling block for processing any low-resource language. Approaches that segment lengthy texts lacking proper punctuation into simple candidate sentences are therefore a vitally important preprocessing step in many hard-to-solve NLP applications. To this end, we propose a preprocessing solution for segmenting unpunctuated Arabic texts into potentially independent clauses. The solution consists of (1) a punctuation detection model built on top of a multilingual BERT-based model and (2) a set of generic linguistic rules for validating the resulting segmentation. Furthermore, we optimize the strategy for applying these linguistic rules using a greedy-like algorithm that we propose. We call the proposed solution PDTS (Punctuation Detector for Text Segmentation). For the evaluation, we show how PDTS can be effectively employed as a text tokenizer for unpunctuated documents (i.e., documents that mimic transcribed audio-to-text output). Experimental findings across two evaluation protocols (an ablation study and a human-based judgment) demonstrate that PDTS is practically effective in terms of both performance quality and computational cost. In particular, PDTS can reach an average F-measure of approximately 75%, a minimum improvement of roughly 13% over state-of-the-art competitor models.
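
The two-stage idea in the abstract (predict punctuation positions, then validate the resulting segments with rules applied greedily) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify the linguistic rules or the greedy algorithm, so the function names, the stand-in rule (a minimum segment length), and the left-to-right merging strategy are hypothetical and not the authors' method. The boundary positions would come from the BERT-based detector in practice; here they are passed in as a plain set of token indices. The last function shows how a boundary-level F-measure, the metric the abstract reports, is commonly computed.

```python
from typing import Callable, List, Sequence, Set


def split_at_boundaries(tokens: Sequence[str], boundary_after: Set[int]) -> List[List[str]]:
    """Cut the token stream after every index listed in boundary_after."""
    segments: List[List[str]] = []
    current: List[str] = []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i in boundary_after:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def greedy_validate(
    segments: Sequence[Sequence[str]],
    is_valid: Callable[[Sequence[str]], bool],
) -> List[List[str]]:
    """Hypothetical greedy pass: a segment the rule rejects absorbs the next one.

    This only stands in for the paper's rule-application strategy,
    which the abstract does not describe.
    """
    out: List[List[str]] = []
    for seg in segments:
        if out and not is_valid(out[-1]):
            out[-1] = out[-1] + list(seg)  # grow the rejected segment
        else:
            out.append(list(seg))
    # A rejected segment at the very end is merged into its left neighbour.
    if len(out) > 1 and not is_valid(out[-1]):
        last = out.pop()
        out[-1] = out[-1] + last
    return out


def boundary_f_measure(predicted: Set[int], gold: Set[int]) -> float:
    """Harmonic mean of precision and recall over boundary positions."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder tokens; real input would be unpunctuated Arabic words.
    tokens = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
    predicted_boundaries = {0, 3}  # assumed output of a punctuation detector
    raw = split_at_boundaries(tokens, predicted_boundaries)
    # Assumed rule: a candidate clause needs at least two tokens.
    print(greedy_validate(raw, lambda seg: len(seg) >= 2))
```

In this sketch the one-token segment produced by the spurious boundary after the first token is merged into the following segment, which is the kind of correction a rule-based validation stage is meant to provide.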

Similar Papers

The Application of Cooperative Learning Type of STAD in Mastery of Vocabulary to Making of Simple English Sentences in Elementry Student Grade 5 SDN 9 Kesiman
I Ketut Dharma Laksana ... Ida Ayu Ari Putri Kartini
International Journal of Research Publications | VOL. 57
30 Jul 2020

Chunker for Gujarati Language Using Hybrid Approach
Chetana Tailor ... Bankim Patel
02 Oct 2020

Fuzzy reasoning method in fuzzy rule-based systems with general and specific rules for function approximation
H Ishibuchi
01 Jan 1998

Improving Indic code-mixed to monolingual translation using Mixed Script Augmentation, Generation & Transfer Learning
Rajat Subhra Bhowmick ... Jayanta Paul
ACM Transactions on Asian and Low-Resource Language Information Processing
04 Jul 2023
