Abstract
Language identification and content detection are essential for effective digital communication and content moderation. While extensive research has focused on well-known, widely spoken languages, challenges persist for indigenous and resource-limited languages, especially among closely related languages such as Ethiopian languages. This article aims to simultaneously identify the language of a given text and detect its content; to achieve this, we propose a novel attention-based recurrent neural network framework. The proposed method uses an attention-embedded bidirectional LSTM architecture with two classifiers, one identifying the language of a given text and the other detecting the content within the text. The two classifiers share a common feature space before branching into their task-specific layers, both of which are assisted by an attention mechanism. The dataset covers five topics in six Ethiopian languages and consists of 22,624 sentences. Compared with classical NLP techniques, the proposed method shortens the data preprocessing steps. We evaluated the model using the accuracy metric, achieving 98.88% for language identification and 96.5% for text content detection. The dataset, source code, and pretrained model are available at https://github.com/bdu-birhanu/LID_TCD .
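The following is a minimal sketch of the kind of shared-encoder, dual-head architecture the abstract describes: a bidirectional LSTM providing a common feature space, with separate attention-assisted classification heads for language identification and content detection. The class name, hyperparameters, and the additive attention formulation are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class SharedBiLSTMMultiTask(nn.Module):
    """Shared BiLSTM encoder with two attention-assisted heads:
    one for language identification, one for content detection.
    All dimensions and head sizes here are illustrative placeholders."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128,
                 n_languages=6, n_topics=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Shared feature space: one BiLSTM over the input token sequence.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Task-specific attention scorers, one per branch.
        self.attn_lang = nn.Linear(2 * hidden_dim, 1)
        self.attn_topic = nn.Linear(2 * hidden_dim, 1)
        # Task-specific classifier layers.
        self.lang_head = nn.Linear(2 * hidden_dim, n_languages)
        self.topic_head = nn.Linear(2 * hidden_dim, n_topics)

    def _attend(self, states, attn_layer):
        # states: (batch, seq_len, 2 * hidden_dim)
        weights = torch.softmax(attn_layer(states), dim=1)  # (batch, seq, 1)
        return (weights * states).sum(dim=1)                # (batch, 2*hidden)

    def forward(self, token_ids):
        states, _ = self.bilstm(self.embedding(token_ids))
        # Each head attends over the shared states with its own weights.
        lang_ctx = self._attend(states, self.attn_lang)
        topic_ctx = self._attend(states, self.attn_topic)
        return self.lang_head(lang_ctx), self.topic_head(topic_ctx)
```

In this sketch the two heads would be trained jointly, e.g. by summing a cross-entropy loss over each output (`loss = ce(lang_logits, lang_y) + ce(topic_logits, topic_y)`), so the shared encoder learns features useful for both tasks.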