Abstract

Language identification and content detection are essential for effective digital communication and content moderation. While extensive research has focused on well-known, widely spoken languages, challenges persist for indigenous and resource-limited languages, especially among closely related languages such as the Ethiopian languages. This article aims to simultaneously identify the language of a given text and detect its content; to achieve this, we propose a novel attention-based recurrent neural network framework. The proposed method is an attention-augmented bidirectional LSTM (BiLSTM) architecture with two classifiers, one identifying the language of a given text and the other detecting the content within the text. The two classifiers share a common feature space before branching into their task-specific layers, each assisted by an attention mechanism. The dataset covers five topics in six Ethiopian languages and consists of nearly 22,624 sentences. Compared with classical NLP techniques, the proposed method shortens the data preprocessing steps. We evaluated model performance using the accuracy metric, achieving 98.88% for language identification and 96.5% for text content detection. The dataset, source code, and pretrained model are available at https://github.com/bdu-birhanu/LID_TCD.
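To make the described architecture concrete, the following is a minimal sketch of a shared-BiLSTM multi-task model with per-branch attention, written in PyTorch. The layer sizes, the additive-attention formulation, and all names here are illustrative assumptions, not the authors' exact configuration; see the linked repository for the actual implementation.

```python
# Sketch only: a shared BiLSTM feature space that branches into two
# attention-assisted classifier heads (language and topic/content).
# Dimensions and the attention scorer are assumed, not taken from the paper.
import torch
import torch.nn as nn

class SharedBiLSTMMultiTask(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_languages=6, num_topics=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Shared feature extractor: one BiLSTM feeds both task branches.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # One attention scorer per task-specific branch.
        self.attn_lang = nn.Linear(2 * hidden_dim, 1)
        self.attn_topic = nn.Linear(2 * hidden_dim, 1)
        # Task-specific classifier heads.
        self.lang_head = nn.Linear(2 * hidden_dim, num_languages)
        self.topic_head = nn.Linear(2 * hidden_dim, num_topics)

    def _attend(self, scorer, states):
        # states: (batch, seq_len, 2*hidden_dim); softmax over the time axis
        weights = torch.softmax(scorer(states), dim=1)   # (batch, seq_len, 1)
        return (weights * states).sum(dim=1)             # (batch, 2*hidden_dim)

    def forward(self, token_ids):
        states, _ = self.bilstm(self.embedding(token_ids))
        lang_logits = self.lang_head(self._attend(self.attn_lang, states))
        topic_logits = self.topic_head(self._attend(self.attn_topic, states))
        return lang_logits, topic_logits
```

In a setup like this, training would typically minimize the sum of two cross-entropy losses, one per head, so the shared BiLSTM learns features useful for both language identification and content detection.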
