Abstract

As a result of globalization and improved quality of education, a significant percentage of the population in Arab countries has become bilingual or multilingual. This has increased the frequency of code-switching and code-mixing among Arabs in daily communication. Consequently, a huge amount of Code-Mixed (CM) content can be found on social media platforms. Such data could be analyzed and used in different Natural Language Processing (NLP) tasks to tackle the challenges arising from this multilingual phenomenon. Named-Entity Recognition (NER), the process of identifying named entities in text, is a core task in many NLP systems. However, there is a lack of annotated CM data and resources for this task. This work aims at collecting and building the first annotated CM Arabic-English corpus for NER. Furthermore, we constructed a baseline NER system for Arabic-English CM text using deep neural networks and word embeddings. We also investigated the effect of different types of classical and contextual pre-trained word embeddings on our system. The best-performing NER system achieved an F1-score of 77.69% by combining classical and contextual word embeddings.

