Named Entity Recognition (NER) is a significant information extraction task, as it is an important component of many natural language processing applications such as Information Retrieval, Question Answering, and Speech Recognition. The complexity and morphological richness of the Arabic language are the main reasons why most existing Arabic NER systems rely heavily on hand-crafted feature engineering. In this paper, we propose to augment the existing LSTM neural tagging model for Arabic NER with a Convolutional Neural Network (CNN) that extracts relevant character-level features. By operating at the character level, the proposed model is able to handle out-of-vocabulary words. Our results show that the character CNN outperforms the previously used character-level Bi-directional Long Short-Term Memory network (BiLSTM) in many settings. Moreover, our observations indicate that CNNs tend to perform better than BiLSTMs on relatively longer tokens. In addition, we compare four different pre-trained word vector models for Arabic NER; the results show that a Skip-Gram Word2vec model, pre-trained on a subset of the Arabic Gigaword corpus, is generally sufficient to obtain acceptable Arabic NER performance.
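The character-CNN idea above can be sketched in a few lines: embed a token's characters, slide a convolution window over the embedding sequence, and max-pool over time so every token, however long or out-of-vocabulary, yields a fixed-size feature vector. The sketch below uses NumPy with random weights and hypothetical sizes (vocabulary, embedding dimension, filter count, and window width are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters (illustrative only, not from the paper):
# character vocabulary size, character embedding dim, number of CNN
# filters, and convolution window width.
CHAR_VOCAB, EMB_DIM, N_FILTERS, WIDTH = 40, 8, 16, 3

char_emb = rng.normal(size=(CHAR_VOCAB, EMB_DIM))      # character embedding table
filters = rng.normal(size=(N_FILTERS, WIDTH * EMB_DIM))  # flattened conv filters

def char_cnn_features(char_ids):
    """Map one token's character ids to a fixed-size feature vector.

    A 1D convolution over the character embeddings, a ReLU, then
    max-pooling over time, so tokens of any length (including
    out-of-vocabulary words) produce a vector of length N_FILTERS.
    """
    x = char_emb[char_ids]                              # (token_len, EMB_DIM)
    if len(x) < WIDTH:                                  # pad very short tokens
        x = np.vstack([x, np.zeros((WIDTH - len(x), EMB_DIM))])
    # Each row is one window of WIDTH consecutive character embeddings.
    windows = np.stack([x[i:i + WIDTH].ravel()
                        for i in range(len(x) - WIDTH + 1)])
    conv = np.maximum(windows @ filters.T, 0.0)         # ReLU, (n_windows, N_FILTERS)
    return conv.max(axis=0)                             # max-pool over time

feat = char_cnn_features([1, 5, 7, 2, 9])               # shape (N_FILTERS,)
```

In the full tagger, such a vector would be concatenated with the token's pre-trained word embedding before being fed to the word-level BiLSTM.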