Named entity recognition (NER) is an information extraction subtask that aims to identify named entities in unstructured text and classify them into predefined categories such as the names of people, organizations, and locations. Machine learning approaches such as the hidden Markov model (HMM), as well as hybrid methods, have recently been widely used for NER. To the best of our knowledge, no publicly available data sets exist for machine learning-based NER in Persian. Because of the inherent weaknesses of HMMs, in this paper we combine a hidden Markov model with a rule-based method to recognize named entities in Persian texts. Combining the rule-based method with the machine learning method yields highly accurate recognition. The machine learning component of the proposed system uses an HMM with the Viterbi algorithm, and the rule-based component employs a set of lexical resources and pattern bases to recognize named entities, including the names of people, locations, and organizations. For this study, we annotated our own training and test data sets for use in the corresponding phases. On an annotated test corpus of 32,606 tokens, our hybrid approach achieves 89.73% precision, 82.44% recall, and an F-measure of 85.93% on Persian text.
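As an illustration of the HMM decoding step mentioned above, the following is a minimal sketch of Viterbi decoding over entity tags. It is not the authors' implementation: the tag set, the transliterated toy tokens, every probability value, and the `unk` fallback for unseen words are hypothetical and chosen only to make the example runnable.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `tokens` under an HMM.

    `unk` is an assumed fallback emission probability for unseen words.
    """
    if not tokens:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], unk))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s].get(tokens[t], unk))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model (all values invented for illustration).
states = ["PER", "LOC", "O"]
start_p = {"PER": 0.3, "LOC": 0.2, "O": 0.5}
trans_p = {
    "PER": {"PER": 0.4, "LOC": 0.1, "O": 0.5},
    "LOC": {"PER": 0.1, "LOC": 0.3, "O": 0.6},
    "O": {"PER": 0.2, "LOC": 0.2, "O": 0.6},
}
emit_p = {
    "PER": {"ali": 0.5, "reza": 0.4},
    "LOC": {"tehran": 0.8},
    "O": {"lives": 0.4, "in": 0.5},
}
tokens = ["ali", "lives", "in", "tehran"]
tags = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, tags)))
```

In a hybrid system of the kind described, the tags produced by such a decoder could then be checked or overridden by lexicon and pattern rules; how the two components are actually combined is not specified in this abstract.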