ABSTRACT Extracting geographic information from text supports both geographic information science research and a range of practical applications. Extracting fine-grained, complex location descriptions from Chinese text remains challenging, however, because Chinese place names are flexibly constructed and lack clear boundaries. In this paper, we propose a regularity-guided and boundary-aware architecture for toponym recognition from Chinese text (RB-TRNet). It recognizes complex place names by learning the internal compositional patterns of various place name constructions and by automatically perceiving the boundaries and types of Chinese place name entities. First, RoBERTa is used to represent the input text containing Chinese place names. Then, the text representation sequences are fed into two BiLSTM layers: one processed sequence enters the toponym regularity-guided module, which captures the composition patterns of Chinese place name entities, and the other enters the toponym regularity-discriminant module, which reduces excessive reliance on contextual information when recognizing those patterns. Additionally, an orthogonal space is established after the BiLSTM network to help the two modules learn different rule features. Finally, after joint optimization training of the three modules, the toponym regularity perception module is used to predict the Chinese place name entities. To validate the approach, we established a new complex Chinese place name text (CCPNT) dataset for complex Chinese place name recognition. The CCPNT dataset, along with three other public datasets, was used for performance evaluation. Compared with eight baseline models, RB-TRNet exhibited state-of-the-art performance in recognizing complex Chinese place names.
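The abstract does not state how the orthogonal space is enforced. As a rough illustration only, one common way to push two feature branches toward different information is to penalize the squared Frobenius norm of the cross-product of their outputs. The sketch below assumes that form; the function name `orthogonality_penalty` and the plain-list matrices standing in for the two BiLSTM outputs are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]  # shape (seq_len, dim): one feature vector per token


def orthogonality_penalty(h_a: Matrix, h_b: Matrix) -> float:
    """Squared Frobenius norm of h_a^T h_b (an assumed form of the constraint).

    h_a and h_b stand in for the per-token outputs of the two BiLSTM
    branches. The penalty is zero when every feature column of h_a is
    orthogonal to every feature column of h_b across the sequence.
    """
    if len(h_a) != len(h_b):
        raise ValueError("both branches must cover the same tokens")
    dim_a, dim_b = len(h_a[0]), len(h_b[0])
    total = 0.0
    for i in range(dim_a):
        for j in range(dim_b):
            # Entry (i, j) of h_a^T h_b: feature i of branch A against
            # feature j of branch B, summed over all tokens.
            dot = sum(row_a[i] * row_b[j] for row_a, row_b in zip(h_a, h_b))
            total += dot * dot
    return total


# Branches with non-overlapping features incur no penalty.
disjoint = orthogonality_penalty([[1.0, 0.0], [0.0, 0.0]],
                                 [[0.0, 0.0], [0.0, 1.0]])
# Identical branches are penalized.
identical = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]],
                                  [[1.0, 0.0], [0.0, 1.0]])
```

In a joint training setup, a term like this would typically be added to the two modules' task losses with a weighting coefficient, so that the optimizer trades off recognition accuracy against feature overlap between the branches.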