Named entity recognition (NER) in the field of public interest litigation can assist prosecutors in handling cases and supply the specific entities needed when drafting legal documents. Previously, context-free deep learning models were used to capture semantics, producing static word vectors that ignore context. Moreover, such methods rely on word segmentation and cannot avoid the error propagation caused by inaccurate segmentation, which poses great challenges for Chinese NER. To tackle these issues, an entity recognition method based on pretraining is proposed. First, building on the basic entities, three legal ontologies, NERP, NERCGP, and NERFPP, are developed to expand the named entity recognition corpus in the judicial field. Second, a variant of the pretrained model BERT (Bidirectional Encoder Representations from Transformers) called BERT-WWM-EXT (whole-word masking, extended data) is introduced to capture character-level representations and bidirectional contextual features, which effectively addresses the entity boundary segmentation problem. Then, to further improve recognition performance, the general knowledge learned by the pretrained model is passed to a downstream BiLSTM (bidirectional long short-term memory) network, and a CRF (conditional random field) layer is placed at the end of the architecture to constrain label transitions. Finally, experimental results show that the proposed method outperforms existing methods, reaching F1 scores of 96% and 90% on NER and NERP entities, respectively.
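The F1 scores reported above are computed at the entity level: a predicted entity counts as correct only if its span and type both match the gold annotation exactly. A minimal pure-Python sketch of this metric, assuming BIO tagging (the tag sequences below are hypothetical, not from the paper's corpus):

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence (end is exclusive)."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open span
                entities.append((start, i, etype))
            start, etype = i, tag[2:]      # open a new span
        elif tag.startswith("I-") and start is not None and etype == tag[2:]:
            continue                       # extend the current span
        else:                              # "O" or an ill-formed tag ends the span
            if start is not None:
                entities.append((start, i, etype))
            start, etype = None, None
    if start is not None:                  # flush a span that runs to the end
        entities.append((start, len(tags), etype))
    return entities

def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: exact match on span boundaries and entity type."""
    gold = set(extract_entities(gold_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, if the model truncates the last token of an ORG entity, that entity is scored as wrong even though most of its tokens were labeled correctly, which is why entity-level F1 is stricter than token-level accuracy.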