Abstract
Named entity recognition (NER) is an essential preprocessing step in legal AI, and subsequent processing stages depend on it. In this paper, we present a dataset for training natural language understanding models in the legal domain. The dataset is built by identifying and defining a comprehensive set of legal entities that goes beyond the commonly used person, organization, and location types. These additional entities allow annotators to tag a wide variety of legal documents effectively. We tried several text annotation tools and selected the one that proved most effective for this study. The completed annotations are stored in JavaScript Object Notation (JSON) format, which keeps the data readable and easy to manipulate. The resulting dataset comprises approximately thirty documents and five thousand sentences. These data are then used to train a pre-trained spaCy pipeline to predict legal named entities accurately. Further fine-tuning of pre-trained models on legal texts may improve the accuracy of legal named entity recognition.