The ever-growing volume of digital information calls for innovative search strategies that retrieve the necessary data efficiently and economically. The urgency of the problem is underscored by the increasing complexity of information landscapes and the need for fast data-extraction methodologies. In natural language processing, named entity recognition (NER) is an essential task for extracting useful information from unstructured text and classifying it into predefined categories. Nevertheless, conventional methods frequently struggle when labeled data is scarce, which poses challenges in real-world scenarios where obtaining substantial annotated datasets is difficult or costly. To address the problem of domain-specific NER with limited data, this work investigates NER techniques that can overcome these constraints by continually learning from newly collected data on top of pre-trained models. Several techniques are also applied to make the most of the limited labeled data, such as active learning, exploiting unlabeled data, and integrating domain knowledge. Using domain-specific datasets with different levels of annotation scarcity, we investigate the fine-tuning of pre-trained models, including transformer-based (TRF) and Tok2Vec (token-to-vector) models. The results show that, in general, expanding the volume of training data improves most models' NER performance, particularly for models with sufficient learning capacity. Depending on the model architecture and the complexity of the entity label being learned, the effect of additional data on performance can vary. After the training data is increased by 20%, the LT2V model shows the most balanced gains overall, improving accuracy by 11% to recognize 73% of entities while maintaining processing speed. Meanwhile, the transformer-based (TRF) model, with consistent processing speed and the highest F1-score, shows promise for effective learning from less data, achieving 74% successful predictions and a 7% increase in performance after the training data is expanded to 81%. Our results pave the way for more resilient and efficient NER systems suited to specialized domains and advance the field of domain-specific NER with sparse data. We also shed light on the relative merits of various NER models and training strategies, and offer perspectives for future research.
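The abstract does not name the tooling, but the TRF and Tok2Vec component names match spaCy's pipeline terminology. The sketch below is therefore an illustrative assumption of the fine-tuning step, not the paper's actual setup: it updates the NER component of a pre-trained spaCy pipeline on a small domain-specific dataset. The model name, the FACILITY label, and the single training example are placeholders.

```python
import random
import spacy
from spacy.training import Example

# Load a pre-trained pipeline; the small/medium/large English models use a
# CNN/tok2vec architecture, while "en_core_web_trf" is transformer-based.
nlp = spacy.load("en_core_web_sm")

# Hypothetical domain-specific annotations:
# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Order 42 ships from the Hamburg depot.",
     {"entities": [(24, 37, "FACILITY")]}),
]

# Register any new entity labels with the existing NER component.
ner = nlp.get_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

# Update only the NER component, keeping the rest of the pipeline frozen.
with nlp.select_pipes(enable=["ner"]):
    optimizer = nlp.resume_training()
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, ann in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), ann)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(f"epoch {epoch}: losses={losses}")
```

In the active-learning setting the abstract describes, a loop of this kind would be repeated, with the most uncertain unlabeled examples selected for annotation before each update round.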