Similar Papers
  • Research Article
  • 10.3389/frai.2025.1579998
Dynamic taxonomy generation for future skills identification using a named entity recognition and relation extraction pipeline
  • Jul 2, 2025
  • Frontiers in Artificial Intelligence
  • Luis Jose Gonzalez-Gomez + 6 more

Introduction: The labor market is rapidly evolving, leading to a mismatch between existing Knowledge, Skills, and Abilities (KSAs) and future occupational requirements. Reports from organizations like the World Economic Forum and the OECD emphasize the need for dynamic skill identification. This paper introduces a novel system for constructing a dynamic taxonomy using Natural Language Processing (NLP) techniques, specifically Named Entity Recognition (NER) and Relation Extraction (RE), to identify and predict future skills. By leveraging machine learning models, this taxonomy aims to bridge the gap between current skills and future demands, contributing to educational and professional development.

Methods: To achieve this, an NLP-based architecture was developed using a combination of text preprocessing, NER, and RE models. The NER model identifies and categorizes KSAs and occupations from a corpus of labor market reports, while the RE model establishes the relationships between these entities. A custom pipeline was used for PDF text extraction, tokenization, and lemmatization to standardize the data. The models were trained and evaluated using over 1,700 annotated documents, with the training process optimized for both entity recognition and relationship prediction accuracy.

Results: The NER and RE models demonstrated promising performance. The NER model achieved a best micro-averaged F1-score of 65.38% in identifying occupations, skills, and knowledge entities. The RE model subsequently achieved a best micro-F1 score of 82.2% for accurately classifying semantic relationships between these entities at epoch 1,009. The taxonomy generated from these models effectively identified emerging skills and occupations, offering insights into future workforce requirements. Visualizations of the taxonomy were created using various graph structures, demonstrating its applicability across multiple sectors. The results indicate that this system can dynamically update and adapt to changes in skill demand over time.

Discussion: The dynamic taxonomy model not only provides real-time updates on current competencies but also predicts emerging skill trends, offering a valuable tool for workforce planning. The high recall rates in NER suggest strong entity recognition capabilities, though precision improvements are needed to reduce false positives. Limitations include the need for a larger corpus and sector-specific models. Future work will focus on expanding the corpus, improving model accuracy, and incorporating expert feedback to further refine the taxonomy.
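The NER-then-RE flow described above lends itself to a compact sketch. The following is an illustration only, with toy entity lookups standing in for the paper's trained NER and RE models; the entity lists, relation label, and function names are all hypothetical:

```python
# Minimal sketch of an NER -> RE -> taxonomy pipeline.
# Toy lookups stand in for the trained models described above.

SKILLS = {"machine learning", "data analysis"}      # hypothetical KSA list
OCCUPATIONS = {"data scientist", "analyst"}         # hypothetical occupation list

def ner(sentence: str):
    """Tag known entity strings found in the sentence."""
    text = sentence.lower()
    ents = [(e, "SKILL") for e in SKILLS if e in text]
    ents += [(e, "OCCUPATION") for e in OCCUPATIONS if e in text]
    return ents

def relation_extraction(entities):
    """Link every occupation to every co-occurring skill (stand-in for the RE model)."""
    skills = [e for e, t in entities if t == "SKILL"]
    occs = [e for e, t in entities if t == "OCCUPATION"]
    return [(o, "requires", s) for o in occs for s in skills]

def build_taxonomy(corpus):
    """Aggregate extracted triples into an occupation -> skills taxonomy."""
    taxonomy = {}
    for sentence in corpus:
        for occ, _, skill in relation_extraction(ner(sentence)):
            taxonomy.setdefault(occ, set()).add(skill)
    return taxonomy

corpus = ["A data scientist needs machine learning.",
          "Every analyst should master data analysis."]
print(build_taxonomy(corpus))
```

In the actual system, `ner` and `relation_extraction` would be the trained models, and the aggregation step would run over the full corpus of labor market reports.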

  • Research Article
  • Cited: 27
  • 10.3390/app12030976
BERT-Based Transfer-Learning Approach for Nested Named-Entity Recognition Using Joint Labeling
  • Jan 18, 2022
  • Applied Sciences
  • Ankit Agrawal + 5 more

Named-entity recognition (NER) is one of the primary components in various natural language processing tasks such as relation extraction, information retrieval, question answering, etc. The majority of the research work deals with flat entities. However, it was observed that the entities were often embedded within other entities. Most of the current state-of-the-art models deal with the problem of embedded/nested entity recognition with very complex neural network architectures. In this research work, we proposed to solve the problem of nested named-entity recognition using the transfer-learning approach. For this purpose, different variants of fine-tuned, pretrained, BERT-based language models were used for the problem using the joint-labeling modeling technique. Two nested named-entity-recognition datasets, i.e., GENIA and GermEval 2014, were used for the experiment, with four and two levels of annotation, respectively. Also, the experiments were performed on the JNLPBA dataset, which has flat annotation. The performance of the above models was measured using F1-score metrics, commonly used as the standard metrics to evaluate the performance of named-entity-recognition models. In addition, the performance of the proposed approach was compared with the conditional random field and the Bi-LSTM-CRF model. It was found that the fine-tuned, pretrained, BERT-based models outperformed the other models significantly without requiring any external resources or feature extraction. The results of the proposed models were compared with various other existing approaches. The best-performing BERT-based model achieved F1-scores of 74.38, 85.29, and 80.68 for the GENIA, GermEval 2014, and JNLPBA datasets, respectively. It was found that the transfer learning (i.e., pretrained BERT models after fine-tuning) based approach for the nested named-entity-recognition task could perform well and is a more generalized approach in comparison to many of the existing approaches.
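The joint-labeling technique named above can be illustrated compactly: per-token tags from each nesting level are concatenated into one composite tag, so a single flat sequence tagger can learn nested structure. A minimal sketch (the tag names are illustrative, not GENIA's actual label set):

```python
# Sketch of joint labeling for nested NER: tags from several annotation
# levels are merged into one composite flat tag per token, and can be
# split back into per-level sequences after prediction.

def join_labels(levels):
    """levels: list of per-level tag sequences, all the same length."""
    assert len({len(seq) for seq in levels}) == 1, "levels must align per token"
    return ["+".join(tags) for tags in zip(*levels)]

def split_labels(joint, n_levels):
    """Recover the per-level tag sequences from the composite tags."""
    split = [tag.split("+") for tag in joint]
    assert all(len(t) == n_levels for t in split)
    return [list(seq) for seq in zip(*split)]

# Two annotation levels for the tokens "IL-2 gene expression" (tags hypothetical):
outer = ["B-DNA", "I-DNA", "I-DNA"]
inner = ["B-protein", "O", "O"]
joint = join_labels([outer, inner])
print(joint)
```

A flat tagger trained on such composite tags needs no architectural change; the cost is a larger label space.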

  • Research Article
  • 10.36244/icj.2025.2.4
A Comparative Analysis of Static Word Embeddings for Hungarian
  • Jan 1, 2025
  • Infocommunications journal
  • Máté Gedeon

This paper presents a comprehensive analysis of various static word embeddings for the Hungarian language, including traditional models such as Word2Vec, FastText, as well as static embeddings derived from BERT-based models using different extraction methods. We evaluate these embeddings on both intrinsic and extrinsic tasks to provide a holistic view of their performance. For intrinsic evaluation, we employ a word analogy task, which assesses the embeddings’ ability to capture semantic and syntactic relationships. Our results indicate that traditional static embeddings, particularly FastText, excel in this task, achieving high accuracy and mean reciprocal rank (MRR) scores. Among the BERT-based models, the X2Static method for extracting static embeddings demonstrates superior performance compared to decontextualized and aggregate methods, approaching the effectiveness of traditional static embeddings. For extrinsic evaluation, we utilize a bidirectional LSTM model to perform Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The results reveal that embeddings derived from dynamic models, especially those extracted using the X2Static method, outperform purely static embeddings. Notably, ELMo embeddings achieve the highest accuracy in both NER and POS tagging tasks, underscoring the benefits of contextualized representations even when used in a static form. Our findings highlight the continued relevance of static word embeddings in NLP applications and the potential of advanced extraction methods to enhance the utility of BERT-based models. This research contributes to the understanding of embedding performance in the Hungarian language and provides valuable insights for future developments in the field. The training scripts, evaluation codes, restricted vocabulary, and extracted embeddings will be made publicly available to support further research and reproducibility.
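The intrinsic evaluation described above (word analogies scored by accuracy and MRR) can be sketched with toy vectors. The 3-dimensional embeddings below are hypothetical stand-ins, not real Word2Vec or FastText output:

```python
import math

# Sketch of the word-analogy evaluation: answer "a is to b as c is to ?"
# by ranking vocabulary vectors against b - a + c, then score with MRR.

EMB = {                       # hypothetical 3-d toy embeddings
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.1],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

def analogy_rank(a, b, c, answer):
    """1-based rank of `answer` among candidates for a : b :: c : ?"""
    target = [bb - aa + cc for aa, bb, cc in zip(EMB[a], EMB[b], EMB[c])]
    cands = sorted((w for w in EMB if w not in {a, b, c}),
                   key=lambda w: cosine(EMB[w], target), reverse=True)
    return cands.index(answer) + 1

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

rank = analogy_rank("man", "woman", "king", "queen")
print(rank, mean_reciprocal_rank([rank]))
```

Accuracy counts only rank-1 answers, while MRR also rewards near misses, which is why the paper reports both.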

  • Research Article
  • Cited: 3
  • 10.2196/59782
Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study.
  • Oct 17, 2024
  • JMIR medical informatics
  • Shengyu Liu + 4 more

Named entity recognition (NER) models are essential for extracting structured information from unstructured medical texts by identifying entities such as diseases, treatments, and conditions, enhancing clinical decision-making and research. Innovations in machine learning, particularly those involving Bidirectional Encoder Representations From Transformers (BERT)-based deep learning and large language models, have significantly advanced NER capabilities. However, their performance varies across medical datasets due to the complexity and diversity of medical terminology. Previous studies have often focused on overall performance, neglecting specific challenges in medical contexts and the impact of macrofactors like lexical composition on prediction accuracy. These gaps hinder the development of optimized NER models for medical applications. This study aims to meticulously evaluate the performance of various NER models in the context of medical text analysis, focusing on how complex medical terminology affects entity recognition accuracy. Additionally, we explored the influence of macrofactors on model performance, seeking to provide insights for refining NER models and enhancing their reliability for medical applications. This study comprehensively evaluated 7 NER models (hidden Markov models, conditional random fields, BERT for Biomedical Text Mining, Big Transformer Models for Efficient Long-Sequence Attention, Decoding-enhanced BERT with Disentangled Attention, Robustly Optimized BERT Pretraining Approach, and Gemma) across 3 medical datasets: Revised Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), BioCreative V CDR, and Anatomical Entity Mention (AnatEM). The evaluation focused on prediction accuracy, resource use (eg, central processing unit and graphics processing unit use), and the impact of fine-tuning hyperparameters.
The macrofactors affecting model performance were also screened using the multilevel factor elimination algorithm. The fine-tuned BERT for Biomedical Text Mining, with balanced resource use, generally achieved the highest prediction accuracy across the Revised JNLPBA and AnatEM datasets, with microaverage (AVG_MICRO) scores of 0.932 and 0.8494, respectively, highlighting its superior proficiency in identifying medical entities. Gemma, fine-tuned using the low-rank adaptation technique, achieved the highest accuracy on the BioCreative V CDR dataset with an AVG_MICRO score of 0.9962 but exhibited variability across the other datasets (AVG_MICRO scores of 0.9088 on the Revised JNLPBA and 0.8029 on AnatEM), indicating a need for further optimization. In addition, our analysis revealed that 2 macrofactors, entity phrase length and the number of entity words in each entity phrase, significantly influenced model performance. This study highlights the essential role of NER models in medical informatics, emphasizing the imperative for model optimization via precise data targeting and fine-tuning. The insights from this study will notably improve clinical decision-making and facilitate the creation of more sophisticated and effective medical NER models.
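The micro-averaged (AVG_MICRO) scores reported above pool counts across entity types before computing the metric. A minimal sketch, with illustrative counts rather than the study's data:

```python
# Sketch of micro-averaged F1 (the AVG_MICRO metric above): true positives,
# false positives, and false negatives are pooled across all entity types
# before precision, recall, and F1 are computed.

def micro_f1(counts):
    """counts: {entity_type: (tp, fp, fn)}"""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

counts = {                      # hypothetical per-type counts
    "disease":  (90, 10, 10),
    "chemical": (45, 5, 15),
}
print(round(micro_f1(counts), 4))
```

Micro-averaging weights frequent entity types more heavily, which is why it can diverge from a macro average when the type distribution is skewed.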

  • Research Article
  • Cited: 1
  • 10.1093/database/baae079
Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach.
  • Aug 28, 2024
  • Database : the journal of biological databases and curation
  • M Janina Sarol + 3 more

Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
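The BIO labeling scheme mentioned for the NER component can be made concrete: decoding a BIO tag sequence into entity spans is the standard post-processing step before entity linking. A sketch with illustrative tags (not the BioRED label set):

```python
# Sketch of decoding BIO tags into entity spans: B- opens a span,
# I- continues a span of the same type, anything else closes it.
# Stray I- tags with no matching open span are dropped (one common convention).

def bio_to_spans(tags):
    """Return (start, end_exclusive, type) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        if tag.startswith("I-") and etype == tag[2:]:
            continue                             # span continues
        if start is not None:
            spans.append((start, i, etype))      # close the open span
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

tags = ["B-Gene", "I-Gene", "O", "B-Disease", "I-Disease", "I-Disease"]
print(bio_to_spans(tags))
```

The span-classification alternative the authors compare against scores candidate spans directly and skips this decoding step.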

  • Research Article
  • Cited: 6
  • 10.1609/aaai.v37i11.26571
AUC Maximization for Low-Resource Named Entity Recognition
  • Jun 26, 2023
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Ngoc Dang Nguyen + 5 more

Current work in named entity recognition (NER) uses either cross entropy (CE) or conditional random fields (CRF) as the objective/loss functions to optimize the underlying NER model. Both of these traditional objective functions for the NER problem generally produce adequate performance when the data distribution is balanced and there are sufficient annotated training examples. But since NER is inherently an imbalanced tagging problem, the model performance under the low-resource settings could suffer using these standard objective functions. Based on recent advances in area under the ROC curve (AUC) maximization, we propose to optimize the NER model by maximizing the AUC score. We give evidence that by simply combining two binary-classifiers that maximize the AUC score, significant performance improvement over traditional loss functions is achieved under low-resource NER settings. We also conduct extensive experiments to demonstrate the advantages of our method under the low-resource and highly-imbalanced data distribution settings. To the best of our knowledge, this is the first work that brings AUC maximization to the NER setting. Furthermore, we show that our method is agnostic to different types of NER embeddings, models and domains. The code of this work is available at https://github.com/dngu0061/NER-AUC-2T.
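The AUC score being maximized has a simple rank-based reading: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one (the Mann-Whitney formulation). A stand-alone sketch of that computation; the scores and labels are illustrative:

```python
# Sketch of the AUC score: the fraction of positive/negative pairs in which
# the positive example outranks the negative one, with ties counted as half.

def auc(scores, labels):
    """labels: 1 for entity tokens, 0 for non-entities."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# An imbalanced toy set: 2 entity tokens among 4 negatives.
scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.4]
labels = [1,   0,   1,   0,   0,   0]
print(auc(scores, labels))  # 1.0: every positive outranks every negative
```

Because AUC is computed over pairs rather than individual predictions, it is insensitive to the class imbalance that hurts cross-entropy in low-resource NER; the paper's contribution is a differentiable surrogate that can be optimized directly.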

  • Research Article
  • Cited: 1
  • 10.3897/biss.8.140428
BiodiViz: Leveraging NER and RE for Automated Knowledge Graph Generation in Biodiversity Research
  • Oct 29, 2024
  • Biodiversity Information Science and Standards
  • Angela Shannen Tan + 2 more

In biodiversity research, the integration of machine learning and data visualization is increasingly important for uncovering valuable insights from academic literature. This study introduces an innovative knowledge graph application, BiodiViz, designed to translate intricate text into intuitive visual representations, fostering a deeper comprehension of biodiversity relationships. BiodiViz uses the top-performing Named Entity Recognition (NER) and Relation Extraction (RE) models to automatically generate a comprehensive knowledge graph for biodiversity research. The NER model extracts and categorizes entities like organisms, phenomena, and habitats, while the RE model identifies relationships such as "have," "occur in," and "influence" from the BiodivNERE dataset (Abdelmageed et al. 2022). These entities and relationships are organized into nodes and edges within a graph. Researchers input text into BiodiViz, producing a visual knowledge graph that simplifies the analysis of complex biodiversity data, reducing manual effort and enhancing efficiency.

Named Entity Recognition & Relation Extraction
BiodiViz leverages advanced Bidirectional Encoder Representations from Transformers (BERT)-based Large Language Models (LLMs) (Rogers et al. 2020), fine-tuned specifically for NER and RE tasks using the BiodivNERE dataset. The fine-tuning process involved various models, including BERT (Devlin et al. 2019), ELECTRA (Clark et al. 2020), and BiodivBERT (Abdelmageed et al. 2023). These models were evaluated using the F1-score as the main metric, which is the harmonic mean of precision (the proportion of true positive results among all positive predictions) and recall (the proportion of true positive results among all actual positives), with BiodivBERT achieving an F1-score of 77.16% for the NER task, while BERT excelled in the RE task with an F1-score of 81.28%. Rigorous hyperparameter optimization further enhanced the performance of BiodivBERT in the RE task by 3.38%. The BiodivNERE corpora by Abdelmageed et al. (2022) were used to fine-tune several models for NER and RE tasks in the biodiversity domain. The first corpus from the BiodivNERE corpora is BiodivNER, which is a gold standard dataset (manually labelled test corpora) for evaluating NER tasks. The fine-tuning process employed the token classification method from the Hugging Face library (Hugging Face 2023b), which assigns labels to each token in a sequence. Experiments were conducted with a batch size of four, meaning the model processes four examples/rows of data at a time before making an update to improve its learning. This is due to the constraints of the NVIDIA® GeForce RTX™ 3060 graphics processor (NVIDIA 2024). Model performance was evaluated using the seqeval library (Nakayama 2018), focusing on accuracy, precision, recall, and F1 scores. For text classification, the second corpus, BiodivRE, was utilized, following previous research recommendations to explore fine-tuning settings for BiodivBERT. Hyperparameter optimization (Feurer and Hutter 2019) was conducted using Hugging Face’s Trainer API with an Optuna backend (Hugging Face 2023a), concentrating on learning rate and the number of training epochs (i.e., the number of complete passes through the entire dataset during model training).

The BiodiViz Knowledge Graph Application
The fine-tuned NER and RE models with the best F1-scores—BiodivBERT and BERT, respectively—were integrated into the knowledge graph application. Fig. 1 illustrates the flowchart of the application pipeline. Each sentence in the input text will go through the NER model to identify and label the entities within the sentence. Subsequently, these labeled entities, together with the original sentence, will be input into the RE model.
The RE model will analyze every pair of entities for a potential relation and output the type of relation they share. The application will then utilize this data to create a graph with appropriate labels and color-coding. An example of the application's user interface with the knowledge graph is shown in Fig. 2. This study highlights the practical application of machine learning and data visualization in advancing biodiversity research, emphasizing the importance of developing user-friendly tools to support scientific exploration and discovery. The BiodiViz application, including the code and resources, is available on GitHub*1, providing an accessible tool for biodiversity researchers to streamline their analyses.
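The nodes-and-edges assembly step described above can be sketched without any graph library. The triples and entity types below are illustrative, not actual BiodivNERE output:

```python
# Sketch of turning (entity, relation, entity) triples from the NER + RE
# stage into a node/edge structure for a knowledge graph.

def build_graph(triples, entity_types):
    """entity_types: optional mapping of entity -> NER category."""
    nodes = {e: {"type": entity_types.get(e, "unknown")}
             for t in triples for e in (t[0], t[2])}
    edges = [{"source": s, "relation": r, "target": o} for s, r, o in triples]
    return {"nodes": nodes, "edges": edges}

triples = [                                  # hypothetical RE output
    ("monarch butterfly", "occur in", "milkweed habitat"),
    ("temperature", "influence", "monarch butterfly"),
]
entity_types = {"monarch butterfly": "organism",   # hypothetical NER output
                "milkweed habitat": "habitat",
                "temperature": "phenomenon"}
graph = build_graph(triples, entity_types)
print(len(graph["nodes"]), len(graph["edges"]))  # 3 2
```

A visualization layer (as in BiodiViz) would then render the node types as colors and the relations as labeled edges.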

  • Research Article
  • 10.1200/jco.2025.43.16_suppl.e13607
Comparing traditional NLP methods and LLM-based extraction for identifying biomarkers in lung cancer.
  • Jun 1, 2025
  • Journal of Clinical Oncology
  • Jiby Joseph-Thomas + 11 more

e13607 Background: Lung cancer treatment relies heavily on genomic biomarkers, which are often recorded in unstructured documents like clinical notes. Extracting and interpreting this data can help identify eligible patients for clinical trials. Traditional Named Entity Recognition (NER) models, such as Regex and Long Short-Term Memory (LSTM), have been useful in identifying entities, while Small Language Model (SLM)- and Large Language Model (LLM)-based NER has shown promise in handling complex and variable information. The objective of this study was to compare these two methods for extracting genomic data from oncology notes in the EHRs. Methods: We analyzed 27 de-identified clinical notes from 29 lung cancer cases with biomarker data. The notes were sourced from different hospitals, after ethical committee approval, ensuring variability in documentation. Notes were processed by an "SLM & LLM-based NER model" and a "pre-transformed traditional NER" model. F1 scores of both models were compared for clinically relevant attributes of genomic markers, such as categorical results, exonic location, variant type, and genomic alterations. Results: The SLM- and LLM-based NER model outperformed the traditional NER model in identifying the biomarker entity, variant type, and categorical results (Table). In addition, a qualitative assessment covered other attributes, such as exon location and genomic alterations, which were not available through traditional NER models but could be extracted satisfactorily by the SLM- and LLM-based NER model; for example, MET exon 14 and EGFR genomic alteration had F1 scores of 0.8 and 0.75, respectively. Conclusions: In precision oncology, identifying biomarker variants is crucial for targeted interventions. Clinical notes are a rich source of patient information, including genomic data, making them key evidence to enrich the database.
Our study demonstrates that SLM- and LLM-based NER models are better at distinguishing contextual information, improving their ability to perform precise information extraction, such as differentiating between 'EGFR' as a biomarker and 'eGFR' as a lab test, and hence can significantly aid in extracting precision oncology data from unstructured clinical notes. This approach enhances the ability to support personalized treatments and clinical trials.

F1 scores by biomarker (Traditional NER vs. SLM & LLM-based NER):

Variable | Biomarker name (Traditional) | Biomarker name (SLM & LLM) | Variant type (Traditional) | Variant type (SLM & LLM) | Categorical result (Traditional) | Categorical result (SLM & LLM)
EGFR  | 0.85 | 0.98 | 0.04  | 0.87  | 0.38 | 0.78
ALK   | 0.89 | 0.99 | 0.03  | 0.97  | 0.50 | 0.86
ROS1  | 0.86 | 0.98 | 0.02  | 0.83  | 0.55 | 0.84
KRAS  | 0.85 | 0.97 | 0.02  | 0.90  | 0.46 | 0.52
BRAF  | 0.89 | 0.99 | 0.04  | 0.92  | 0.51 | 0.90
HER2  | 0.84 | 0.98 | 0.14  | 0.86  | 0.65 | 0.97
MET   | 0.81 | 0.96 | 0.02  | 0.83  | 0.60 | 0.83
RET   | 0.82 | 0.96 | 0.01  | 0.79  | 0.62 | 0.77
PD-L1 | 0.85 | 0.95 | 0.037 | 0.800 | 0.58 | 0.83

  • Research Article
  • Cited: 5
  • 10.3390/info11020082
Enhancing the Performance of Telugu Named Entity Recognition Using Gazetteer Features
  • Feb 2, 2020
  • Information
  • Saikiranmai Gorla + 2 more

Named entity recognition (NER) is a fundamental step for many natural language processing tasks, and hence enhancing the performance of NER models is always valuable. With limited resources available, NER for South Asian languages like Telugu is quite a challenging problem. This paper attempts to improve NER performance for Telugu using gazetteer-related features, which are automatically generated from Wikipedia pages. We use these gazetteer features along with other well-known features, such as contextual, word-level, and corpus features, to build NER models. NER models are developed using three well-known classifiers: conditional random field (CRF), support vector machine (SVM), and the margin infused relaxed algorithm (MIRA). The gazetteer features are shown to improve performance, and the MIRA-based NER model fared better than its counterparts SVM and CRF.
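Gazetteer features of the kind described above simply mark each token's membership in a lexicon, alongside word-level and contextual features, before the feature dictionaries are fed to a classifier such as CRF, SVM, or MIRA. A sketch with toy English gazetteers standing in for the Wikipedia-derived Telugu ones (all entries hypothetical):

```python
# Sketch of gazetteer-based features for NER: each token is marked for
# membership in automatically built lexicons, combined with word-level
# and contextual features.

PERSON_GAZ = {"saikiranmai", "ravi"}        # hypothetical gazetteer entries
LOCATION_GAZ = {"hyderabad", "telangana"}

def token_features(tokens, i):
    tok = tokens[i]
    low = tok.lower()
    return {
        "word": low,
        "is_title": tok.istitle(),                      # word-level feature
        "prev": tokens[i - 1].lower() if i else "<s>",  # contextual feature
        "in_person_gaz": low in PERSON_GAZ,             # gazetteer features
        "in_location_gaz": low in LOCATION_GAZ,
    }

tokens = ["Ravi", "lives", "in", "Hyderabad"]
feats = [token_features(tokens, i) for i in range(len(tokens))]
print(feats[3]["in_location_gaz"])  # True
```

In the paper's setup these per-token feature dictionaries would be the input representation for the CRF/SVM/MIRA sequence models.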

  • Research Article
  • Cited: 5
  • 10.2196/32867
A Disease Identification Algorithm for Medical Crowdfunding Campaigns: Validation Study
  • Jun 21, 2022
  • Journal of Medical Internet Research
  • Steven S Doerstling + 4 more

Background: Web-based crowdfunding has become a popular method to raise money for medical expenses, and there is growing research interest in this topic. However, crowdfunding data are largely composed of unstructured text, thereby posing many challenges for researchers hoping to answer questions about specific medical conditions. Previous studies have used methods that either failed to address major challenges or were poorly scalable to large sample sizes. To enable further research on this emerging funding mechanism in health care, better methods are needed.

Objective: We sought to validate an algorithm for identifying 11 disease categories in web-based medical crowdfunding campaigns. We hypothesized that a disease identification algorithm combining a named entity recognition (NER) model and word search approach could identify disease categories with high precision and accuracy. Such an algorithm would facilitate further research using these data.

Methods: Web scraping was used to collect data on medical crowdfunding campaigns from GoFundMe (GoFundMe Inc). Using pretrained NER and entity resolution models from Spark NLP for Healthcare in combination with targeted keyword searches, we constructed an algorithm to identify conditions in the campaign descriptions, translate conditions to International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes, and predict the presence or absence of 11 disease categories in the campaigns. The classification performance of the algorithm was evaluated against 400 manually labeled campaigns.

Results: We collected data on 89,645 crowdfunding campaigns through web scraping. The interrater reliability for detecting the presence of broad disease categories in the campaign descriptions was high (Cohen κ: range 0.69-0.96). The NER and entity resolution models identified 6594 unique (276,020 total) ICD-10-CM codes among all of the crowdfunding campaigns in our sample. Through our word search, we identified 3261 additional campaigns for which a medical condition was not otherwise detected with the NER model. When averaged across all disease categories and weighted by the number of campaigns that mentioned each disease category, the algorithm demonstrated an overall precision of 0.83 (range 0.48-0.97), a recall of 0.77 (range 0.42-0.98), an F1 score of 0.78 (range 0.56-0.96), and an accuracy of 95% (range 90%-98%).

Conclusions: A disease identification algorithm combining pretrained natural language processing models and ICD-10-CM code–based disease categorization was able to detect 11 disease categories in medical crowdfunding campaigns with high precision and accuracy.
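The hybrid categorization step, NER-derived codes mapped to disease categories with a keyword-search fallback, can be sketched as follows. The code prefixes and keywords are illustrative only, not the study's actual ICD-10-CM mapping:

```python
# Sketch of hybrid disease categorization: conditions found by the NER /
# entity-resolution step map to a category via their code prefix, and a
# keyword search backstops campaigns where NER found nothing.
# Prefixes and keywords below are hypothetical.

CODE_PREFIX_TO_CATEGORY = {"C": "cancer", "E11": "diabetes"}
KEYWORD_TO_CATEGORY = {"chemo": "cancer", "insulin": "diabetes"}

def categorize(ner_codes, description):
    """ner_codes: codes the NER/entity-resolution step produced (may be empty)."""
    categories = set()
    for code in ner_codes:
        for prefix, cat in CODE_PREFIX_TO_CATEGORY.items():
            if code.startswith(prefix):
                categories.add(cat)
    if not categories:  # word-search fallback for campaigns NER missed
        text = description.lower()
        categories = {cat for kw, cat in KEYWORD_TO_CATEGORY.items() if kw in text}
    return categories

print(categorize(["C50.9"], "Help with treatment"))
print(categorize([], "Raising funds for insulin and chemo"))
```

The fallback is what recovered the 3261 additional campaigns mentioned above, where the description contained a recognizable keyword but no NER-detectable condition.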

  • Book Chapter
  • 10.1007/978-981-16-3357-7_2
A Comprehensive Analysis of Subword Contextual Embeddings for Languages with Rich Morphology
  • Nov 13, 2021
  • Arda Akdemir + 2 more

Deep language models such as BERT pretrained on large-scale datasets have enabled remarkable progress in a wide range of NLP tasks and became the standard approach for many languages. However, an in-depth understanding of the effect of using these models is still missing for less spoken languages. This study gives a comprehensive analysis of using the BERT model for languages with rich morphology. We experimented with cross-lingual, multilingual, and monolingual BERT models and three non-BERT-based models. We evaluated these models on five morphologically rich languages (Finnish, Czech, Hungarian, Turkish, Japanese) and the English language. Evaluated on Dependency Parsing (DEP) and Named Entity Recognition (NER) tasks, which benefit highly from morphological information, BERT-based models consistently outperformed other approaches. Results revealed that the effects of using BERT-based models significantly differ across languages. Moreover, our analysis provided various critical findings of multi-task learning (MTL), transfer learning, and external features in different settings. We further verified these findings on noisy datasets for the Sentiment Analysis task as a case study. Finally, the proposed BERT-based model achieved new state-of-the-art results on both DEP and NER tasks for the Turkish language.

Keywords: Transfer learning; Deep learning; NLP; Named entity recognition; Multi-task learning; Subword contextual embeddings; Dependency parsing; Morphology

  • Research Article
  • Cited: 3
  • 10.3390/app13085163
Web Interface of NER and RE with BERT for Biomedical Text Mining
  • Apr 21, 2023
  • Applied Sciences
  • Yeon-Ji Park + 4 more

The BioBERT Named Entity Recognition (NER) model is a high-performance model designed to identify both known and unknown entities. It surpasses previous NER models utilized by text-mining tools, such as tmTool and ezTag, in effectively discovering novel entities. In previous studies, the Biomedical Entity Recognition and Multi-Type Normalization Tool (BERN) employed this model to identify words that represent specific names, discern the type of the word, and implement it on a web page to offer NER service. However, we aimed to offer a web service that includes Relation Extraction (RE), a task determining the relation between entity pairs within a sentence. First, just like BERN, we fine-tuned the BioBERT NER model within the biomedical domain to recognize new entities. We identified two categories: diseases and genes/proteins. Additionally, we fine-tuned the BioBERT RE model to determine the presence or absence of a relation between the identified gene–disease entity pairs. The NER and RE results are displayed on a web page using the Django web framework. NER results are presented in distinct colors, and RE results are visualized as graphs in NetworkX and Cytoscape, allowing users to interact with the graphs.

  • Research Article
  • Cited: 13
  • 10.1109/access.2021.3124268
Unified Transformer Multi-Task Learning for Intent Classification With Entity Recognition
  • Jan 1, 2021
  • IEEE Access
  • Alberto Benayas + 4 more

Intent classification (IC) and Named Entity Recognition (NER) are arguably the two main components needed to build a Natural Language Understanding (NLU) engine, which is a core component of conversational agents. The IC and NER components are closely intertwined and the entities are often connected to the underlying intent. Current research has primarily focused on modeling IC and NER as two separate units, which results in error propagation and thus sub-optimal performance. In this paper, we propose a simple yet effective novel framework for NLU where the parameters of the IC and the NER models are jointly trained in a consolidated parameter space. Text semantic representations are obtained from popular pre-trained contextual language models, which are fine-tuned for our task, and these parameters are propagated to other deep neural layers in our framework leading to a faithful unified modelling of the IC and NER parameters. The overall framework results in faithful parameter sharing while training is underway, leading to more coherent learning. Experiments on two public datasets, ATIS and SNIPS, show that our model outperforms other methods by a noticeable margin. On the SNIPS dataset, we obtain a 1.42% improvement in NER in terms of the F1 score, and a 1% improvement in intent accuracy score. On ATIS, we achieve a 1.54% improvement in intent accuracy score. We also present qualitative results to showcase the effectiveness of our model.

  • Conference Article
  • 10.1109/dsc55868.2022.00084
Research on Multi-Model Fusion for Named Entity Recognition Based on Loop Parameter Sharing Transfer Learning
  • Jul 1, 2022
  • Haoran Ma + 1 more

The biggest problem confronting named entity recognition (NER) in practical applications is that the number of labeled corpora in most application domains is small due to the high labeling cost; only a few domains have large, well-labeled corpora. This lack of labeled corpora is the biggest obstacle restricting the real-world application of NER. To address this problem, we propose T-NER, a model based on transfer learning. Structurally, T-NER is a multi-model fusion architecture: a multi-dimensional, multi-level fusion of the sequence learning models BiLSTM and BiGRU. Methodologically, T-NER performs transfer learning based on parameter sharing: the model is trained on the richly labeled corpora of the source domain, and the resulting model parameters are shared with the target domain, transferring source-domain knowledge to the target domain. Parameter sharing is implemented through the loop structure of T-NER: after the source-domain corpus is trained, its model parameters return to the initial part of the model via the loop structure to participate in training on the target domain. In testing, the training speed and results of T-NER on small-sample experiments exceed those of the base model, demonstrating the model's clear advantage for small-sample NER; in large-sample experiments, T-NER is not inferior to the base model. On the current experimental results, the T-NER model enables training of small-sample, quickly transferable NER models.

  • Conference Article
  • Cited: 3
  • 10.1109/oncon56984.2022.10126601
Optimizing the process of Police hotlines
  • Dec 9, 2022
  • Bassel Nasr + 2 more

The usefulness of the data entered by emergency hotline (911, 112) operators can be improved by an automatic system that validates the quality and accuracy of this information and extracts additional data useful for decision-making. To this end, telephone calls can be transcribed to text (using speech-to-text) to extract names of people or companies (aggressors, victims), places, and types of crimes or offenses (using Named Entity Recognition). As this database grows, an intelligent search engine becomes necessary for fast retrieval of useful information. In this paper we carried out a comparative study. First, we focused on classifying the Arabic texts saved in the system's database, using different transformation methods: TF-IDF (Term Frequency - Inverse Document Frequency) and word-index tokenizing, then CBOW (Continuous Bag of Words), DBOW (Distributed Bag of Words), and embeddings. We then tested several suitable models: naive Bayes models and deep neural networks with LSTM (Long Short-Term Memory) and Word2vec. Finally, we compared all of these to Transformers, applying AraBERT (Arabic Bidirectional Encoder Representations for Transformers). Second, we built an NER model that classifies words or sequences of words in the text into four entities (Suspect, Victim, Crime Location, Crime Date). This was accomplished by training and validating two machine learning algorithms for token classification (named entity recognition), with AraBERT yielding the most significant results. This NER model was later tested on Arabic crime texts scraped from Facebook, to examine its performance on new, unstructured Arabic data. Another goal achieved by this paper is similarity searching using a Word2vec model, which aims to better find information by applying an unstructured intelligent search that helps decision makers obtain relevant intelligence on which to base their choices.
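The TF-IDF transformation listed first among the compared methods can be sketched in a few lines: term frequency within a document, weighted by the log inverse of the number of documents containing the term. A minimal unsmoothed variant on a toy English corpus (the real system would run on Arabic text):

```python
import math
from collections import Counter

# Sketch of TF-IDF: each document becomes a sparse vector in which a
# term's weight is its in-document frequency times log(N / document
# frequency), so corpus-wide common terms are down-weighted.

def tfidf(corpus):
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

corpus = ["crime reported downtown",
          "crime suspect arrested",
          "traffic accident downtown"]
vecs = tfidf(corpus)
print(sorted(vecs[0], key=vecs[0].get, reverse=True)[0])  # most distinctive term
```

These sparse vectors are what a naive Bayes or neural classifier would consume in the comparison above; the embedding-based methods (CBOW, DBOW, AraBERT) replace them with dense learned representations.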
