Abstract

The vast majority of cyber security information exists as unstructured text, and machine-assisted analysis of such information is much needed. Named Entity Recognition (NER) provides a vital step towards converting this text into a structured form. However, cyber security named entities are not restricted to classical entity types such as person, location, organisation, and miscellaneous, but comprise a large set of domain-specific entities. Word embeddings have emerged as the dominant choice for the initial transfer of semantics to downstream NLP tasks, and the choice of embedding affects performance. Although several word embeddings learned on large general-purpose corpora such as Google News and Wikipedia are available as pre-trained embeddings and have shown good performance on NER tasks, this trend does not hold consistently for domain-specific NER. This work explores the relative performance and suitability of prominent word embeddings for the cyber security NER task. The embeddings considered include both general-purpose pre-trained word embeddings (non-contextual and contextual) available in the public domain and task-adapted embeddings generated by fine-tuning these pre-trained embeddings on a task-specific supervised dataset. The results indicate that, among pre-trained embeddings used as-is for cyber security NER, fastText performs better than GloVe and BERT. However, when the embeddings are further fine-tuned for the cyber security NER task, the performance of all fine-tuned embeddings improves by 2-7%. Furthermore, the BERT embedding fine-tuned using a position-wise FFN (Feed-Forward Network) produced a state-of-the-art F1-score of 0.974 on the cyber security NER dataset.
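
To illustrate the kind of setup the abstract describes, the sketch below shows pre-trained BERT token embeddings passed through a position-wise feed-forward network that predicts NER labels. The abstract does not give the paper's exact architecture, so the model name, layer sizes, label set, and the choice to freeze the encoder are assumptions for illustration only.

# Minimal sketch (PyTorch + Hugging Face transformers); hyperparameters are assumed.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertFFNTagger(nn.Module):
    """Token tagger: pre-trained BERT embeddings + a position-wise feed-forward head."""
    def __init__(self, num_labels, model_name="bert-base-cased", hidden=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        for p in self.encoder.parameters():   # assumption: keep pre-trained weights fixed
            p.requires_grad = False           # and fine-tune only the FFN head
        d = self.encoder.config.hidden_size
        self.ffn = nn.Sequential(             # position-wise FFN applied to every token
            nn.Linear(d, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.ffn(out.last_hidden_state)   # (batch, seq_len, num_labels) logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = BertFFNTagger(num_labels=5)              # hypothetical label set, e.g. O, MALWARE, ...
enc = tokenizer("Stuxnet exploited CVE-2010-2568", return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"])

In practice the head would be trained with a per-token cross-entropy loss on a supervised cyber security NER corpus; the same scaffold also accommodates non-contextual embeddings such as GloVe or fastText by replacing the encoder with an embedding lookup.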
