FeedRef2022: A Named Entity Recognition Dataset for Extracting Indicators of Compromise

Hsin-Ju Chan, Ching-Chang Chien, Ji-Jie Wu, He-Lin Ku, Chin-Yuan Hsu

doi:10.1109/bigdata55660.2022.10020985

Abstract

With the increasing use of the internet, cyber threats and malicious activities are becoming ubiquitous. To guard against unexpected attacks, gathering sufficient information about different threats is crucial. According to the Pyramid of Pain, Indicators of Compromise (IOCs) are the simplest artifacts to observe, and they help cybersecurity professionals design corresponding precautions. Cyber Threat Intelligence (CTI) is data that describes current threat events, threat actors’ targets, and attack behaviors; hence, collecting and analyzing CTI in advance helps defend against cyberattacks. In this paper, we construct a named entity recognition (NER) dataset, FeedRef2022, by collecting 1,854 threat intelligence reports and labeling them with our annotation method. Additionally, we fine-tune four pre-trained language models and compare the efficiency of each model. Among the four models, we find that the fine-tuned ELECTRA model can correctly extract new IOCs, and that the FeedRef2022 dataset can be used to train NER models for detecting IOCs.

Full Text