A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter

Han-Sub Shin, Seung-Jin Ryu, Hyuk-Yoon Kwon

doi:10.3390/electronics9091527

Open Access

https://doi.org/10.3390/electronics9091527

Abstract

Detecting cybersecurity intelligence (CSI) on social media such as Twitter is crucial because it allows security experts to respond to cyber threats in advance. In this paper, we devise a new text classification model based on deep learning to classify CSI-positive and CSI-negative tweets from a collection of tweets. For this, we propose a novel word embedding model, called contrastive word embedding, that maximizes the difference between base embedding models. First, we define CSI-positive and CSI-negative corpora, which are used for constructing the embedding models. Here, to compensate for the imbalance of the tweet data sets, we additionally employ background knowledge for each tweet corpus: (1) the CVE data set for the CSI-positive corpus and (2) the Wikitext data set for the CSI-negative corpus. Second, we adopt deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and integrate the feature vectors into one classifier. To validate the effectiveness of the proposed model, we compare our method with two baseline classification models: (1) a model based on a single embedding model constructed with the CSI-positive corpus only and (2) another model constructed with the CSI-negative corpus only. The results show that the proposed model achieves high accuracy, i.e., an F1-score of 0.934 and an area under the curve (AUC) of 0.935, which improves on the baseline models by 1.76% to 6.74% in F1-score and by 1.64% to 6.98% in AUC.
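For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an F1-score and an AUC can be computed for a binary classifier. It assumes the F1-score is taken on the CSI-positive class (label 1) and that the AUC is the area under the ROC curve, computed here through its pairwise-ranking interpretation; the labels, scores, and the 0.5 threshold are made-up illustration values, not data from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1 on the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: 1 = CSI-positive tweet, 0 = CSI-negative tweet.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]            # classifier confidence per tweet
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

toy_f1 = f1_score(labels, predictions)
toy_auc = auc(labels, scores)
```

The F1-score depends on the chosen decision threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason to report both.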

Highlights

Twitter is a representative social media platform where users write their opinions and share events. As of 2019, more than 150 million users were writing more than 500 million tweets per day [1]
In deep-learning-based text classification, once the text corpus is represented by the embedding model, its outputs are fed into a deep-learning classifier
We propose a new text classification model for detecting cybersecurity intelligence in Twitter based on a novel embedding model, contrastive word embedding
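
The idea of representing one tweet through two embedding models and integrating the resulting feature vectors can be sketched as follows. This is a simplified illustration, not the authors' implementation: the two embedding tables are hypothetical hand-written values rather than models trained on the CSI-positive and CSI-negative corpora, and mean pooling stands in for the CNN or LSTM feature extractors.

```python
# Hypothetical 2-dimensional embedding tables (illustration only).
# In the described model, one embedding would be built from the CSI-positive
# corpus and the other from the CSI-negative corpus.
POS_EMBEDDING = {
    "exploit": [0.9, 0.1],
    "vulnerability": [0.8, 0.2],
    "game": [0.1, 0.1],
}
NEG_EMBEDDING = {
    "exploit": [0.2, 0.3],
    "vulnerability": [0.1, 0.2],
    "game": [0.7, 0.9],
}
DIM = 2


def mean_pool(tokens, embedding, dim):
    """Average the vectors of known tokens (stand-in for a CNN/LSTM extractor)."""
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def contrastive_features(tokens):
    """Concatenate the features obtained from both embedding models."""
    pos_features = mean_pool(tokens, POS_EMBEDDING, DIM)
    neg_features = mean_pool(tokens, NEG_EMBEDDING, DIM)
    return pos_features + neg_features  # single input vector for one classifier


features = contrastive_features(["exploit", "vulnerability"])
```

Because the same tokens map to different vectors in the two tables, the concatenated vector exposes how differently a tweet looks from the CSI-positive and CSI-negative points of view, which a downstream classifier can then exploit.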

Summary

Introduction

Twitter is a representative social media platform where users write their opinions and share events. As of 2019, more than 150 million users were writing more than 500 million tweets per day [1].

The performance of deep learning models tends to degrade when only the target data set is used as the corpus for training the embedding model, because the training data set cannot capture all the characteristics of the target class owing to its imbalance [29]. To resolve this imbalance of the target data set, a method of incorporating background knowledge, which uses a reliable external data set to supplement the target class, has been proposed [30,31].

We collect a total of 1935 accounts that have written tweets in the “exploit” category analyzed by Recorded Future. The total number of keywords included in the RF keyword set is 639, and the examples are as follows: “Internet-security”,
