Improving Online Clustering of Chinese Technology Web News With Bag-of-Near-Synonyms

Zhe Zhang, Le Chen, Lixiang Guo, Xin Zhang, Fengjing Yin

doi:10.1109/access.2020.2995516

Abstract

In the Internet era, online clustering of technology web news can help discover scientific breakthroughs and track technology trends. To do this automatically, the news documents to be clustered must be represented as numerical vectors. However, traditional representations such as Term Frequency-Inverse Document Frequency (TF-IDF) cannot distinguish near-synonyms and may cause a “dimension disaster.” To overcome these problems, this article proposes the Bag-of-Near-Synonyms (BoNS) model. The idea is to construct near-synonym sets using word embeddings and agglomerative clustering, and then to represent a document with a Set Frequency-Inverse Document Frequency (SF-IDF) vector in which each dimension corresponds to a near-synonym set rather than a single word. To speed up computation, we further propose a hashed version of SF-IDF, named hSF-IDF, which employs a hash function to map each near-synonym set to a unique number as the key and hence reduces the computation of SF to linear time. In addition, we apply hSF-IDF to online clustering of Chinese technology web news and propose an improved batch-based method. Extensive experiments have been conducted on a real-world dataset. The results show that our model outperforms several strong baselines, including TF-IDF, average pooling of word or character embeddings, Latent Dirichlet Allocation (LDA), and bag-of-concepts, in terms of both accuracy and efficiency.
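To make the SF-IDF idea concrete, here is a minimal sketch in Python. It assumes the near-synonym sets are already given (the abstract builds them from word embeddings and agglomerative clustering, which is not shown), and it assumes the usual TF-IDF weighting carried over to sets: relative set frequency times log(N / document frequency). The exact weighting used by the authors is not stated in this text. The function name `sf_idf_vectors` and the toy English tokens are hypothetical.

```python
import math
from collections import Counter

def sf_idf_vectors(docs, synonym_sets):
    """Return one SF-IDF vector per document.

    docs: list of token lists.
    synonym_sets: list of sets of words; dimension i corresponds to set i.
    Assumed weighting: (set count / doc length) * log(N / doc frequency).
    """
    n_docs = len(docs)
    per_doc_counts = []
    for tokens in docs:
        counts = Counter()
        for tok in tokens:
            # Naive lookup: scan the sets until one contains the token.
            for idx, syn_set in enumerate(synonym_sets):
                if tok in syn_set:
                    counts[idx] += 1
                    break
        per_doc_counts.append(counts)

    # Document frequency: in how many documents does each set occur?
    doc_freq = Counter()
    for counts in per_doc_counts:
        doc_freq.update(counts.keys())

    vectors = []
    for tokens, counts in zip(docs, per_doc_counts):
        vec = [0.0] * len(synonym_sets)
        for idx, count in counts.items():
            vec[idx] = (count / len(tokens)) * math.log(n_docs / doc_freq[idx])
        vectors.append(vec)
    return vectors

# Toy data (illustrative only).
synonym_sets = [
    {"phone", "smartphone", "handset"},
    {"launch", "release"},
    {"chip", "processor"},
]
docs = [
    ["smartphone", "launch", "phone"],
    ["chip", "release"],
    ["processor", "chip", "handset"],
]
vectors = sf_idf_vectors(docs, synonym_sets)
```

Each vector has one dimension per near-synonym set, so "phone" and "smartphone" contribute to the same dimension instead of two separate ones, and the vector length is bounded by the number of sets rather than the vocabulary size.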

Highlights

Technology web news often reports the latest inventions and launches of new products
We present a new document representation named BoNS, which encodes a document as a Set Frequency-Inverse Document Frequency (SF-IDF) vector in which each dimension corresponds to a near-synonym set obtained through agglomerative clustering rather than to a single word
We further propose a hashed version of SF-IDF, named hashed SF-IDF (hSF-IDF), which uses a hash function to map each near-synonym set to a unique number as the key and attaches that key as a prefix to each in-set word
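The key-as-prefix idea in the last highlight can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the set's position in a list stands in for the hash function, the `key:word` prefix format is invented for the example, and tokens outside every set are simply dropped. The function names are hypothetical.

```python
from collections import Counter

def build_key_index(synonym_sets):
    """Map every word to the integer key of its near-synonym set.

    The list index is used as the unique key (a stand-in for a hash function).
    """
    word_to_key = {}
    for key, syn_set in enumerate(synonym_sets):
        for word in syn_set:
            word_to_key[word] = key
    return word_to_key

def prefix_tokens(tokens, word_to_key):
    """Rewrite each in-set token as '<key>:<word>'; drop out-of-set tokens."""
    return [f"{word_to_key[t]}:{t}" for t in tokens if t in word_to_key]

def set_frequency(prefixed_tokens):
    """Count set occurrences in a single pass by reading each key prefix."""
    return Counter(int(tok.split(":", 1)[0]) for tok in prefixed_tokens)

# Toy usage (illustrative only).
word_to_key = build_key_index([{"phone", "smartphone"}, {"chip", "processor"}])
prefixed = prefix_tokens(["smartphone", "chip", "phone", "cloud"], word_to_key)
sf = set_frequency(prefixed)
```

Because each token resolves to its set key with one dictionary lookup, counting set frequencies takes a single pass over the document rather than a scan over all sets for every token.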

Summary

Introduction

Technology web news often reports the latest inventions and launches of new products. Analyzing such news can help discover scientific breakthroughs and track technology trends. Online clustering techniques are widely used in existing approaches, e.g., [1]–[3], for two reasons: (1) web news is a kind of streaming data, and (2) it covers various domains and often involves emerging topics, which makes it difficult, if not impossible, to pre-specify the required features, such as keywords, or to learn them from given data samples. Online clustering of news streams is typically conducted in an incremental manner. For each incoming news document, online clustering consists of three main steps [4].
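As a rough illustration of incremental clustering, the sketch below implements a generic single-pass scheme: compare each incoming document vector with the existing cluster centroids, then either merge it into the most similar cluster or open a new one. This is a common textbook scheme and is not claimed to match the steps described in [4]; the class name, the cosine similarity measure, and the threshold value are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SinglePassClusterer:
    """Generic single-pass clustering over a stream of document vectors."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # running mean vector of each cluster
        self.sizes = []             # number of documents in each cluster

    def add(self, vec):
        """Assign vec to a cluster and return that cluster's index."""
        best, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            # Merge: update the centroid as a running mean.
            n = self.sizes[best]
            self.centroids[best] = [
                (c * n + v) / (n + 1) for c, v in zip(self.centroids[best], vec)
            ]
            self.sizes[best] = n + 1
            return best
        # No cluster is similar enough: start a new one.
        self.centroids.append(list(vec))
        self.sizes.append(1)
        return len(self.centroids) - 1

# Toy usage with 2-dimensional vectors (illustrative only).
clusterer = SinglePassClusterer(threshold=0.8)
labels = [clusterer.add(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
```

In practice the input vectors would be document representations such as SF-IDF vectors, and the threshold would need tuning on real data.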



Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 2
License type: CC BY 4.0
