Data-driven feature word selection for clustering online news comments

Heeryon Cho, Jong-Seok Lee

doi:10.1109/bigcomp.2016.7425977

Abstract

Popular news articles attract thousands of online comments, making manual review tedious and time-consuming. Automatically clustering similar comments can reduce the burden of manual analysis, but appropriate feature words must be selected for successful clustering. In this paper, we present a data-driven feature word selection method that yields structurally superior clustering of online comments. The 1,000 most frequent nouns across a corpus of 7.44 million Korean online comments are selected to construct an overall noun set. Frequent nouns in the online comments of each news article are selected to construct that article's local noun set. The intersection of the local and overall noun sets is taken to construct the global noun set. The global noun set is removed from the corresponding local noun set to construct the distinct noun set. The 250 most frequent nouns are then selected from each of the local, global, and distinct noun sets as features for K-means clustering. The clustering results are evaluated using three internal cluster validation indices: Dunn, PBM, and Silhouette. Online comments clustered using distinct nouns produced structurally superior clusters compared with those clustered using local or global nouns.
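The set construction described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes nouns have already been extracted from each comment, the function names `top_nouns` and `build_noun_sets` are hypothetical, the `local_min_count` threshold stands in for the abstract's unspecified notion of a "frequent" local noun, and final candidates are ranked by their frequency within the article's own comments. The toy data uses English placeholder nouns and tiny set sizes rather than the 1,000 and 250 used in the paper.

```python
from collections import Counter


def top_nouns(counts, k):
    """Return the k most frequent nouns, ordered by descending count."""
    return [noun for noun, _ in counts.most_common(k)]


def build_noun_sets(all_comments, article_comments,
                    overall_k=1000, local_min_count=2, feature_k=250):
    """Build local, global, and distinct feature noun lists for one article.

    Each comment is a list of already-extracted nouns.
    local_min_count is an assumed cutoff; the abstract does not state one.
    """
    # Overall noun set: most frequent nouns across the whole corpus.
    overall_counts = Counter(n for comment in all_comments for n in comment)
    overall = set(top_nouns(overall_counts, overall_k))

    # Local noun set: frequent nouns in this article's comments.
    local_counts = Counter(n for comment in article_comments for n in comment)
    local = {n for n, c in local_counts.items() if c >= local_min_count}

    # Global = local AND overall; distinct = local minus global.
    global_nouns = local & overall
    distinct = local - global_nouns

    def top(candidates):
        # Assumption: rank candidates by frequency within the article.
        ranked = Counter({n: local_counts[n] for n in candidates})
        return top_nouns(ranked, feature_k)

    return {
        "local": top(local),
        "global": top(global_nouns),
        "distinct": top(distinct),
    }


# Toy corpus: every comment from every article (placeholder data).
all_comments = [
    ["government", "tax"],
    ["government", "economy"],
    ["tax", "government"],
    ["goalkeeper", "penalty"],
]
# Comments attached to a single (sports) article.
article_comments = [
    ["goalkeeper", "penalty", "government"],
    ["goalkeeper", "penalty"],
    ["goalkeeper", "government"],
    ["tax"],
]

sets = build_noun_sets(all_comments, article_comments,
                       overall_k=2, local_min_count=2, feature_k=3)
print(sets["distinct"])  # ['goalkeeper', 'penalty']
```

In this toy example, "government" is frequent everywhere and lands in the global set, while "goalkeeper" and "penalty" are specific to the article and form the distinct set. Each resulting list would then define the feature vocabulary for vectorizing comments before K-means clustering.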
