Abstract

Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to each other than to texts in other groups. Grouping text manually requires a significant amount of time and labor, so automation using machine learning is necessary. One of the most frequently used methods for representing textual data is Term Frequency Inverse Document Frequency (TFIDF). However, TFIDF cannot take into account the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzes the performance of the BERT model as a data representation for text. Moreover, various feature extraction and normalization methods are applied to the BERT representation. To examine the performance of BERT, we use four clustering algorithms: k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms TFIDF in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produce varied performance, so the choice of feature extraction and normalization must be adapted to the text clustering algorithm used.
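
The claim that TFIDF cannot consider word position can be illustrated with a minimal, dependency-free sketch. The function name `tfidf_vectors`, the whitespace tokenizer, the toy sentences, and the particular weighting (relative term frequency times `log(N / df) + 1`) are assumptions made for illustration; they are not taken from the paper, and TFIDF variants differ in their exact formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TFIDF vectors) for a list of strings.

    Illustrative weighting (an assumption, not the paper's formula):
    tf = count / document length, idf = log(N / df) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) + 1.0 for tok in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[tok] / len(toks) * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical toy corpus: the first two sentences use the same words
# in a different order and mean different things.
docs = ["dog bites man", "man bites dog", "stocks rise sharply"]
vocab, vecs = tfidf_vectors(docs)

same_vector = vecs[0] == vecs[1]
print(same_vector)  # True: word order is invisible to a bag-of-words TFIDF
```

Because the vector only records how often each vocabulary term occurs, any reordering of the same words yields the same representation. A contextual model such as BERT produces token embeddings that depend on position and surrounding words, which is the property the abstract contrasts against.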

Highlights

  • Information technology plays an essential role in daily human activities and is developing rapidly.

  • The feature extraction and normalization strategies are abbreviated as Max for max pooling, Mean for mean pooling, I for identity normalization, LN for layer normalization, N for standard normalization, and MM for min–max normalization. The deviations denote the standard deviation of each metric over 50 repetitions.

  • The performance of each representation was evaluated by using it as input to four text clustering algorithms, namely k-means clustering (KM), eigenspace-based fuzzy c-means (EFCM), deep embedded clustering (DEC), and improved deep embedded clustering (IDEC).
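
The pooling and normalization strategies named in the highlights can be sketched in a few lines of plain Python. This is a hedged illustration only: the function names, the toy token embeddings, the `eps` guard, and the exact formulas are assumptions. In particular, "standard normalization" is assumed here to mean division by the Euclidean norm, and every normalization is assumed to act on a single pooled vector; the excerpt above does not confirm these definitions.

```python
import math


def max_pool(token_vectors):
    """Max: element-wise maximum over the token axis."""
    return [max(column) for column in zip(*token_vectors)]


def mean_pool(token_vectors):
    """Mean: element-wise average over the token axis."""
    return [sum(column) / len(column) for column in zip(*token_vectors)]


def identity(vec):
    """I: leave the pooled vector unchanged."""
    return list(vec)


def layer_norm(vec, eps=1e-12):
    """LN (assumed form): zero mean and unit variance across features."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def standard_norm(vec, eps=1e-12):
    """N (assumed form): scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def min_max_norm(vec, eps=1e-12):
    """MM (assumed form): rescale features into the range [0, 1]."""
    lo, hi = min(vec), max(vec)
    return [(x - lo) / (hi - lo + eps) for x in vec]


# Hypothetical token embeddings for one text: 3 tokens x 4 dimensions.
# Real BERT outputs have hundreds of dimensions per token.
tokens = [
    [1.0, -2.0, 0.5, 3.0],
    [0.0, 4.0, -1.5, 1.0],
    [2.0, 1.0, 1.0, -1.0],
]

pooled_max = max_pool(tokens)    # [2.0, 4.0, 1.0, 3.0]
pooled_mean = mean_pool(tokens)  # [1.0, 1.0, 0.0, 1.0]

# One text representation per (pooling, normalization) combination.
representations = {
    (p_name, n_name): norm(pooled)
    for p_name, pooled in (("Max", pooled_max), ("Mean", pooled_mean))
    for n_name, norm in (
        ("I", identity),
        ("LN", layer_norm),
        ("N", standard_norm),
        ("MM", min_max_norm),
    )
}
```

Each of the eight resulting vectors is a fixed-length representation of the text that could be passed to a clustering algorithm such as k-means, which is how the highlights describe the representations being evaluated.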


Summary

Introduction

Information technology plays an essential role in daily human activities and is developing rapidly. Clustering is a task often applied to digital text, e.g., grouping online news so that readers can find specific information based on the topic being discussed. News can be grouped manually by analyzing the text and determining the topics it contains, but the large volume of news available on the internet makes manual grouping inefficient, because it requires a lot of human resources and consumes a lot of time. Labeling text data also requires significant human resources. For these two reasons, unsupervised learning is suitable for determining groups in text data.
