The performance of BERT as data representation of text clustering

Alvin Subakti, Nora Hariadi, Hendri Murfi

doi:10.1186/s40537-022-00564-9

Open Access

https://doi.org/10.1186/s40537-022-00564-9

Journal: Journal of Big Data
Publication Date: Feb 8, 2022
Citations: 39
License type: open-access

Affiliation: University of Indonesia

Abstract

Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to each other than to those in other groups. Grouping text manually requires a significant amount of time and labor, so automation using machine learning is necessary. One of the most frequently used methods to represent textual data is Term Frequency Inverse Document Frequency (TFIDF). However, TFIDF cannot consider the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzes the performance of the BERT model as a data representation for text clustering. Moreover, various feature extraction and normalization methods are applied to the BERT representation. To examine the performance of BERT, we use four clustering algorithms: k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the TFIDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produce varied performance, so the choice of feature extraction and normalization must be adapted to the text clustering algorithm used.
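The limitation of TFIDF noted in the abstract can be illustrated with a small sketch. It is not the authors' code: the function name `tfidf_vectors`, the whitespace tokenizer, the weighting formula, and the example sentences are all assumptions chosen for illustration, since the abstract does not state which TFIDF variant was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using a plain TF-IDF weighting.

    Assumption: tf is the raw term count and idf = log(N / df) + 1.
    The exact TFIDF variant used in the paper is not given in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append(
            [counts[t] * (math.log(n_docs / df[t]) + 1.0) for t in vocab]
        )
    return vocab, vectors


docs = [
    "the dog bit the man",
    "the man bit the dog",
    "stocks fell sharply today",
]
vocab, vectors = tfidf_vectors(docs)

# Same bag of words, different meaning: the two vectors cannot be told apart.
same_representation = vectors[0] == vectors[1]
print(same_representation)
```

Because the weighting depends only on term counts, the first two sentences receive the same vector even though they describe opposite events. A contextual model such as BERT sees the token order and can, in principle, represent them differently.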

Highlights

Information technology plays an essential role in daily human activities and is developing very quickly.
The feature extraction and normalization strategies are abbreviated as Max for max pooling, Mean for mean pooling, I for identity normalization, LN for layer normalization, N for standard normalization, and MM for min–max normalization. The deviations denote the standard deviation of the metric over 50 repetitions.
The performance of the representation was evaluated by using it as input to four text clustering algorithms, namely k-means clustering (KM), eigenspace-based fuzzy c-means (EFCM), deep embedded clustering (DEC), and improved deep embedded clustering (IDEC).
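The feature extraction and normalization strategies named in the highlights can be sketched as follows. This is a hedged illustration, not the paper's implementation: this page does not define the strategies, so the sketch assumes that every normalization is applied to each text vector separately and that standard normalization (N) means scaling to unit L2 norm. Z-scoring each feature across documents would be another plausible reading. The function names `pool` and `normalize` are hypothetical, and random numbers stand in for real BERT token embeddings.

```python
import numpy as np


def pool(token_embeddings, strategy="mean"):
    """Collapse a (tokens, hidden) matrix into one (hidden,) text vector."""
    if strategy == "mean":
        return token_embeddings.mean(axis=0)
    if strategy == "max":
        return token_embeddings.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")


def normalize(X, method="I", eps=1e-12):
    """Normalize each row of a (documents, hidden) matrix.

    Assumption: all methods operate within a single text vector (row).
    """
    if method == "I":   # identity: leave the vectors unchanged
        return X
    if method == "LN":  # layer normalization: zero mean, unit variance per row
        mu = X.mean(axis=1, keepdims=True)
        sigma = X.std(axis=1, keepdims=True)
        return (X - mu) / (sigma + eps)
    if method == "N":   # standard normalization, read here as unit L2 norm
        return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    if method == "MM":  # min-max normalization: rescale each row to [0, 1]
        lo = X.min(axis=1, keepdims=True)
        hi = X.max(axis=1, keepdims=True)
        return (X - lo) / (hi - lo + eps)
    raise ValueError(f"unknown normalization: {method}")


# Stand-in for BERT output: 5 documents, 7 tokens each, hidden size 4.
rng = np.random.default_rng(0)
fake_token_embeddings = rng.normal(size=(5, 7, 4))

X_mean = np.stack([pool(doc, "mean") for doc in fake_token_embeddings])
X_max = np.stack([pool(doc, "max") for doc in fake_token_embeddings])

X_ln = normalize(X_mean, "LN")
X_n = normalize(X_mean, "N")
X_mm = normalize(X_max, "MM")
```

Each resulting matrix has one row per document and could be passed to any of the four clustering algorithms. Keeping pooling and normalization as separate steps makes it easy to try every combination, which matters because the abstract reports that the best combination depends on the clustering algorithm.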

Summary

Introduction

Information technology plays an essential role in daily human activities and is developing very quickly. Clustering is a task often applied to digital text, e.g., grouping online news so that readers can find specific information based on the topic being discussed. News can be grouped manually by analyzing each article and determining the topics it contains, but the large number of news articles available on the internet makes manual grouping inefficient, because it requires a lot of human resources and consumes a lot of time. Labeling text data also requires significant human resources. For these two reasons, unsupervised learning is a suitable approach for determining groups in text data.

Similar Papers

The impact of feature extraction techniques on the performance of text data classification models
Abdallah Maiti ... Mohamed Hanini
Indonesian Journal of Electrical Engineering and Computer Science | VOL. 35
01 Aug 2024

Bert model fine-tuning for text classification in knee OA radiology reports
L Chen ... V Pedoia
Osteoarthritis and Cartilage | VOL. 28
01 Apr 2020

Bidirectional encoders to state-of-the-art: a review of BERT and its transformative impact on natural language processing
Rajesh Gupta
Информатика. Экономика. Управление - Informatics. Economics. Management | VOL. 3
02 Mar 2024

Evaluation of Context-Aware Language Models and Experts for Effort Estimation of Software Maintenance Issues
Mohammed Alhamed ... Tim Storer
01 Oct 2022
