Automated Text Clustering of Newspaper and Scientific Texts in Brazilian Portuguese: Analysis and Comparison of Methods

Alexandre Ribeiro Afonso, Cláudio Gottschal Duque

doi:10.4301/s1807-17752014000200011

Open Access

https://doi.org/10.4301/s1807-17752014000200011

Abstract

This article reports the findings of an empirical study of automated text clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study covered four experiments, each with four procedures: 1. Corpus Selection (a set of texts is selected for clustering), 2. Word Class Selections (nouns, verbs and adjectives are chosen from each text by using specific algorithms), 3. Filtering Algorithms (a set of terms is selected from the results of the previous stage, a semantic weight is assigned to each term, and an index is generated for each text), 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, statistics on clustering correctness and clustering time were collected. The sIB clustering algorithm is the best choice for both the scientific and the newspaper corpus, provided that the number of clusters is known in advance, since sIB requires it as input before running (for the newspaper corpus, 68.9% correctness in 1 minute; for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm can additionally estimate the number of clusters without user intervention, but its best case is less than 53% correctness. In the experiments carried out, the results of automated clustering remain far from those of human text classification; it was also observed that clustering correctness varies with the number of input texts and their topics.
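The abstract reports clustering correctness as a percentage but does not say how it is computed. As a rough illustration only, the sketch below assumes a majority-group reading: each cluster is credited with the texts that belong to its most frequent original group. The function name `clustering_correctness` and the toy labels are hypothetical, and the study's own evaluation may differ.

```python
from collections import Counter

def clustering_correctness(true_groups, cluster_ids):
    """Fraction of texts that fall in the majority original group of their cluster.

    Assumption: this majority-group measure is an illustration, not
    necessarily the correctness measure used in the study.
    """
    groups_per_cluster = {}
    for group, cluster in zip(true_groups, cluster_ids):
        groups_per_cluster.setdefault(cluster, Counter())[group] += 1
    # Each cluster contributes the size of its largest original group.
    correct = sum(c.most_common(1)[0][1] for c in groups_per_cluster.values())
    return correct / len(true_groups)

# Toy data (hypothetical): six texts from three original groups.
true_groups = ["sports", "sports", "sports", "politics", "politics", "economy"]
cluster_ids = [0, 0, 1, 1, 1, 2]
score = clustering_correctness(true_groups, cluster_ids)
print(f"{score:.1%}")
```

In this toy case five of the six texts sit in the majority group of their cluster, so the score is 5/6.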

Highlights

Automated text clustering systems have been developed and tested as an experimental and scientific activity
An automatic text clustering process can be divided into four main stages: Corpus Selection, Word Class Selections, Filtering Algorithms and Clustering Algorithms; during the experiments, we applied different procedures at each stage to find the combination that produced the most correct clustering in the least time, for both newspaper and scientific texts in Brazilian Portuguese
We used an additional metric, the Deviation Number (DN), which gives the exact number of clusters created above or below the expected number of clusters
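The text above describes the Deviation Number only in words. One possible reading, assumed here, is a signed difference between the number of clusters an algorithm creates and the number expected: positive when there are extra clusters, negative when some are missing. The function name `deviation_number` is hypothetical.

```python
def deviation_number(created_clusters: int, expected_clusters: int) -> int:
    """Signed deviation from the expected number of clusters.

    Assumption: the source gives no formula; a signed difference is one
    plausible reading (positive = too many clusters, negative = too few).
    """
    return created_clusters - expected_clusters

print(deviation_number(7, 5))  # two clusters more than expected
print(deviation_number(3, 5))  # two clusters fewer than expected
```

A DN of zero would mean that an algorithm which chooses the number of clusters itself, such as EM, created exactly the expected number.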

Summary

Introduction

Automated text clustering systems have been developed and tested as an experimental and scientific activity. An automatic text clustering process can be divided into four main stages: Corpus Selection, Word Class Selections, Filtering Algorithms and Clustering Algorithms. During the experiments, we applied different procedures at each stage to find the combination that produced the most correct clustering in the least time, for both newspaper and scientific texts in Brazilian Portuguese. Clustering algorithms have been developed for general use across all languages, but they have been tested mainly on English, and many studies of text clustering using English corpora as input have been published over the last decade. Different natural languages may yield different levels of clustering correctness, since each language has its own structures and properties (such as morphological and syntactic peculiarities), with different levels of complexity in their use (number of word repetitions in newspaper texts, number of synonyms, use of idiomatic expressions, and terminologies).
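The stages described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation. It assumes the texts have already been reduced to their nouns, verbs and adjectives (the word class stage is omitted), TF-IDF stands in for the unspecified "semantic weight", and the clustering step is a k-means variant that assigns texts by cosine similarity with deterministic farthest-first seeding. All function names and the toy texts are hypothetical.

```python
import math
from collections import Counter

def build_index(texts, top_terms=50):
    """Filtering stage: weight each term by TF-IDF, keep the top-weighted terms per text."""
    n = len(texts)
    doc_freq = Counter(term for terms in texts for term in set(terms))
    index = []
    for terms in texts:
        counts = Counter(terms)
        weights = {t: (c / len(terms)) * math.log(n / doc_freq[t])
                   for t, c in counts.items()}
        best = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_terms]
        index.append(dict(best))
    return index

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def k_means(index, k, max_iterations=20):
    """Clustering stage: k-means variant using cosine similarity for assignment."""
    # Farthest-first seeding keeps the sketch deterministic (an assumption).
    centroids = [dict(index[0])]
    while len(centroids) < k:
        farthest = min(index, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(dict(farthest))
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                          for v in index]
        if new_assignment == assignment:
            break  # converged: no text changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(index, assignment) if a == c]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            total = {}
            for v in members:
                for t, w in v.items():
                    total[t] = total.get(t, 0.0) + w
            centroids[c] = {t: w / len(members) for t, w in total.items()}
    return assignment

# Toy corpus (hypothetical): two sports texts and two politics texts,
# already reduced to selected word classes.
texts = [
    ["futebol", "gol", "time", "gol"],
    ["futebol", "time", "campeonato"],
    ["voto", "governo", "senado"],
    ["governo", "voto", "congresso", "voto"],
]
labels = k_means(build_index(texts), k=2)
print(labels)
```

Like sIB and Simple K-Means in the study, this sketch needs the number of clusters `k` as input; an algorithm such as EM that estimates the number itself would replace only the last stage.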

Objectives

Results

Conclusion