Document Clustering Techniques Research Articles

People need efficient ways to handle the vast amounts of data generated by the information technology revolution. Automatic text summarization has therefore been applied in various domains to find the most relevant information and support fast, critical decisions. For Arabic, text summarization techniques suffer from several problems. First, most existing methods do not consider the context or domain to which a document belongs. Second, most existing approaches rely on the traditional bag-of-words representation, which produces high-dimensional, sparse data and makes it difficult to capture relevant information. Third, research on Arabic text summarization is limited and recent compared with research on Anglo-Saxon and other languages, owing to the shortage of Arabic corpora, resources, and automatic processing tools. In this paper, we try to overcome these limitations by proposing a new approach that uses document clustering, topic modeling, and unsupervised neural networks to build an efficient document representation model. First, a new document clustering technique based on the extreme learning machine is applied to a large text collection. Second, topic modeling is applied to the document collection to identify the topics present in each cluster. Third, each document is represented in a topic space by a matrix whose rows represent the document's sentences and whose columns represent the cluster topics. Several unsupervised neural networks and ensemble learning algorithms are then trained on this matrix to build an abstract representation of the document in the concept space. Important sentences are ranked and extracted according to a graph model with a redundancy elimination component. The proposed approach is evaluated on the Essex Arabic Summaries Corpus and compared against other Arabic text summarization approaches using the ROUGE measure. Experimental results show that models trained on the topic representation learn better representations and significantly improve summarization performance. In particular, the ensemble learning models showed a notable improvement in ROUGE recall and promising results on F-measure.
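The ranking and redundancy elimination step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the graph model, so this example assumes weighted degree centrality over a cosine-similarity graph, and the function names, the toy sentence-by-topic matrix, and the 0.9 redundancy threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(matrix):
    """Score each sentence (row) by its summed similarity to every other row.

    This is weighted degree centrality on the sentence similarity graph,
    an assumed stand-in for the graph model mentioned in the abstract.
    """
    n = len(matrix)
    return [
        sum(cosine(matrix[i], matrix[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


def select_summary(matrix, max_sentences, redundancy_threshold=0.9):
    """Greedily pick top-ranked sentences, skipping near-duplicates of picks."""
    scores = rank_sentences(matrix)
    order = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == max_sentences:
            break
        # Redundancy elimination: keep i only if it differs from all chosen rows.
        if all(cosine(matrix[i], matrix[j]) < redundancy_threshold for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # restore document order


# Hypothetical sentence-by-topic matrix: rows are sentences, columns are topics.
sentence_topic = [
    [0.9, 0.1, 0.0],  # sentence 0
    [0.9, 0.1, 0.0],  # sentence 1, identical to sentence 0 in topic space
    [0.1, 0.8, 0.1],  # sentence 2
    [0.0, 0.1, 0.9],  # sentence 3
]
summary = select_summary(sentence_topic, max_sentences=2)
print(summary)
```

In this toy input, sentences 0 and 1 are tied for the highest score, but only one of them can enter the summary because their similarity exceeds the threshold; the second slot goes to the next-ranked sentence that is sufficiently different.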

Document clustering is the partitioning of a given collection of documents into K groups based on some similarity or dissimilarity criterion. This task has applications in detecting the scope of journals and conferences, developing automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and semantics-based document classification. In the current paper, a cognitive-inspired multi-objective automatic document clustering technique is proposed that fuses a self-organizing map (SOM) with a multi-objective differential evolution approach. A variable number of cluster centers is encoded in each solution of the population so that the number of clusters can be determined from the data set automatically. These solutions undergo various genetic operations during evolution. The concept of SOM is used to design new genetic operators for the proposed clustering technique. To measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, named the self-organizing map based multi-objective document clustering technique (SMODoc_clust), is shown in the automatic classification of scientific articles and web documents. Different representation schemes, including tf, tf-idf, and word embeddings, are used to convert articles into vector form. Comparative results with respect to two internal cluster validity indices, the Dunn index and the Davies-Bouldin index, are reported against several state-of-the-art clustering techniques: three multi-objective clustering techniques (MOCK, VAMOSA, and NSGA-II-Clust), a single-objective genetic algorithm (SOGA) based clustering technique, K-means, and single-linkage clustering. The results show that the proposed approach outperforms the compared techniques. The results are further validated using statistical significance t-tests.
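As an illustration of one of the two optimized validity indices, the Silhouette index can be sketched in plain Python. This is the standard textbook definition, not code from the paper; the Euclidean distance, the function names, and the toy points and labelings are assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_index(points, labels):
    """Mean silhouette width over all points; values lie in [-1, 1].

    For point i, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster.
    The width is (b - a) / max(a, b); singleton clusters contribute 0.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Silhouette index needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster: width defined as 0
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0.0:  # guard against all-identical points
            total += (b - a) / denom
    return total / len(points)


# Hypothetical 2-D document vectors forming two well-separated pairs.
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_index(docs, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_index(docs, [0, 1, 0, 1])   # splits each nearby pair
print(good, bad)
```

A labeling that groups nearby vectors scores close to 1, while one that separates them scores below 0, which is why the index can serve as an objective to maximize when searching over candidate clusterings.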

Articles published on Document Clustering Techniques

Document Clustering in the Age of Big Data: Incorporating Semantic Information for Improved Results

Identification and Visualization of Key Topics in Scientific Publications with Transformer-Based Language Models and Document Clustering Methods

Construction of suffix tree using key phrases for document using down-top incremental conceptual hierarchical text clustering approach

Comparing the Performance of SOM with Traditional Methods for Document Clustering Using Wordnet Ontologies

Automatic Labeling of Clusters for a Low-Resource Urdu Language

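Analisis Dan Implementasi Algoritma Active Fuzzy Constrained Clustering Untuk Pengelompokan Dokumen [Analysis and Implementation of the Active Fuzzy Constrained Clustering Algorithm for Document Clustering]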

Application of Convolution Neural Networks in Web Search Log Mining for Effective Web Document Clustering

Automatic Text Summarization using Document Clustering Named Entity Recognition

Knowledge tracing: A bibliometric analysis

[Retracted] Big Data Analytics Model for Distributed Document Using Hybrid Optimization with K‐Means Clustering

Stamantic clustering: Combining statistical and semantic features for clustering of large text datasets

Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling

Semantics-based clustering approach for similar research area detection

Role of Pre-processing Phase in Document Clustering Technique for Gurmukhi Script

Performance Exploration on Various Document Clustering Techniques with K-Means Family

Development of Document Clustering Technique for Gurmukhi Script using Fuzzy Term Weight

Critical Analysis of Clinical Document Clustering Technique with Special Reference to Non-Matrix Factorization

Automatic Scientific Document Clustering Using Self-organized Multi-objective Differential Evolution

Embedded Fuzzy Bilingual Dictionary model for cross language information retrieval systems

Clustering News Articles using Efficient Similarity Measure and N-grams
