Development of a document classification method by using geodesic distance to calculate similarity of documents

Hung Vo-Trung

doi:10.15587/1729-4061.2020.203866

Abstract

The Internet now gives people quick and convenient access to human knowledge through various channels such as Web pages, social networks, digital libraries, and portals. However, as information is exchanged and updated ever faster, the volume of information stored in the form of digital documents is increasing rapidly. We therefore face challenges in representing, storing, sorting, and classifying documents. In this paper, we present a new approach to text classification based on semi-supervised machine learning and the Support Vector Machine (SVM). The novelty of the study is that, instead of calculating the distance between vectors with the Euclidean distance, we use the geodesic distance. To do this, each text must first be expressed as an n-dimensional vector. In the n-dimensional vector space, each vector is represented by one point; the geodesic distance is used to calculate the distance from each point to nearby points, and the points are connected into a graph. Classification is based on calculating the shortest path between vertices of the graph through a kernel function. We conducted experiments on articles taken from Reuters on 5 different topics. To evaluate the proposed method, we tested the SVM method with the traditional calculation based on Euclidean distance and the method we proposed based on geodesic distance. Both were run on the same data set of 5 topics: Business, Markets, World, Politics, and Technology. The results showed that the correct classification rate of the proposed method is higher than that of the traditional SVM method based on Euclidean distance (by 3.2 % on average).
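The pipeline outlined in the abstract (points in a vector space, a neighbourhood graph, shortest paths, and a kernel) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the choice of k nearest neighbours with Euclidean edge weights (an Isomap-style construction), and the Gaussian form exp(-d^2 / (2 sigma^2)) with its parameter `sigma` are all assumptions made for illustration.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbours.

    Assumption: edge weights are local Euclidean distances.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_paths(graph, source):
    """Dijkstra's algorithm: graph (geodesic) distance from source to all vertices."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def geodesic_gaussian_kernel(points, k=2, sigma=1.0):
    """Kernel matrix K[i][j] = exp(-d_geo(i, j)^2 / (2 * sigma^2)).

    Unreachable pairs have infinite distance and therefore kernel value 0.
    """
    graph = knn_graph(points, k)
    n = len(points)
    kernel = []
    for i in range(n):
        d = shortest_paths(graph, i)
        kernel.append(
            [math.exp(-(d[j] ** 2) / (2 * sigma ** 2)) for j in range(n)]
        )
    return kernel
```

A matrix built this way could be passed to an SVM implementation that accepts a precomputed kernel. Note that a Gaussian of graph distances is not guaranteed to be positive semi-definite in general, so this sketch should be read as an illustration of the idea rather than a validated kernel.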

Highlights

Classification is an important step that makes processing more efficient, because a smaller group of documents is processed instead of the entire collection of documents
To evaluate the proposed method, we tested the Support Vector Machine (SVM) method with the traditional calculation based on Euclidean distance and the method we proposed based on geodesic distance
We propose a kernel function for the Support Vector Machine that uses geodesic distance combined with the Gaussian function

Summary

Introduction

Classification is an important step that makes processing more efficient, because a smaller group of documents (after classification) is processed instead of the entire collection of documents. There are many methods of text classification, and most of them are based on machine learning techniques [1]. When text is classified with machine learning, the volume and quality of the text used to train the system are extremely important: they determine how good the classification model is and, in turn, the quality of the text classification system. Building data warehouses for developing machine learning-based text classification applications is often quite expensive, and such resources are scarce, especially for languages with few users. Instead of a supervised machine learning method, a semi-supervised machine learning method is therefore often used, since it does not require a large amount of training data (labeled text) during classification. Unsupervised machine learning methods are rarely used because their classification quality is not high and their speed is low [2].

Objectives

Methods

Findings

Conclusion