Abstract

In this paper, we investigate the influence of distance metrics on the results of open-set subject classification of text documents. We use the Local Outlier Factor (LOF) algorithm to extend a closed-set classifier (a multilayer perceptron) with an additional class that identifies outliers. The analyzed text documents are represented by averaged word embeddings computed with the fastText method on the training data. Conducting experiments on two different text corpora, we show how the distance metric chosen for LOF (Euclidean or cosine) and a transformation of the feature space (the vector representation of documents) both influence the open-set classification results. The general conclusion is that cosine distance outperforms Euclidean distance for open-set classification of text documents.
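A minimal sketch of the outlier-detection step described above, assuming scikit-learn's `LocalOutlierFactor` in novelty mode; the document vectors here are random stand-ins for the averaged fastText embeddings, and all names and parameter values are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical stand-in for averaged fastText document embeddings:
# unit-normalized random vectors for the "known" training documents.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 50))
train /= np.linalg.norm(train, axis=1, keepdims=True)

# LOF in novelty mode labels new documents as inliers (+1) or outliers (-1).
# The metric parameter is what switches between Euclidean and cosine distance,
# which is the comparison studied in the paper.
lof_cosine = LocalOutlierFactor(n_neighbors=20, metric="cosine", novelty=True)
lof_cosine.fit(train)

test = rng.normal(size=(5, 50))
test /= np.linalg.norm(test, axis=1, keepdims=True)
labels = lof_cosine.predict(test)  # array of +1 / -1 per test document
```

In an open-set pipeline like the one described, documents labeled -1 would be assigned to the extra "outlier" class, while documents labeled +1 would be passed on to the closed-set multilayer perceptron for normal classification.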
