Abstract

Topological Data Analysis (TDA) refers to a collection of methods that find the structure of shapes in data. Although TDA methods have recently been used in many areas of data mining, they have not been widely applied to text mining tasks. Most text processing algorithms lose the order in which different entities appear or co-appear. If this lost ordering is an informative feature of the data, TDA may play a significant role in closing the resulting gap in the state of the art of text processing. The topology of different entities through a textual document may reveal additional information about the document that is not reflected in the features produced by conventional text processing methods. In this paper, we introduce a novel approach that employs TDA in text processing in order to capture and use the topology of same-type entities in textual documents. First, we show how to extract topological signatures from text using persistent homology, a TDA tool that captures the topological signature of a data cloud. Then we show how to use these signatures for text classification.

Highlights

  • A common approach in Topological Data Analysis (TDA) is to capture the shape or the underlying structure of shapes in data

  • In these works, persistent homology is usually employed to study changes in the topology of d-dimensional time series or of the delay embedding of 1-dimensional time series

  • We propose a novel method that uses persistent homology to predict the author of a novel based only on the graph of its main characters

Summary

Introduction

A common approach in Topological Data Analysis (TDA) is to capture the shape or the underlying structure of shapes in data. TDA has been considered for dealing with high-dimensional, noisy data sets. The usual strategy is to capture the shapes as the main characteristics of the data and dismiss the rest as noise or irrelevant information. Wherever the shape and/or the structure of shapes in data is valuable, TDA may provide reasonable solutions. A data cloud is often viewed in the form of a simplicial complex. A subset consisting of (k + 1) data points is called a k-simplex. A simplicial complex is a set of such simplices.
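
The definitions above can be illustrated with a small sketch. It assumes one common construction, the Vietoris-Rips complex, in which a set of (k + 1) points forms a k-simplex whenever all of its points lie within a chosen distance of one another. The construction, the function name `rips_complex`, and the parameters `epsilon` and `max_dim` are illustrative assumptions; the text above does not say which complex is used.

```python
from itertools import combinations
from math import dist


def rips_complex(points, epsilon, max_dim=2):
    """Group simplices (tuples of point indices) by dimension.

    Assumed rule (Vietoris-Rips): k + 1 points form a k-simplex
    when every pair among them is at distance <= epsilon.
    """
    n = len(points)
    # Every data point is a 0-simplex on its own.
    simplices = {0: [(i,) for i in range(n)]}
    for k in range(1, max_dim + 1):
        simplices[k] = [
            subset
            for subset in combinations(range(n), k + 1)
            if all(
                dist(points[i], points[j]) <= epsilon
                for i, j in combinations(subset, 2)
            )
        ]
    return simplices


# Toy data cloud: three nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
complex_ = rips_complex(points, epsilon=1.5)

for dimension, found in complex_.items():
    print(dimension, found)
```

With this toy input the three nearby points yield three edges (1-simplices) and one filled triangle (a 2-simplex), while the distant point remains an isolated 0-simplex. Persistent homology tracks how such a complex changes as `epsilon` grows.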
