Abstract

Topological data analysis (TDA) is a branch of mathematics that analyzes the shape of high-dimensional data sets using geometry and algebra. TDA is used for data visualization, representing the relationships among elements as a network. Traditionally, TDA is quadratic in complexity and is not commonly used for natural language processing. In this research, we visualize the relationships among words in a text block, words in a corpus, and text blocks in a corpus. A text block is a unit of a corpus, such as a web page in a web corpus, a chapter or section in a book corpus, or a document in a media corpus. This research proposes a circular topology for representing words in both the Local Context (LC) and the Global Context (GC). Each text block is a set of sentences forming the LC. We found that feature words are extracted successfully by our LC analysis. The occurrences of the extracted feature words in the corpus form the GC. We evaluate the proposed simplified topological analysis on 3 different corpora: a single-book corpus, a book corpus consisting of 7 books with 6020 narrations, and a web corpus consisting of 990 web pages. The peripheral nature of the LC reduced the vocabulary size of the corpus significantly in O(nm) time, where n is the number of text blocks and m is the number of nouns in a sentence. GC analysis of feature words revealed useful properties of feature word movement, which can be used to analyze topic evolution. GC analysis of text block points aims to find closely related text blocks within a radius. This produced interesting results that need further supervised investigation. Research on topology-driven natural language processing is in its infancy. This article contributes to the field by introducing a method, motivated by TDA, to represent and visualize the peripheral nature of a text block and a corpus; by achieving dimensionality reduction through local analysis; and by simplifying complex topological analysis through localization.

