Abstract
We propose a new approach to text semantic analysis and general corpus analysis using what we term in this article a "bi-gram graph" representation of a corpus. Attributes derived from graph theory, such as the chromatic number and graph coloring, graph density, and K-core, are measured and analyzed either as insights in their own right or in comparison with other corpus graphs. We observe a vast domain of tools and algorithms that can be developed on top of the graph representation; creating such a graph proves to be computationally cheap, and much of the heavy lifting is achieved via basic graph calculations. Furthermore, we showcase the different use cases for bi-gram graphs and how well the representation scales when dealing with large datasets.
Highlights
Corpus representation is central to natural language processing
The K-core dimensionality reduction and noise reduction approach proposed in Section 3.2 and demonstrated below shows outstanding results when used in classification-based machine learning pipelines
In the example shown below, a corpus of spam and ham SMS messages was taken as a classic example of text classification in natural language processing with machine learning; the corpus was cleared of stopwords and converted into a bag-of-words representation
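One plausible reading of K-core reduction ahead of a bag-of-words step can be sketched as follows. Section 3.2 of the paper is not reproduced here, so this is an assumption rather than the authors' procedure: the graph is taken to be undirected and unweighted, the K-core is computed by repeatedly dropping words with fewer than k neighbors, and the surviving words form the vocabulary. The toy messages and the names `bigram_graph`, `k_core_nodes`, and `bag_of_words` are hypothetical.

```python
from collections import Counter


def bigram_graph(docs):
    """Undirected word graph: an edge joins two words that appear side by side."""
    adj = {}
    for tokens in docs:
        for token in tokens:
            adj.setdefault(token, set())
        for left, right in zip(tokens, tokens[1:]):
            if left != right:  # ignore self-loops from repeated words
                adj[left].add(right)
                adj[right].add(left)
    return adj


def k_core_nodes(adj, k):
    """Return the node set of the k-core by peeling nodes of degree below k."""
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    alive = set(adj)
    stack = [node for node in alive if degree[node] < k]
    while stack:
        node = stack.pop()
        if node not in alive:
            continue  # already removed via an earlier stack entry
        alive.discard(node)
        for nbr in adj[node]:
            if nbr in alive:
                degree[nbr] -= 1
                if degree[nbr] < k:
                    stack.append(nbr)
    return alive


def bag_of_words(tokens, vocabulary):
    """Count only the tokens that belong to the reduced vocabulary."""
    return Counter(t for t in tokens if t in vocabulary)


# Made-up, already tokenized and stopword-free messages (illustration only)
docs = [
    ["win", "free", "prize", "call", "now"],
    ["free", "call", "now", "win", "prize"],
    ["lunch", "tomorrow"],
]

graph = bigram_graph(docs)
core = k_core_nodes(graph, k=2)
features = [bag_of_words(tokens, core) for tokens in docs]

print(sorted(core))
print(features)
```

In this sketch the choice of k controls how aggressively the vocabulary shrinks: weakly connected words are dropped first, which is what makes the K-core usable as a noise filter before the feature vectors are passed to a classifier.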
Summary
Corpus representation is central to natural language processing. This paper highlights the benefits and use cases of a representation based on inter-word relationships derived from the bi-grams of a given corpus. Previous works that used similar methods revolve around solving a specific problem using a graph representation: Massé et al. (2008) suggested a new way of grounding the meanings of certain words in sensorimotor categories, and Reimer and Hahn (1988) proposed a model of knowledge-based text condensation that resembles today's well-known knowledge graphs. Many graph attributes are left untouched in natural language processing due to the different representations available. Violos et al. (2018) highlighted the benefits of combining N-gram flexibility with the well-structured representation of directed graphs, and their application to classification problems.
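As a rough illustration of the representation described above, the sketch below builds a small bi-gram graph in plain Python and computes two of the attributes this paper names, graph density and a graph coloring. It is not the authors' implementation: the undirected, unweighted edge convention, the whitespace tokenizer, the toy sentences, and the function names are all assumptions made for the example. The greedy coloring gives only an upper bound on the chromatic number, since computing the exact value is NP-hard in general.

```python
def bigram_graph(sentences):
    """Nodes are words; an undirected edge joins every pair of adjacent words."""
    adj = {}
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        for token in tokens:
            adj.setdefault(token, set())
        for left, right in zip(tokens, tokens[1:]):
            if left != right:  # ignore self-loops from repeated words
                adj[left].add(right)
                adj[right].add(left)
    return adj


def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is stored twice
    return 2 * m / (n * (n - 1))


def greedy_coloring(adj):
    """Largest-degree-first greedy coloring; adjacent words get different colors."""
    colors = {}
    for node in sorted(adj, key=lambda w: (-len(adj[w]), w)):
        used = {colors[nbr] for nbr in adj[node] if nbr in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Toy corpus, made up for illustration
corpus = ["the cat sat on the mat", "the dog sat on the log"]

graph = bigram_graph(corpus)
coloring = greedy_coloring(graph)

print("nodes:", len(graph))
print("density:", round(density(graph), 3))
print("colors used (upper bound on chromatic number):", max(coloring.values()) + 1)
```

Both attributes are computed in a single pass over the adjacency sets, which is in line with the claim that the graph is cheap to build and that much of the analysis reduces to basic graph calculations.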