Abstract
Automatic text classification using machine learning is significantly affected by the text representation model. Structural information in text is necessary for natural language understanding, yet it is usually ignored in vector-based representations. In this paper, we present a graph kernel-based text classification framework which utilises the structural information in text effectively through the weighting and enrichment of a graph-based representation. We introduce weighted co-occurrence graphs to represent text documents, which weight the terms and their dependencies based on their relevance to text classification. We propose a novel method to automatically enrich the weighted graphs using semantic knowledge in the form of a word similarity matrix. The similarity between enriched graphs, which we call knowledge-driven graph similarity, is calculated using a graph kernel. The semantic knowledge in the enriched graphs ensures that the graph kernel goes beyond exact matching of terms and patterns to compute the semantic similarity of documents. In experiments on sentiment classification and topic classification tasks, our knowledge-driven similarity measure significantly outperforms the baseline text similarity measures on five benchmark text classification datasets.
Highlights
Research on automatic text classification has gained importance due to the information overload problem and the need for faster and more accurate extraction of knowledge from huge data sources
Graph-based representations of text are effective for text classification as they can model the structural information in text, which is required to understand its meaning
We focused on building a text graph model that represents the structural information in text effectively, which helps to compare documents based on the most relevant content they share
Summary
Research on automatic text classification has gained importance due to the information overload problem and the need for faster and more accurate extraction of knowledge from huge data sources. Bag-of-words is the most commonly used text representation scheme. It is based on the term independence assumption: a text document is regarded as a set of unordered terms and is represented as a vector. We use an edge walk graph kernel to utilise the information in the enriched weighted graphs for calculating the similarity between text documents. The kernel function takes as input a pair of weighted co-occurrence graphs and gives as output a similarity value based on matching relevant content of the text documents. The novel contributions made in this paper are (1) the proposed weighting of the graph, (2) the automatic enrichment of graphs, and (3) the application of the new graph-based text representation to build the knowledge-driven similarity measure.
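The idea of a kernel that takes two weighted co-occurrence graphs and returns a similarity value can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a sliding window over tokens, raw co-occurrence counts as edge weights (a stand-in for the relevance-based weighting proposed in the paper), and walks of length one, so that only edges present in both graphs contribute. The enrichment step with a word similarity matrix is left out. The function names and the `window` parameter are hypothetical.

```python
import math
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected weighted graph as {(u, v): weight}.

    Two terms are linked when one appears within `window` tokens after
    the other. The weight is the raw co-occurrence count (an assumption;
    the paper weights edges by relevance to the classification task).
    """
    edges = defaultdict(float)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:
                # Sort the pair so (u, v) and (v, u) map to the same edge.
                edges[tuple(sorted((u, v)))] += 1.0
    return dict(edges)


def edge_walk_kernel(g1, g2):
    """Sum the products of weights over edges found in both graphs."""
    shared = g1.keys() & g2.keys()
    return sum(g1[e] * g2[e] for e in shared)


def normalised_similarity(g1, g2):
    """Scale the kernel value to [0, 1] using the self-similarities."""
    norm = math.sqrt(edge_walk_kernel(g1, g1) * edge_walk_kernel(g2, g2))
    return edge_walk_kernel(g1, g2) / norm if norm else 0.0


if __name__ == "__main__":
    doc_a = "the film was really good".split()
    doc_b = "the film was really bad".split()
    g_a = cooccurrence_graph(doc_a)
    g_b = cooccurrence_graph(doc_b)
    print(normalised_similarity(g_a, g_b))
```

Because this sketch matches edges exactly, "good" and "bad" contribute nothing to the score. The enrichment described in the paper is what would let semantically related terms contribute as well.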
More From: International Journal of Machine Learning and Cybernetics