An LSTM model for extracting hierarchical relations between words for better topic modeling

Arshad Javeed

doi:10.1088/1742-6596/1780/1/012019

Abstract

Text data often carries valuable information about how the words in a corpus relate to one another. The relationships sought here are the “has-a” and “is-a” relationships, from which one can build a hierarchical representation of words. Since each language has its own rules and syntax, extracting these relationships ultimately comes down to understanding the syntax of the particular language and using relevant features in the process. The paper presents a machine-learning model for understanding the language syntax and deducing the relationships between the words encountered. Specifically, a sequence modeling approach is followed, in which the model receives a sequence of words and uses various properties of the words to build a hierarchical graph. The algorithm described is independent of the language, and the model should be versatile enough to be trained for different languages. In addition, the paper describes how this information can be used to build better topic models, given a corpus of text.

Full Text