Abstract

Automatic topic labelling aims to generate coherent, interpretable, and meaningful labels that ease the interpretation of the topics discovered in a document corpus. A topic is typically represented as a list of terms and documents ranked by probability; in this work, we use Top2Vec for topic modelling. We propose a novel three-phase, zero-shot topic labelling framework that leverages the ConceptNet knowledge graph (a comprehensive semantic network of words and phrases) and pre-trained language models as external sources of information. The first phase enriches the top-n words of a topic (ranked by probability) by expanding their neighbourhood in ConceptNet, bridging missing connections and information gaps, and yielding a semantically enhanced set of candidate labels in the form of a sub-graph. The second phase constructs a neighbourhood graph (a sub-graph of ConceptNet) for each candidate label, evaluates each node's semantic similarity to the topic, and retains the sub-graph with the highest similarity; from this final graph, a one-word label is extracted that concisely represents the topic. In the third phase, a pre-trained language model takes the optimal graph as input and derives the sentence and summary labels. These additional labels offer more comprehensive and contextually rich topic representations, supporting deeper understanding and interpretation. By harnessing knowledge graphs and language models, our framework extends the available knowledge beyond the topic documents, enriching the discovered topics with more representative terms while preserving the topic information. Because the approach is zero-shot (it employs pre-trained language models and the ConceptNet knowledge graph without additional training), it alleviates computational burdens, and by generating three labels per topic (a one-word label, a sentence label, and a summary label) it reduces the cognitive and interpretative load on end-users. Experimental results demonstrate that our model significantly surpasses unsupervised baselines and traditional topic labelling models while remaining competitive with supervised baselines in topic labelling performance.
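
The three phases lend themselves to a compact illustration. The sketch below is not the authors' implementation; it assumes the public ConceptNet REST API (api.conceptnet.io), the all-MiniLM-L6-v2 sentence encoder as a stand-in for the unspecified similarity model, and a BART summarization pipeline as a stand-in for the label-generating language model. All helper names and the toy topic are illustrative.

```python
# Toy walk-through of the three phases on a single topic. Everything here is
# a sketch: the paper does not prescribe these exact models or endpoints.
import requests
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

CONCEPTNET = "https://api.conceptnet.io"

def expand_term(term, limit=20):
    """Phase 1 (sketch): fetch ConceptNet's most related English concepts for
    one topic word; the union over the top-n words forms the candidate-label
    pool, standing in for the paper's sub-graph expansion."""
    url = f"{CONCEPTNET}/related/c/en/{term.replace(' ', '_')}?filter=/c/en"
    related = requests.get(url, timeout=10).json().get("related", [])
    # An "@id" looks like "/c/en/coffee"; recover the surface form.
    return [r["@id"].split("/")[-1].replace("_", " ") for r in related[:limit]]

def rank_candidates(topic_words, candidates, encoder):
    """Phase 2 (sketch): score each candidate against the topic via cosine
    similarity of sentence embeddings; the top candidate plays the role of
    the one-word label extracted from the best sub-graph."""
    topic_emb = encoder.encode(" ".join(topic_words), convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(topic_emb, cand_embs)[0].tolist()
    return sorted(zip(candidates, sims), key=lambda cs: -cs[1])

if __name__ == "__main__":
    topic_words = ["neuron", "synapse", "cortex", "memory", "brain"]  # toy topic
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    # Phase 1: candidate pool from the union of ConceptNet neighbourhoods.
    pool = sorted({c for w in topic_words for c in expand_term(w)})

    # Phase 2: one-word label = candidate most similar to the topic.
    ranked = rank_candidates(topic_words, pool, encoder)
    print("one-word label:", ranked[0][0])

    # Phase 3 (sketch): feed the retained neighbourhood to a pre-trained LM to
    # obtain the sentence/summary labels; a summarization pipeline stands in
    # for whichever language model the framework actually employs.
    summarize = pipeline("summarization", model="facebook/bart-large-cnn")
    context = ("This topic concerns "
               + ", ".join(topic_words + [c for c, _ in ranked[:10]]) + ".")
    out = summarize(context, max_length=25, min_length=5, do_sample=False)
    print("summary label:", out[0]["summary_text"])
```

Note that the zero-shot property claimed in the abstract is preserved in this sketch: every component is used off the shelf, with no fine-tuning on the target corpus.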

