ParsingPhrase: Parsing-based automated quality phrase mining

Yongliang Wu, Shuliang Zhao, Shimao Dou, Jinghui Li

doi:10.1016/j.ins.2023.03.089

Abstract

Phrases represent independent semantics in natural language, but they usually have indeterminate lengths and varied word combinations. Extracting meaningful phrases from unstructured text therefore substantially reduces semantic ambiguity and lays the foundation for downstream natural language tasks. Most existing research obtains candidate phrases from N-grams, an approach that includes meaningless word sequences and degrades algorithm performance. In this paper, we propose a novel phrase-mining algorithm, called ParsingPhrase, which effectively extracts combination phrases from text and improves phrase quality using syntactic features. It consists of three stages. First, all sentences in the text are represented as parsing trees using a PCFG (Probabilistic Context-Free Grammar), and we propose PBMP (Parsing-Based Phrase mining) to obtain candidate phrases from those parsing trees. Second, we introduce a new phrase evaluation indicator, called Significance, which relies on the role of a phrase to measure its importance, and we integrate Significance with conventional evaluation indexes for a more reasonable phrase evaluation. Finally, we further optimize phrase quality by exploiting the optimal phrase composition features of sentences. To the best of our knowledge, this is the first work to employ parsing for combination phrase mining and evaluation, while also offering a solution for syntactic disambiguation. Experiments on three real corpora demonstrate that ParsingPhrase outperforms state-of-the-art baselines, with a 7% higher candidate phrase conversion rate and 6% higher Precision.
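To illustrate the contrast the abstract draws between N-gram candidates and parse-tree candidates, the following is a minimal sketch and not the authors' PBMP procedure. It assumes a hand-written constituency tree stored as nested `(label, children)` tuples and treats multi-word `NP` constituents as candidates. The names `leaves`, `constituent_phrases`, `ngrams`, and `toy_tree` are hypothetical, and the choice of `NP` as the target label is an assumption made only for this example.

```python
# Toy comparison: candidates from parse-tree constituents vs. from N-grams.
# A tree node is either a word (str) or a (label, children) tuple.
# The tree below is written by hand; a real pipeline would obtain it from a parser.

def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def constituent_phrases(tree, labels=("NP",)):
    """Collect multi-word spans whose constituent label is in `labels`."""
    if isinstance(tree, str):
        return []
    label, children = tree
    found = []
    words = leaves(tree)
    if label in labels and len(words) > 1:
        found.append(" ".join(words))
    for child in children:
        found.extend(constituent_phrases(child, labels))
    return found


def ngrams(words, max_n):
    """Return every contiguous word sequence of length 2..max_n."""
    return [
        " ".join(words[i:i + n])
        for n in range(2, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


# Hand-written parse of "the parser extracts quality phrases" (an assumption).
toy_tree = ("S", [
    ("NP", [("DT", ["the"]), ("NN", ["parser"])]),
    ("VP", [
        ("VBZ", ["extracts"]),
        ("NP", [("JJ", ["quality"]), ("NNS", ["phrases"])]),
    ]),
])

tree_candidates = constituent_phrases(toy_tree)
ngram_candidates = ngrams(leaves(toy_tree), max_n=3)
```

In this toy sentence the tree-based pass yields only the two noun-phrase constituents, while the N-gram pass also produces sequences that cross constituent boundaries, such as "parser extracts quality". Scoring and filtering the candidates, which the abstract covers in its second and third stages, are not modelled here.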

Full Text