Abstract

Neural text matching models have been widely used in community question answering, information retrieval, and dialogue. However, these models, designed for short texts, cannot well address the long-form text matching problem: many contexts in long-form texts cannot be directly aligned with each other, and it is difficult for existing models to capture the key matching signals from such noisy data. Besides, these models are computationally expensive because they simply use all textual data indiscriminately. To tackle these effectiveness and efficiency problems, we propose a novel hierarchical noise filtering model, namely Match-Ignition. The main idea is to plug the well-known PageRank algorithm into the Transformer to identify and filter noisy information at both the sentence and word level during matching. Noisy sentences are usually easy to detect, because previous work has shown that their similarity can be explicitly evaluated by word overlap, so we directly use PageRank to filter such information based on a sentence similarity graph. Unlike sentences, words rely on their contexts to express concrete meanings, so we propose to jointly learn the filtering and matching processes to capture the critical word-level matching signals. Specifically, a word graph is first built from the attention scores in each self-attention block of the Transformer, and keywords are then selected by applying PageRank on this graph. In this way, noisy words are filtered out layer by layer during matching. Experimental results show that Match-Ignition outperforms both state-of-the-art short text matching models and recent long-form text matching models. We also conduct a detailed analysis showing that Match-Ignition efficiently captures important sentences and words, facilitating the long-form text matching process.
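
To make the hierarchical filtering concrete, the sketch below illustrates both PageRank-based steps in plain Python/NumPy. It is a minimal illustration under stated assumptions, not the authors' implementation: the Jaccard word-overlap similarity, the damping factor of 0.85, the fixed number of power iterations, and the top-k selection thresholds are all choices made for this example. In the paper, the word graph is built from the Transformer's own self-attention scores; here that matrix is simply passed in as an argument.

```python
# Illustrative sketch of Match-Ignition's two filtering steps.
# Assumptions for this example (not from the paper): Jaccard overlap for
# sentence similarity, damping d=0.85, 50 power iterations, top-k selection.
import numpy as np

def pagerank(adj: np.ndarray, d: float = 0.85, iters: int = 50) -> np.ndarray:
    """Power-iteration PageRank over a dense weighted adjacency matrix."""
    n = adj.shape[0]
    # Column-normalize edge weights into transition probabilities.
    col_sums = adj.sum(axis=0, keepdims=True)
    trans = adj / np.where(col_sums == 0, 1.0, col_sums)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1.0 - d) / n + d * trans @ rank
    return rank

def filter_sentences(sentences: list[str], keep: int) -> list[str]:
    """Sentence-level filtering: PageRank on a word-overlap similarity graph."""
    token_sets = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                union = token_sets[i] | token_sets[j]
                sim[i, j] = len(token_sets[i] & token_sets[j]) / max(len(union), 1)
    scores = pagerank(sim)
    top = sorted(np.argsort(scores)[-keep:])  # keep original sentence order
    return [sentences[i] for i in top]

def filter_words(attention: np.ndarray, keep: int) -> np.ndarray:
    """Word-level filtering: PageRank on a word graph whose edge weights are
    the self-attention scores of one Transformer layer (e.g. head-averaged)."""
    scores = pagerank(attention)
    return np.sort(np.argsort(scores)[-keep:])  # indices of words to retain

if __name__ == "__main__":
    doc = [
        "Match-Ignition filters noise hierarchically.",
        "PageRank runs on a sentence similarity graph.",
        "The weather was nice yesterday.",
        "Attention scores define a word graph per layer.",
    ]
    print(filter_sentences(doc, keep=3))
    attn = np.random.rand(6, 6)  # stand-in for one layer's attention matrix
    print(filter_words(attn, keep=4))
```

In the model itself, the word-level step runs inside every self-attention block, so the set of retained words shrinks layer by layer rather than being pruned once up front.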
