Abstract

The multidocument summarization problem deals with extracting the main information and ideas from a set of related documents. A common solution is an extraction strategy that finds a small subset of sentences covering the most important information in the whole document set. Although many machine-learning-based methods have shown great promise, the lack of high-quality training data poses an inherent obstacle for them. Furthermore, because of the proliferation of low-quality documents on the Internet, existing summarization strategies based merely on statistical features perform poorly. In this article, we propose a new two-phase multidocument summarization strategy using content attention-based subtopic detection. First, inspired by the distance dynamics-based community detection mechanism, we extract subtopics from the document set by examining both the content attention of the sentences and their underlying semantic relations. Instead of a complicated neural attention mechanism, we propose a simple iteration-based content attention method for the subtopic detection task. Second, we formulate summarization across the subtopics as a combinatorial optimization problem of minimizing sentence distance and maximizing topic diversity. We prove the submodularity of this optimization problem, which allows us to propose a new multidocument summarization algorithm based on a greedy mechanism. Finally, we experimentally validate our new algorithms on the BBC news summary and wikiHow datasets. The results show that our new algorithms outperform state-of-the-art methods.
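The greedy mechanism for a monotone submodular objective can be sketched as follows. This is a minimal illustration only, not the paper's algorithm: it uses a toy bag-of-words cosine similarity as a stand-in for the paper's semantic sentence distances, and the names `greedy_summarize`, `subtopic_of`, and the particular coverage-plus-diversity score are illustrative assumptions.

```python
import math

def bag_of_words(s):
    """Toy term-frequency vector for a sentence (illustrative only)."""
    words = s.lower().split()
    return {w: words.count(w) for w in words}

def similarity(a, b):
    """Cosine similarity between two bag-of-words dicts, standing in for
    the paper's semantic sentence-distance measure."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def greedy_summarize(sentences, subtopic_of, k):
    """Greedily pick k sentences maximizing a submodular score:
    coverage (each sentence credited by its best match in the summary)
    plus subtopic diversity (number of distinct subtopics covered)."""
    vecs = [bag_of_words(s) for s in sentences]
    selected = []
    while len(selected) < k:
        best, best_gain = None, -1.0
        for i in range(len(sentences)):
            if i in selected:
                continue
            cand = selected + [i]
            coverage = sum(max(similarity(vecs[j], vecs[c]) for c in cand)
                           for j in range(len(sentences)))
            diversity = len({subtopic_of[c] for c in cand})
            gain = coverage + diversity
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return [sentences[i] for i in selected]
```

Because the objective is monotone submodular, this greedy selection enjoys the classical (1 - 1/e) approximation guarantee; in practice the diversity term steers the summary toward sentences from different subtopics.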
