ZoomNet for Topic-Oriented Fragment Recognition in Long Documents

Yukun Yan, Zhengdong Lu, Sen Song, Daqi Zheng

doi:10.1109/access.2022.3166235

Open Access

https://doi.org/10.1109/access.2022.3166235

Abstract

This work introduces a new information extraction task called Topic-Oriented Fragment Recognition (TOFR), whose goal is to recognize information related to a specific topic in long documents from professional fields. In this paper, we introduce two TOFR datasets to study the problems of processing long documents. We propose a novel neural framework named Zooming Network (ZoomNet), which overcomes the challenge of combining information over long distances with limited computing resources by flexibly switching between skimming and intensive reading when processing long documents. In general, ZoomNet first establishes a hierarchical representation aligned to the text structure, which relieves the conflict between local information and extensive contextual information. Then, it synthesizes different levels of information to assign tags via multi-scale actions. We combine supervised and reinforcement learning methods to train our model. Experiments show that the proposed model outperforms several state-of-the-art sequence labeling models, including BiLSTM-CRF, BERT, XLNet, RoBERTa, and ELECTRA, on both TOFR datasets by large margins.
