Noisy annotations have become a key issue limiting Document-level Relation Extraction (DocRE). Previous research addressed the problem through manual re-annotation. However, this handcrafted strategy is inefficient, incurs high human costs, and cannot be generalized to large-scale datasets. To address the problem, we construct a confidence-based Revision framework for DocRE (ReD), aiming to achieve high-quality automatic data revision. Specifically, we first introduce a denoising training module to recognize relational facts and prevent noisy annotations. Second, we equip a confidence-based data revision module to perform adaptive data revision for long-tail distributed relational facts. After the data revision, we design an iterative training module to create a virtuous cycle, which transforms the revised data into useful training data to support further revision. Using ReD, we propose ReD-DocRED, which consists of 101,873 revised annotated documents from DocRED. ReD-DocRED introduces 57.1% new relational facts, and models trained on ReD-DocRED achieve significant improvements in F1 scores, ranging from 6.35 to 16.55. The experimental results demonstrate that ReD can achieve high-quality data revision and, to some extent, replace manual labeling. ReD-DocRED is available at https://github.com/jc4357/ReD-DocRED.