Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset

Shutian Ma,Chengzhi Zhang,Jin Xu

doi:10.1007/s11192-018-2754-2

Abstract

Recently, a new form of structured summary on scientific papers is explored by grouping cited text spans from the reference paper. Its primary goal is to generate summaries based on the cited paper itself. Previously, traditional scientific summarization focused on citation-based methods by aggregating all citances that cite one unique paper without doing content-based citation analysis, while sometimes citations might differ between researchers or time slots. By investigating original text spans where scholars cited, the new method can reflect exact contributions of reference papers more. Therefore, how to identify cited text spans accurately becomes the first important problem to solve. Generally, it can be converted into finding the sentences in reference paper that is more similar with citation sentences. Taking it as a classification task, we investigate the potential of four actions to improve identification performance. Firstly, feature selections are conducted carefully according to multi-classifiers. Secondly, we apply sampling-based algorithms to preprocess class-imbalanced datasets. Since we integrated results via a weighted voting system, the third action is tuning parameters like, voting weights for multi-classifiers integration or running settings to see if we can improve performance further. Evaluation results show effectiveness of each action and demonstrate that researchers can take these actions for more accurate cited text spans identification when doing scientific summarization.

Full Text