Abstract

As language corpora play an increasingly important role in Artificial Intelligence (AI) research, many extremely large corpora have been created. However, a larger corpus not only increases statistical power and accuracy but also introduces redundancy. Researchers have therefore turned their attention to appropriate subset-extraction methods. The trade-off between data sufficiency and redundancy gives rise to the interesting and challenging problems studied in this paper: (1) How can the resulting subset include as much data as possible under the necessary constraints? (2) How can the potentially useful semantic relatedness in the original corpus be preserved while its size is reduced? Existing work on these problems mainly focuses on constructing particular subsets for specific uses and is therefore limited in scope. In this paper we address both problems. First, considering the cubic (three-way) and binary (pairwise) semantic relatedness among tokens, we construct a general system model and formulate the combined problem as a cubic pseudo-Boolean optimization problem. Then, by analyzing the characteristics of the objective function, we transform the problem into a maximum-flow problem on a corresponding graph. Third, we propose a new algorithm based on a discrete Lagrangian iteration method. We further prove that the objective function is supermodular, which allows fast minimum-cut algorithms to be used at each iteration step, yielding a second fast algorithm. Finally, we experimentally validate the new algorithms on several randomly generated corpora.
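To illustrate the minimum-cut machinery the abstract refers to, the sketch below (not the paper's own algorithm) minimizes a pairwise (quadratic) pseudo-Boolean function whose pairwise terms satisfy the standard submodularity condition, via the classical reduction to an s-t minimum cut; maximizing a supermodular objective corresponds to minimizing the negated, submodular one. The function name minimize_pairwise_energy, the table format, and the use of networkx are illustrative assumptions; the paper's method additionally handles cubic terms and the Lagrangian iteration, which are omitted here.

    import networkx as nx

    def minimize_pairwise_energy(num_vars, unary, pairwise):
        """Minimize E(x) = sum_i unary[i]*x_i + sum_{(i,j)} theta_ij(x_i, x_j)
        over x in {0,1}^num_vars.  Each pairwise table theta_ij = (A, B, C, D)
        lists the values at (x_i, x_j) = (0,0), (0,1), (1,0), (1,1) and must
        satisfy B + C >= A + D (submodularity for minimization)."""
        G = nx.DiGraph()
        s, t = "s", "t"
        G.add_node(s)
        G.add_node(t)
        const = 0.0
        lin = {i: float(unary[i]) for i in range(num_vars)}

        # Decompose each pairwise table into a constant, two linear terms,
        # and one nonnegative term w * (1 - x_i) * x_j.
        for (i, j), (A, B, C, D) in pairwise.items():
            w = B + C - A - D
            if w < 0:
                raise ValueError("pairwise term (%d, %d) is not submodular" % (i, j))
            const += A
            lin[i] += C - A
            lin[j] += D - C
            if w > 0:
                G.add_edge(i, j, capacity=w)  # cut iff x_i = 0 and x_j = 1

        # Linear term lam * x_i: edge s -> i (cut iff x_i = 1) when lam > 0;
        # negative lam is rewritten as lam + |lam| * (1 - x_i), edge i -> t.
        for i, lam in lin.items():
            if lam > 0:
                G.add_edge(s, i, capacity=lam)
            elif lam < 0:
                const += lam
                G.add_edge(i, t, capacity=-lam)

        cut_value, (_, sink_side) = nx.minimum_cut(G, s, t)
        x = [1 if i in sink_side else 0 for i in range(num_vars)]
        return const + cut_value, x

    if __name__ == "__main__":
        # Toy instance: two tokens with a pairwise term that rewards agreement.
        unary = [1.0, -2.0]
        pairwise = {(0, 1): (0.0, 1.0, 1.0, 0.0)}
        value, assignment = minimize_pairwise_energy(2, unary, pairwise)
        print(value, assignment)  # minimum energy -1.0 at x = [1, 1] or [0, 1]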
