This work presents the task of text polishing, which generates a sentence more graceful than the input while preserving its semantic meaning. Text polishing is of great practical value and is an important component of modern writing assistance systems; however, the task remains understudied in the literature. Further research in this direction requires formal task definitions, benchmark datasets, and strong baseline models. In this work, we formulate the task as a context-dependent text generation problem and conduct a case study on text polishing with Chinese idioms. To circumvent the difficulty of annotating task data, we propose a semi-automatic data construction pipeline based on human-machine collaboration and build a large-scale text polishing dataset of 1.5 million instances. We propose two types of task-specific pre-training objectives for text polishing and implement a series of Transformer-based baseline models pre-trained on a massive Chinese corpus. We conduct extensive experiments with these baselines on the constructed dataset and report several major findings. A human evaluation further demonstrates the polishing ability of the final system.