Natural language processing on noisy text

Rui Dong

doi:10.17760/d20416553

Abstract

Over the past decades, online textual content has grown explosively, along with various types of noise. For example, digital libraries might include documents with OCR errors introduced when printed materials are digitized for easy access and retrieval. User-published web content such as tweets and blogs might contain typing errors, grammatical errors, and factual errors. Motivated by increasing concerns about noisy text, this thesis aims to contribute techniques for reducing the impact of errors in textual data on downstream tasks. We propose to provide a quantitative understanding of the effect of noise on recurrent neural network language models and to train noise-tolerant language models for typing prediction. We also investigate decreasing the influence of noise by correcting errors in texts. An unsupervised approach is proposed to correct OCR errors by exploiting repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. We further propose to tackle a more complicated and thus more challenging source of noise--factual errors--by automatic fact checking. Specifically, we aim to detect factual errors related to quantity mentions in textual content given tabular data as evidence. We propose to adapt Table Parsing (TAPAS), an extension of BERT pre-trained on structured data, to solve this problem. We investigate the effects of different ways of encoding table structure and numerical information on fact verification accuracy. Different pre-training data and tasks are also compared when the TAPAS model is fine-tuned for the table-based fact verification task. Last but not least, we focus on verifying complex statements that involve numerical reasoning over tabular data. We propose to apply semantic parsing to parse texts into executable logical forms and use an execution component to better handle numerical reasoning.
Our proposed semantic parsing model, a structure-aware T5 model trained on optimized logical forms, is shown to be more effective than state-of-the-art TAPAS-based classification and semantic parsing models on the table-based fact verification task. --Author's abstract
