Syntactically-informed representation for sentence selection

Maria Fernanda Caropreso

doi:10.20381/ruor-19872

Abstract

Sentence Selection consists of identifying the sentences relevant to a particular topic, task, user or linguistic structure. It is a prerequisite step in many document-processing tasks, such as Information Extraction and Text Summarization. Researchers in these areas have typically borrowed ideas and tools from the Automatic Text Categorization (ATC) domain and applied them to their Sentence Selection problems. This is the case with the standard Bag of Words text representation and the machine learning algorithms used. Although Sentence Selection and ATC are related, they do not share all their characteristics. Because of these differences, some variations on the standard representations and techniques used for ATC might be beneficial for Sentence Selection. Consequently, the main contribution of this thesis is the exploration of the benefits of a syntactically and semantically enriched text representation for the Sentence Selection task in technical domains. We further take advantage of the syntactic and semantic relations between words by moving from propositional learners to a relational model. In particular, we experiment on three document datasets, two in the Genetics domain and one in the Legal domain. For the two Genetics datasets we incorporate semantic knowledge by means of hierarchical dictionaries, while for the Legal dataset we use Named Entity Recognition for the same purpose. The syntactic knowledge is obtained from automatic parsing. Bags of words, enriched with the syntactic and semantic features, are given as input to different classifier induction algorithms. Sentences to be selected constitute the positive class, while the remaining sentences of a document constitute the negative class. We present results with the state-of-the-art algorithms Naive Bayes and Support Vector Machine, as well as the relational learner Aleph.
We evaluate the learning performance over several runs of N-fold cross-validation and over time-based training/testing splits, and we study the implications of the classification threshold when appropriate. We show the gains of enriching the representation syntactically or semantically, analyze the cases in which each kind of enrichment is more beneficial, and explain the particular contribution of each.
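
As a rough illustration of the pipeline described above, the following minimal sketch enriches a bag of words with syntactic features and feeds it to a Naive Bayes classifier whose positive-class probability is compared against a classification threshold. It is a sketch under stated assumptions, not the implementation used in the thesis: the toy sentences, the `dep:head>dependent` feature encoding, the helper names `enrich` and `NaiveBayes`, and the threshold value are all hypothetical, and the dependency pairs are written by hand to stand in for the output of an automatic parser.

```python
import math
from collections import Counter


def enrich(tokens, dependencies):
    """Bag of words plus one feature per (head, dependent) pair.

    `dependencies` stands in for parser output; this pair encoding is an
    illustrative assumption, not the feature scheme of the thesis.
    """
    features = Counter(tokens)
    for head, dependent in dependencies:
        features[f"dep:{head}>{dependent}"] += 1
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over feature counts."""

    def fit(self, bags, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for bag, label in zip(bags, labels):
            self.counts[label].update(bag)
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def positive_probability(self, bag, positive=1):
        """Posterior probability of the positive (selected) class."""
        log_scores = {}
        for c in self.classes:
            score = math.log(self.prior[c])
            denom = self.totals[c] + len(self.vocab)
            for feat, n in bag.items():
                if feat in self.vocab:  # features unseen in training are skipped
                    score += n * math.log((self.counts[c][feat] + 1) / denom)
            log_scores[c] = score
        top = max(log_scores.values())  # subtract the max for numerical stability
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        return exp[positive] / sum(exp.values())


# Toy training set (invented): label 1 = sentence to select, 0 = remaining sentences.
train = [
    (["brca1", "binds", "rad51"], [("binds", "brca1"), ("binds", "rad51")], 1),
    (["p53", "activates", "mdm2"], [("activates", "p53"), ("activates", "mdm2")], 1),
    (["samples", "were", "stored"], [("stored", "samples")], 0),
    (["we", "thank", "reviewers"], [("thank", "reviewers")], 0),
]
bags = [enrich(tokens, deps) for tokens, deps, _ in train]
labels = [label for _, _, label in train]
model = NaiveBayes().fit(bags, labels)

THRESHOLD = 0.5  # hypothetical value; raising it trades recall for precision
query = enrich(["brca1", "binds", "p53"], [("binds", "brca1"), ("binds", "p53")])
selected = model.positive_probability(query) >= THRESHOLD
print("selected" if selected else "not selected")
```

Varying `THRESHOLD` changes how confident the classifier must be before a sentence is placed in the positive class, which is one way to read the threshold study mentioned above.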
