Abstract

When an author composes a new piece of text, they often rely on prior work for sources of inspiration and for specific text and ideas to paraphrase, respond to, or potentially copy. The nature of documents' relationships to their sources varies across genres of writing: academic papers formally cite and discuss their sources, news articles quote and paraphrase interviews and press releases, and a tweet accrues retweets and replies. This thesis studies aspects of this source-derived relationship, focusing on two questions: what makes a source document or specific source passage worthy of attention in derived documents, and how do authors of derived documents use and transform source content to fit their needs? We first analyze the source selection process, investigating which factors, textual or otherwise, influence a document's likelihood of being used as a source by later documents. We study this selection problem in the field of science journalism, analyzing which kinds of scientific articles receive coverage in the news. Second, we study how authors edit and adapt source passages in their derived works, focusing on the task of intrinsic source attribution: inferring which portions of a derived document were adapted from an unobserved source document. We again focus on science journalism, investigating how journalists reuse and adapt content from press releases in their health science news articles. Next, we broaden our focus beyond journalism, studying general methods for detecting passages in derived works that reuse and adapt ideas and text from source documents. Unlike our study of intrinsic source attribution, this work explores the setting where both the source and derived documents are observed at training and inference time. Thus, instead of using only the language of the derived documents to make predictions, models must learn to lexically and semantically align passages across source and derived documents.
Through an extensive set of experiments, we study the trade-offs between bag-of-words and neural models, gaining insight into the factors that influence model performance on different datasets. Our final two works explore a narrower type of source-derived relationship: direct quotation. First, we design a novel quote recommendation task in which models must learn to recommend relevant paragraphs and quotes from a source document to authors of new documents based on the content they have already written. We again explore applications to journalism, evaluating our quote recommendation models on news articles that report on and quote from a collection of source presidential speeches. Finally, we investigate the textual factors that influence passage-level quotability. We cast the problem as passage ranking and explore applications of feature-based and neural models to multiple datasets spanning several source genres (e.g., poetry, books, speeches, essays) and languages (English, Latin).--Author's abstract
