Abstract

When an author composes a new piece of text, they often rely on prior work for sources of inspiration and for specific text and ideas to paraphrase, respond to, or potentially copy. The nature of documents' relationships to their sources varies across genres of writing: academic papers formally cite and discuss their sources, news articles quote and paraphrase interviews and press releases, and a tweet accrues retweets and replies. This thesis studies aspects of this source-derived relationship, focusing on two questions: what makes a source document or specific source passage worthy of attention in derived documents, and how do authors of derived documents use and transform source content to fit their needs? We first analyze the source selection process, investigating which factors, textual or otherwise, influence a document's likelihood of being used as a source by later documents. We study this selection problem in the field of science journalism, analyzing which kinds of scientific articles receive coverage in the news. Second, we study how authors edit and adapt source passages in their derived works, focusing on the task of intrinsic source attribution: inferring which portions of a derived document were adapted from an unobserved source document. We again focus on science journalism, investigating how journalists reuse and adapt content from press releases in their health science news articles. Next, we broaden our focus beyond journalism, studying general methods for detecting passages in derived works that reuse and adapt ideas and text from source documents. Unlike our study of intrinsic source attribution, this work explores the setting where both the source and derived documents are observed at training and inference time. Thus, instead of using only the language of the derived documents to make predictions, models must learn to lexically and semantically align passages across source and derived documents.
Through an extensive set of experiments, we study the trade-offs between bag-of-words and neural models, gaining insight into the factors that influence model performance on different datasets. Our final two works explore a narrower type of source-derived relationship: direct quotation. First, we design a novel quote recommendation task in which models must learn to recommend relevant paragraphs and quotes from a source document to authors of new documents based on the content they have already written. We again explore applications to journalism, evaluating our quote recommendation models on news articles that report on and quote from a collection of source presidential speeches. Finally, we investigate the textual factors that influence passage-level quotability. We cast the problem as passage ranking and explore applications of feature-based and neural models to multiple datasets spanning several source genres (e.g., poetry, books, speeches, essays) and languages (English, Latin).--Author's abstract
