Improving corpus reproducibility through modular text transformations and connected data set

Jonathan Pulliza, Chirag Shah

doi:10.1002/pra2.2018.14505501159

Abstract

The Enron Email Corpus is one of the most utilized collections of documents in Natural Language Processing, Machine Learning, and Network Analysis. Different groups of researchers have transformed the corpus, changing the content and format to meet their needs. The many distinct versions can all claim to be the Enron Email Corpus, though they are as distinct from the original publicly available collection as they are from each other. Researchers then have to determine the usefulness of a particular version in comparison to the many others available, as well as ascertain what has been done to the collection and how it would affect their specific research goal. This is especially important for reproducing a particular research method onto a different corpus, as transposing a method necessitates a deep understanding of the original data in the experiment. This project models the various transformations performed on different versions of the collection to form a network of connected datasets, highlighting the most important nodes as well as the most common transformations. Traversing different paths between nodes offers the community a way to model and reproduce the necessary data work performed on one collection that can be transposed onto other collections.
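The network described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the version names (`original`, `v_a`, ...) and transformation labels (`deduplicate`, `strip_headers`, `remove_attachments`) are hypothetical placeholders, and node degree stands in as one simple, assumed proxy for node importance. Corpus versions are nodes, each transformation is a labeled directed edge, and a breadth-first search recovers the sequence of transformations that leads from one version to another.

```python
from collections import Counter, deque

# Hypothetical edges: (source version, derived version, transformation label).
# These are illustrative placeholders, not versions catalogued in the paper.
EDGES = [
    ("original", "v_a", "remove_attachments"),
    ("original", "v_b", "deduplicate"),
    ("v_a", "v_c", "deduplicate"),
    ("v_b", "v_d", "strip_headers"),
    ("v_c", "v_e", "strip_headers"),
]


def build_graph(edges):
    """Build an adjacency list mapping each version to (derived version, label) pairs."""
    graph = {}
    for src, dst, label in edges:
        graph.setdefault(src, []).append((dst, label))
        graph.setdefault(dst, [])  # make sure leaf versions appear as nodes
    return graph


def transformation_counts(edges):
    """Count how often each transformation label appears in the network."""
    return Counter(label for _, _, label in edges)


def degree(edges):
    """Count incoming plus outgoing edges per version (an assumed importance proxy)."""
    deg = Counter()
    for src, dst, _ in edges:
        deg[src] += 1
        deg[dst] += 1
    return deg


def transformation_path(graph, start, goal):
    """Breadth-first search; return the labels on a shortest path, or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, labels = queue.popleft()
        if node == goal:
            return labels
        for nxt, label in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, labels + [label]))
    return None
```

Calling `transformation_path(build_graph(EDGES), "original", "v_e")` returns the ordered list of transformation labels along the path. In principle, the same ordered list could then be replayed as modular steps on a different collection.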

Journal: Proceedings of the Association for Information Science and Technology
Publication Date: Jan 1, 2018
Citations: 1

