A tm Plug-In for Distributed Text Mining in R

Stefan Theußl, Ingo Feinerer, Kurt Hornik

doi:10.18637/jss.v051.i05

Abstract

R has gained explicit text mining support with the tm package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data there is to analyze, the greater the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system, possibly spanning several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc, implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.
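As a rough illustration of the MapReduce programming model mentioned above, the following base R sketch counts term frequencies over a toy corpus. It does not use tm, tm.plugin.dc, or Hadoop, and the data and the names `docs`, `chunks`, `map_count`, and `reduce_count` are hypothetical, chosen only for this example. The chunks stand in for document blocks on a distributed file system, and everything here runs in memory on one machine.

```r
# Toy corpus: each element stands in for one document (hypothetical data).
docs <- c(d1 = "text mining in R",
          d2 = "distributed text mining",
          d3 = "mining large corpora in R")

# Split the corpus into chunks, loosely mimicking how a distributed file
# system spreads documents across machines (here: two in-memory chunks).
chunks <- split(docs, rep(1:2, length.out = length(docs)))

# Map step: count terms within one chunk, independently of the others.
map_count <- function(chunk) {
  terms <- unlist(strsplit(tolower(chunk), "[^a-z]+"))
  table(terms[nzchar(terms)])
}

# Reduce step: merge two partial count vectors by summing per term.
reduce_count <- function(a, b) {
  all_terms <- union(names(a), names(b))
  merged <- setNames(numeric(length(all_terms)), all_terms)
  merged[names(a)] <- merged[names(a)] + as.numeric(a)
  merged[names(b)] <- merged[names(b)] + as.numeric(b)
  merged
}

partial <- lapply(chunks, map_count)        # each call could run on its own node
term_freq <- Reduce(reduce_count, partial)  # combine the partial results
print(sort(term_freq, decreasing = TRUE))
```

Because each map call only sees its own chunk, no single step needs the whole corpus in memory; only the much smaller partial count vectors are brought together in the reduce step.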

Highlights

In the information age statisticians are confronted with an ever increasing amount of data stored electronically (Gantz et al. 2008).
This paper has introduced an approach applying the distributed programming paradigm MapReduce to advance feasibility and performance of suitable text mining tasks in R.
We showed that distributed memory systems can be effectively employed within this model to preprocess large data sets by adding layers to existing text mining infrastructure packages.

Summary

Introduction

In the information age statisticians are confronted with an ever increasing amount of data stored electronically (Gantz et al. 2008). In a recent publication in Science, Michel et al. (2011) use 15% of the digitized Google books content (4% of all books ever printed) to study the diffusion of regular English verbs and to probe the impact of censorship on a person’s cultural influence over time. This led to the advent of a new research field called Culturomics, the application of high-throughput data collection and analysis to the study of human culture.



Journal: Journal of Statistical Software
Publication Date: Jan 1, 2012
Citations: 19
License type: cc-by


