Abstract

Big Data poses challenges for text analysis and natural language processing due to the volume, veracity, and velocity of the data. The sheer number of documents challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage, and data representation must work together to provide rapid access, search, and mining of the deep knowledge in large text collections. Text under copyright poses additional barriers to computational access, where analysis must be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data, particularly when data veracity is questionable due to the age of the original materials. Data velocity is the rate of change of the data, but it can also be the rate at which changes and corrections are made.

The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP, and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards a digital library of content from research libraries around the country. With close to 11 million volumes in the HathiTrust collection, HTRC aims to provide large-scale computational access and analytics for these text resources. With the goal of facilitating scholars' work, HTRC has established a cyberinfrastructure of software, staff, and services to help researchers and developers process and mine large-scale textual data effectively and efficiently. The primary users of HTRC are digital humanities scholars, informaticians, and librarians. Because these users come from different research backgrounds and bring varied expertise, a variety of tools are made available to them.

In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based. Moving algorithms to the data is important because the copyrighted content must be protected; a side benefit is that the paradigm frees scholars from managing a large corpus of data.

The text analytics currently supported in HTRC are the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag clouds, topic modeling, Naive Bayes, and Date Entities to Simile Timeline.

In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a Big Data corpus and what those challenges mean for data storage, access, and large-scale computation. HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for NLP, text mining, and IR research: with such a large amount of textual data, many candidate algorithms, and support for researcher-contributed algorithms, many interesting research questions emerge and many interesting results are to follow.
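To make the kind of analysis such workflows perform concrete, the following is a minimal, generic sketch of topic modeling over a handful of toy page texts. It uses scikit-learn and is not SEASR or HTRC code; the sample texts, topic count, and parameters are illustrative assumptions only.

    # Hypothetical sketch (not HTRC/SEASR code): topic modeling over a
    # tiny stand-in "workset" of page texts, illustrating the style of
    # non-consumptive analysis described above.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    pages = [
        "the whale ship sailed from nantucket harbor",
        "the harpoon struck the white whale at sea",
        "the senate passed the railroad appropriations bill",
        "congress debated the tariff and the railroad act",
    ]

    # Bag-of-words term counts; real worksets would hold volume or
    # page-level features extracted behind the copyright boundary.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(pages)

    # Fit a 2-topic LDA model and print the top terms per topic.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts)
    terms = vectorizer.get_feature_names_out()
    for i, weights in enumerate(lda.components_):
        top = [terms[j] for j in weights.argsort()[::-1][:4]]
        print(f"topic {i}: {', '.join(top)}")

In an HTRC-style deployment, the scholar would see only derived outputs like these topic-term lists, not the underlying copyrighted text, which is the point of moving the algorithm to the data.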
