Abstract

Cloud computing is increasingly regarded as a key enabler of the 'democratization of science', because on-demand, highly scalable cloud computing facilities enable researchers anywhere to carry out data-intensive experiments. In the context of natural language processing (NLP), algorithms tend to be complex, which makes their parallelization and deployment on cloud platforms a non-trivial task. This study presents a new, unique, cloud-based platform for large-scale NLP research: GATECloud.net. It enables researchers to carry out data-intensive NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. The platform handles important infrastructural issues transparently for the researcher: load balancing, efficient data upload and storage, deployment on the virtual machines, security and fault tolerance. We also include a cost-benefit analysis and usage evaluation.

Highlights

  • The continued growth of unstructured content and the availability of ever more powerful computers have resulted in an increased need for researchers in diverse fields to carry out language-processing and text-mining experiments on very large document collections

  • In the context of natural language-processing (NLP) research, large-scale algorithms are demonstrating increasingly superior results compared with approaches trained on smaller datasets, mostly thanks to addressing the data sparseness issue through collection of significantly larger numbers of naturally occurring linguistic examples [1]

  • We have developed a novel, unique, cloud-based platform for large-scale NLP research—GATECloud.net


Summary

Introduction

The continued growth of unstructured content and the availability of ever more powerful computers have resulted in an increased need for researchers in diverse fields (e.g. humanities, social sciences, bioinformatics) to carry out language-processing and text-mining experiments on very large document collections (or corpora). An additional impetus is the availability of key datasets, e.g. Wikipedia and Freebase snapshots, which can help with experimental repeatability. Many of these datasets are impossible to process in reasonable time on standard computers such as desktop machines or individual servers. NLP algorithms tend to be complex, which makes deployment on cloud platforms a specialized, non-trivial task, with its own associated costs in terms of the significant time overhead and expertise required. To answer these challenges, we have developed a novel, unique, cloud-based platform for large-scale NLP research: GATECloud.net. It aims to give researchers access to specialized software and enables them to carry out large-scale NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. The study concludes with a number of use cases and evaluation experiments (§5) and a discussion of future work (§6).

Large-scale text mining and compute clouds
Towards a natural language-processing PaaS: requirements and methodology
Use cases and experiments
Conclusions and future work
