Abstract
Cloud computing is increasingly being regarded as a key enabler of the 'democratization of science', because on-demand, highly scalable cloud computing facilities enable researchers anywhere to carry out data-intensive experiments. In the context of natural language processing (NLP), algorithms tend to be complex, which makes their parallelization and deployment on cloud platforms a non-trivial task. This study presents a new, unique, cloud-based platform for large-scale NLP research: GATECloud.net. It enables researchers to carry out data-intensive NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. Important infrastructural issues are dealt with by the platform, completely transparently for the researcher: load balancing, efficient data upload and storage, deployment on the virtual machines, security and fault tolerance. We also include a cost-benefit analysis and usage evaluation.
Highlights
The continued growth of unstructured content and the availability of ever more powerful computers have resulted in an increased need for researchers in diverse fields to carry out language-processing and text-mining experiments on very large document collections.
In the context of natural language-processing (NLP) research, large-scale algorithms are demonstrating increasingly superior results compared with approaches trained on smaller datasets, mostly thanks to addressing the data sparseness issue through collection of significantly larger numbers of naturally occurring linguistic examples [1].
We have developed a novel, unique, cloud-based platform for large-scale NLP research: GATECloud.net.
Summary
The continued growth of unstructured content and the availability of ever more powerful computers have resulted in an increased need for researchers in diverse fields (e.g. humanities, social sciences, bioinformatics) to carry out language-processing and text-mining experiments on very large document collections (or corpora). An additional impetus is the availability of key datasets, e.g. Wikipedia and Freebase snapshots, which can help with experimental repeatability. Many of these datasets are impossible to process in reasonable time on standard computers such as desktop machines or individual servers. NLP algorithms tend to be complex, which makes deployment on cloud platforms a specialized, non-trivial task, with its own associated costs in terms of significant time overhead and expertise required. To answer these challenges, we have developed a novel, unique, cloud-based platform for large-scale NLP research: GATECloud.net. It aims to give researchers access to specialized software and enables them to carry out large-scale NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. The study concludes with a number of use cases and evaluation experiments (§5) and a discussion of future work (§6).
More From: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences