Abstract
Portals and gateways increasingly offer users complex interfaces for interacting with massive data sets. As dealing with big data becomes more commonplace, portal and gateway developers need to readdress how data is stored and rethink the supporting infrastructure that enables quick, simple access to and analysis of that data. It is becoming evident that traditional relational databases are not always the most appropriate solution for giving users on-demand access to big data sets. In this study we show that non-relational, NoSQL databases such as key-value stores and document stores can offer large benefits in performance, accessibility, and availability. We present a use case from the TeraGrid User Portal that demonstrates solutions for processing and auditing user job data efficiently in order to provide users rapid access to this data.

One of the goals of the TeraGrid User Portal is to offer users and PIs detailed job statistics, such as service unit (SU) usage and job history, via the user portal interface. While building a portal application to analyze batch job data records in the TeraGrid Central Database (TGCDB), we quickly ran into stumbling blocks. The TGCDB holds over 17 million job records from December 2003 through March 2011; between January 2011 and April 2011 alone, over 2.8 million job records were added. This data is growing at an ever-faster rate and will continue to grow as new computing resources become available. Even properly indexed tables took longer to query than a responsive portal application allows. Our interim solution was to cache the jobs query results and serve those cached results from the portal. This addressed the speed of the query but not the underlying problem of dealing with a massive data set: we still needed the rich query interface that a database provides.

To solve these issues we evaluated two options. First, we tested moving the TGCDB to a newer, faster machine than the one it currently runs on, to determine how much of the bottleneck was due to aging hardware. Second, we tested migrating the jobs data out of the relational PostgreSQL TGCDB and into a document store, Apache CouchDB, in place of the flat-file cache we had been using. CouchDB is a document-oriented database that is queried using MapReduce. CouchDB also offers specific benefits for portals and gateways, providing a RESTful JSON API accessed via HTTP requests. Illustrative sketches of both access patterns follow below.

Our initial tests show that moving the TGCDB to new hardware provides an average query speedup of 3.7x for the job queries we tested. Querying the same data using MapReduce queries against CouchDB gave an additional 8.24x speedup, for a total average speedup of 30.6x over the current TGCDB. The large speedups offered by CouchDB come at the cost of additional disk usage: CouchDB maintains B-tree indices on the document store as well as on any defined queries, or "views". These indices use more disk space than a relational database would, but they enable CouchDB to take full advantage of high-performance disks and file systems.

We show that the performance gained by using a data warehouse for certain large data sets can offer great benefits when building on-demand data analysis tools in portals and gateways. By identifying such large data sets, like the TeraGrid jobs data, and migrating them to high-performance data stores such as CouchDB, we can make much more information readily available to users.
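For illustration, the kind of per-user aggregation the portal needs might look like the hypothetical query below. This is a minimal sketch, not the portal's actual code: the connection string and the `jobs` table with its `username`, `su_charged`, and `submit_time` columns are all assumptions, and the real TGCDB schema differs. Even with appropriate indexes, computing aggregates like this over millions of rows at request time is what motivated caching the results.

```python
import psycopg2  # PostgreSQL driver; the TGCDB is a PostgreSQL database

# Hypothetical schema: one row per batch job record.
QUERY = """
    SELECT username, SUM(su_charged) AS total_sus, COUNT(*) AS job_count
      FROM jobs
     WHERE submit_time >= %s AND submit_time < %s
     GROUP BY username
     ORDER BY total_sus DESC;
"""

# Connection parameters are placeholders, not the production settings.
conn = psycopg2.connect("dbname=tgcdb")
with conn, conn.cursor() as cur:
    # Aggregate SU usage per user for the January-March 2011 window.
    cur.execute(QUERY, ("2011-01-01", "2011-04-01"))
    for username, total_sus, job_count in cur:
        print(username, total_sus, job_count)
```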
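The CouchDB side replaces that ad hoc SQL with a precomputed MapReduce view queried over HTTP. The sketch below is again hypothetical in its specifics: the database name `tg_jobs` and the document fields `username` and `su_charged` are assumptions, though the design-document and `_view` endpoints are CouchDB's standard REST API.

```python
import json
import urllib.request

COUCH = "http://localhost:5984"  # hypothetical CouchDB instance
DB = "tg_jobs"                   # hypothetical database of job documents

def request(method, path, body=None):
    """Small helper around CouchDB's RESTful JSON-over-HTTP API."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(COUCH + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# A design document holding a MapReduce view. The map function emits one
# row per job keyed by username; the built-in _sum reduce totals the SUs.
design = {
    "views": {
        "su_by_user": {
            "map": "function(doc) { emit(doc.username, doc.su_charged); }",
            "reduce": "_sum",
        }
    },
}
request("PUT", f"/{DB}/_design/jobs", design)

# Query the view, grouping by key to get total SU usage per user. CouchDB
# serves the result from the view's persistent B-tree index, so repeated
# portal requests read precomputed totals instead of re-aggregating.
result = request("GET", f"/{DB}/_design/jobs/_view/su_by_user?group=true")
for row in result["rows"]:
    print(row["key"], row["value"])
```

Because CouchDB updates a view's index incrementally as new job documents arrive, grouping per resource or per month instead is just a matter of changing the emitted key; this is the property that lets the portal trade extra disk for consistently fast reads.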