Abstract

With the emergence of commodity hardware architectures and distributed open-source software, users are performing analytics on more types of data. Web 2.0 applications such as social networking sites must manage large volumes of metadata that often cannot fit in main memory. Currently, application programmers are responsible for manually mapping these in-memory data structures onto persistent storage systems such as databases or file systems. Ideally, the underlying programming language or middleware would manage such scalable data structures seamlessly. Traditional databases and storage controller systems are increasingly impractical for storing this metadata for reasons of cost and scale. As a result, new NoSQL database architectures are emerging that are built on commodity hardware and can scale to large sizes incrementally. This creates an opportunity for the builders of NoSQL systems to provide scalable in-memory data structures; however, such data structure interfaces are not currently available in the popular Hadoop NoSQL infrastructure. In this paper, we show how to implement the Set data structure and its operations in a scalable manner on top of Hadoop HBase. We then propose and implement optimizations for three Set operations, and we discuss the limitations of implementing this data structure in the Hadoop ecosystem. We evaluate our algorithms and optimizations on a real Hadoop cluster. Our primary conclusion is that the Hadoop ecosystem provides an excellent framework for implementing scalable data structures.
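The abstract does not describe the paper's actual mapping of Set operations onto HBase. One plausible sketch, purely an assumption on our part, is to store each element as a row key, so that membership becomes a point Get, insertion a Put, and deletion a Delete. The `KVSet` class and its dict-backed store below are hypothetical stand-ins for an HBase table handle, not the paper's implementation:

```python
# Hypothetical sketch (not the paper's code): a Set layered on a key-value
# store in the style of HBase, where each element is a row key. A plain
# Python dict stands in for the HBase table so the sketch is runnable.

class KVSet:
    """Set backed by a key-value store; the dict mimics an HBase table."""

    def __init__(self):
        self._table = {}  # real code would hold an HBase Table handle

    def add(self, element):
        # HBase analogue: Put on row key `element` with a dummy cell value
        self._table[element] = b"\x00"

    def contains(self, element):
        # HBase analogue: Get on row key `element`; row presence = membership
        return element in self._table

    def remove(self, element):
        # HBase analogue: Delete on row key `element`
        self._table.pop(element, None)

    def union(self, other):
        # Union is one of the Set operations the paper optimizes; this is
        # only a naive merge, not one of the paper's optimized variants.
        result = KVSet()
        for e in self._table:
            result.add(e)
        for e in other._table:
            result.add(e)
        return result
```

Because elements are row keys, a real HBase-backed version would get membership tests and inserts at the cost of one RPC each, while set-wide operations such as union would require scans, which is presumably where optimization opportunities arise.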
