Abstract

MapReduce is a widely adopted computing framework for data-intensive applications running on clusters. We propose an approach to exploit data parallelisms in XML processing using MapReduce in Hadoop. Our solution seamlessly integrates data storage, labelling, indexing, and parallel queries to process a massive amount of XML data. Specifically, we introduce an SDN labelling algorithm and a distributed hierarchical index using DHTs, we develop an efficient data retrieval approach called B-SLCA. More importantly, we design an advanced two-phase MapReduce solution that is able to efficiently address the issues of labelling, indexing, and query processing on big XML data. We implemented our solution on a real-world Hadoop cluster processing the real-world datasets. Our experimental results show that SDN outperforms NCIM by up to a factor of 1.36 with an average of 1.17, our BSLCA outperforms BwdSLCA by up to a factor of 1.96 with an average of 1.2.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call