Abstract

Sequential pattern mining, which finds frequent sequential patterns in a sequence database, is an important data mining problem. However, existing sequential pattern mining considers only the order in which items are purchased, not the location where each item is purchased. In this paper, we develop a sequential pattern mining algorithm using Apache Spark. The proposed algorithm finds frequent sequential patterns in parallel by distributing the data across several machines. We conducted a comprehensive performance study of the proposed algorithm on synthetic data sets while varying several parameters. The experimental results show that the proposed algorithm speeds up linearly with the number of machines.

Highlights

  • The development of information technology and of the computer and internet industries has increased the need to handle large amounts of data in modern society

  • We propose a location-based sequential pattern mining algorithm, based on PrefixSpan, to handle location data

  • We evaluate the performance of our two proposed sequential pattern mining algorithms, Naïve Location-based PrefixSpan (NLPS) and MapReduce Location-based PrefixSpan (MRLPS)


Summary

INTRODUCTION

The development of information technology and of the computer and internet industries has increased the need to handle large amounts of data in modern society. We have developed a sequential pattern mining algorithm that considers the purchase location, using the MapReduce programming model in a Hadoop distributed environment. Problem: Given a database that contains m location-based sequences and a specified minimum support δ, find all frequent sequential patterns in the database. Most earlier algorithms generated sequential pattern candidates with an Apriori-style method. This approach had to tally a large set of candidate sequence patterns and had to scan the database multiple times to find long sequential patterns. To solve this problem, the PrefixSpan (Prefix-projected Sequential pattern mining) algorithm was proposed. The support count of β in α-projected database D|r, denoted as supportD|r (β), is the number of sequences γ in D|r
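To make the prefix-projection idea concrete, the following is a minimal single-machine sketch of PrefixSpan in Python. It is not the authors' NLPS or MRLPS algorithm. It rests on several assumptions that do not come from the paper: each sequence element is a single hypothetical (location, item) pair rather than an itemset, the minimum support δ is given as an absolute count, and the function name `prefixspan` and the toy database `db` are illustrative only.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern: support}, where a pattern is a tuple of elements.

    Assumption: each element is a hashable (location, item) pair and
    min_support is an absolute number of sequences.
    """
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start offset) pairs, one per
        # sequence containing the prefix; the suffix starts at the offset.
        counts = defaultdict(int)
        for seq_id, start in projected:
            # A set ensures each element counts once per sequence.
            for element in set(sequences[seq_id][start:]):
                counts[element] += 1

        for element, support in counts.items():
            if support < min_support:
                continue
            pattern = prefix + (element,)
            results[pattern] = support

            # Build the projected database for the extended prefix:
            # keep the suffix after the first occurrence of the element.
            new_projected = []
            for seq_id, start in projected:
                seq = sequences[seq_id]
                for pos in range(start, len(seq)):
                    if seq[pos] == element:
                        new_projected.append((seq_id, pos + 1))
                        break
            mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical toy database of location-based purchase sequences.
db = [
    [("A", "milk"), ("B", "bread"), ("A", "eggs")],
    [("A", "milk"), ("A", "eggs")],
    [("B", "bread"), ("A", "milk"), ("A", "eggs")],
]
patterns = prefixspan(db, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(pattern, support)
```

Because the location is part of each element, the same item bought at two different locations is treated as two distinct elements. The sketch stores projected databases as offsets into the original sequences instead of copying suffixes, which avoids materializing each projection; unlike the Apriori-style approach, no candidate set is generated and only the projected suffixes are rescanned.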

ALGORITHMS
NAÏVE APPROACH
EXPERIMENTS
CONCLUSION