Real-time user clickstream behavior analysis based on Apache Storm streaming

Gautam Pal, Katie Atkinson, Gangmin Li

doi:10.1007/s10660-021-09518-4


Open Access

https://doi.org/10.1007/s10660-021-09518-4


Abstract

This paper presents an approach to analyzing consumers’ e-commerce site usage and browsing motifs through pattern mining and surfing behavior. User-generated clickstream data is first stored in a client-side browser. We build an ingestion pipeline that captures the high-velocity data stream from the client-side browser through Apache Storm, Kafka, and Cassandra. Given the consumer’s usage pattern, we uncover the user’s browsing intent through n-gram and collocation methods. An innovative clustering technique is constructed using the Expectation-Maximization algorithm with a Gaussian Mixture Model. We discuss a framework for predicting a user’s clicks from past click sequences using higher-order Markov chains. We developed our model on top of a big data Lambda Architecture, which combines a high-throughput Hadoop batch setup with a low-latency real-time framework over a large distributed cluster. Based on this approach, we developed an experimental setup with an optimized Storm topology and improved Cassandra database latency to achieve real-time responses. The theoretical claims are corroborated by several evaluations on a Microsoft Azure HDInsight Apache Storm deployment and on the DataStax distribution of Cassandra. The paper demonstrates that the proposed techniques help with user experience optimization, building recently viewed product lists, market-driven analyses, and allocation of website resources.

Highlights

E-commerce sites track consumers’ browsing patterns simultaneously in real time and in batch mode
This paper introduces novel techniques in clickstream data analytics to uncover key customer journeys through pattern mining using n-grams and Student’s t-test, which distinguishes between regular patterns and special sequences [40, 48]
We present a near real-time data storage and processing approach to analyze streams of data with Apache Storm and the Cassandra NoSQL datastore
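
As a rough illustration of the n-gram and t-test highlight, the sketch below scores adjacent page pairs (bigrams) in click sessions. It assumes the standard collocation t-statistic from the text-mining literature, with the sample variance approximated by the observed bigram frequency; the paper's exact formulation is not given on this page, and the function name `bigram_t_scores` and the page labels are hypothetical.

```python
import math
from collections import Counter


def bigram_t_scores(sessions):
    """Return {(page_a, page_b): t} for every adjacent page pair in the sessions.

    Assumption: t = (observed - expected) / sqrt(observed / n), where
    'expected' is the bigram frequency if the two pages occurred independently.
    """
    unigrams = Counter()
    bigrams = Counter()
    for pages in sessions:
        unigrams.update(pages)
        # Pair each page with its successor inside the same session only.
        bigrams.update(zip(pages, pages[1:]))

    n = sum(bigrams.values())        # number of observed bigram positions
    total = sum(unigrams.values())   # number of observed page views

    scores = {}
    for (a, b), count in bigrams.items():
        observed = count / n
        expected = (unigrams[a] / total) * (unigrams[b] / total)
        # Variance of a rare Bernoulli event is approximately its mean.
        scores[(a, b)] = (observed - expected) / math.sqrt(observed / n)
    return scores


if __name__ == "__main__":
    demo_sessions = [
        ["home", "search", "product", "cart"],
        ["home", "search", "product"],
        ["home", "deals", "product", "cart"],
    ]
    for pair, t in sorted(bigram_t_scores(demo_sessions).items(), key=lambda kv: -kv[1]):
        print(pair, round(t, 3))
```

A larger positive score suggests the pair occurs together more often than independent browsing would predict; with real traffic the threshold for calling a sequence "special" would have to be chosen from a t-distribution table or by validation.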

Summary

Introduction

E-commerce sites track consumers’ browsing patterns simultaneously in real time and in batch mode. Mining browsing motifs to display personalized recommendations, together with near-real-time tracking of recently viewed products, greatly enhances the overall user experience and helps generate revenue. Near-real-time processing of a large data pool to create unique, personalized, contextual experiences requires quick analysis of incoming data before it is even stored in the database of record. In a big data real-time setting, instead of waiting for data to be gathered in its entirety at a long periodic batch interval, streaming analysis lets us detect patterns and draw informed conclusions from them as soon as data starts arriving. Apache Storm is a popular real-time distributed processing framework that lets users process incoming data in flight, before it reaches the database.
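
To make the idea of in-flight processing concrete, here is a minimal sketch of a per-user "recently viewed products" tracker that updates its state as each click event arrives, rather than after a batch load. It is plain Python and does not use the Apache Storm API; in a Storm topology, logic like this would typically live inside a bolt. The class name `RecentlyViewed`, its methods, and the event fields are hypothetical.

```python
from collections import OrderedDict, defaultdict


class RecentlyViewed:
    """Keeps the most recent distinct products viewed by each user."""

    def __init__(self, capacity=5):
        self.capacity = capacity
        # user_id -> OrderedDict of product_id, oldest first
        self._by_user = defaultdict(OrderedDict)

    def on_event(self, user_id, product_id):
        """Update state for one click event as it arrives."""
        viewed = self._by_user[user_id]
        viewed.pop(product_id, None)    # a repeat view moves the product to the newest slot
        viewed[product_id] = True
        if len(viewed) > self.capacity:
            viewed.popitem(last=False)  # evict the oldest product

    def recent(self, user_id):
        """Return product ids, newest first; empty list for unknown users."""
        viewed = self._by_user.get(user_id, OrderedDict())
        return list(reversed(viewed))


if __name__ == "__main__":
    tracker = RecentlyViewed(capacity=2)
    stream = [("u1", "p1"), ("u1", "p2"), ("u2", "p9"), ("u1", "p1"), ("u1", "p3")]
    for user_id, product_id in stream:
        tracker.on_event(user_id, product_id)
    print(tracker.recent("u1"))
```

Because each event touches only one user's small list, the update cost stays constant per click, which is the property that makes this kind of state suitable for stream processing before anything is written to a database.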

Methods

Discussion

Conclusion