Abstract

Crafting scalable analytics to extract actionable business intelligence is a challenging endeavour, requiring multiple layers of expertise and experience. Often, this expertise is irreconcilably split between an organisation’s engineers and subject matter domain experts. Previous approaches to this problem have relied on technically adept users with tool-specific training. Such an approach faces a number of challenges: Expertise: there are few data-analytic subject domain experts with in-depth technical knowledge of compute architectures; Performance: analysts do not generally make full use of the performance and scalability capabilities of the underlying architectures; Heterogeneity: calculating the most performant and scalable mix of real-time (on-line) and batch (off-line) analytics in a problem domain is difficult; Tools: supporting frameworks often address several tasks, including composition, planning, code generation, validation, performance tuning and analysis, but do not typically provide end-to-end solutions embedding all of these activities. In this paper, we present a novel semi-automated approach to the composition, planning, code generation and performance tuning of scalable hybrid analytics, using a semantically rich type system which requires little programming expertise from the user. This approach is the first of its kind to permit domain experts with little or no technical expertise to assemble complex and scalable analytics, for hybrid on- and off-line analytic environments, with no additional requirement for low-level engineering support. This paper describes (i) an abstract model of analytic assembly and execution, (ii) goal-based planning and (iii) code generation for hybrid on- and off-line analytics.
An implementation, through a system which we call Mendeleev, is used to (iv) demonstrate the applicability of this technique through a series of case studies, where a single interface is used to create analytics that can be run simultaneously over on- and off-line environments. Finally, we (v) analyse the performance of the planner, and (vi) show that the performance of Mendeleev’s generated code is comparable with that of hand-written analytics.
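The abstract describes goal-based planning over processing elements (PEs) annotated with a semantically rich type system. As an informal illustration only (the PE names, types and search strategy below are hypothetical, not Mendeleev's actual model), such a planner can be sketched as a search for a chain of typed PEs that transforms the available source types into the requested goal type:

```python
# Hypothetical sketch of goal-based planning over typed PEs.
# Every name and type here is illustrative, not the paper's schema:
# each PE declares the semantic types it consumes and the type it
# produces, and a plan is a PE chain that reaches the goal type.
from collections import deque

# (name, required input types, output type)
PES = [
    ("CrawlUsers",  frozenset(),               "UserStream"),
    ("ExtractGeo",  frozenset({"UserStream"}), "GeoTag"),
    ("ClusterGeo",  frozenset({"GeoTag"}),     "GeoCluster"),
    ("LoadPersons", frozenset(),               "PersonRecord"),
]

def plan(goal, sources):
    """Breadth-first search for a PE chain producing `goal`."""
    queue = deque([(frozenset(sources), [])])
    seen = set()
    while queue:
        types, chain = queue.popleft()
        if goal in types:
            return chain
        if types in seen:
            continue
        seen.add(types)
        for name, needs, out in PES:
            if needs <= types and out not in types:
                queue.append((types | {out}, chain + [name]))
    return None  # goal unreachable from the given sources

print(plan("GeoCluster", {"UserStream"}))  # ['ExtractGeo', 'ClusterGeo']
```

Breadth-first search returns a shortest chain; a production planner would additionally weigh the cost and runtime annotations of each PE.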

Highlights

  • This paper presents a new approach to this problem, providing a framework in which domain experts can compose and deploy efficient and scalable hybrid analytics without prior engineering knowledge

  • It is important to note that the creation of this knowledge-base is beyond the scope of this research: it is assumed that engineers in organisations with a need for an analytic planning system are willing to undertake the manual annotation of the processing elements (PEs) they make available to their users
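The knowledge-base mentioned above consists of manually annotated PEs. As a purely hypothetical sketch (the field names and values below are illustrative, not the paper's actual schema), a single annotation and a chaining check might look like:

```python
# Hypothetical shape of two PE annotations in such a knowledge-base;
# every field name and value is illustrative, not the paper's schema.
extract_geo = {
    "name": "ExtractGeo",
    "inputs": ["UserStream"],              # semantic types consumed
    "output": "GeoTag",                    # semantic type produced
    "runtimes": ["on-line", "off-line"],   # supported targets
}
cluster_geo = {
    "name": "ClusterGeo",
    "inputs": ["GeoTag"],
    "output": "GeoCluster",
    "runtimes": ["off-line"],
}

def compatible(producer, consumer):
    """Two PEs can be chained if the producer's output type is among
    the consumer's inputs and they share at least one runtime."""
    return (producer["output"] in consumer["inputs"]
            and bool(set(producer["runtimes"]) & set(consumer["runtimes"])))

print(compatible(extract_geo, cluster_geo))  # True
```

Annotations of this kind are what would let a planner reason about composition and runtime placement without further engineering input.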

Introduction

The engineer understands parallel architectures and how to build scalable systems, while the domain expert understands the detailed semantics of their data and the appropriate queries on that data. If user data is being crawled, for example, a streaming (on-line) analytic engine such as Apache Storm [2] or IBM InfoSphere Streams [27] might be employed for subset A, while person data in subset B might reside in an HDFS (Hadoop Distributed File System) [32] data store. Each of these runtime environments specifies its own programming model, optimisation constraints and engineering best practices. This complexity increases when constructing a hybrid analytic which makes use of data from multiple runtimes: should subset C of this Flickr analytic be executed in an on- or off-line runtime environment, and which configuration would be most performant and scalable? The remainder of this paper is structured as follows: Section 2 describes related work; Section 3 outlines the high-level approach adopted in this research and the implications of design choices; Sections 4 and 5 detail our approach to modelling analytics and planning their execution respectively; Section 6 describes the process of efficient code generation; Section 7 illustrates the application of this approach through four case studies; Sections 8 and 9 provide a performance evaluation of this framework and conclude the paper.
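The hybrid targeting described above can be illustrated with a deliberately simplified sketch (this is an assumption for illustration, not Mendeleev's actual generator): the same abstract PE chain is lowered to skeleton code for either an on-line, Storm-style topology or an off-line, Hadoop-style job.

```python
# Illustrative sketch (assumed, not the paper's generator) of lowering
# one abstract PE chain to skeleton code for two different runtimes.
def generate(chain, runtime):
    if runtime == "on-line":
        # each PE becomes a stage in a streaming topology
        return "; ".join(f"topology.addBolt({pe})" for pe in chain)
    # each PE becomes a stage in a batch pipeline
    return "; ".join(f"job.addStage({pe})" for pe in chain)

print(generate(["ExtractGeo", "ClusterGeo"], "on-line"))
# topology.addBolt(ExtractGeo); topology.addBolt(ClusterGeo)
```

The point of the sketch is that placement (on- or off-line) is a late, per-subset decision over a single abstract description, which is what makes the configuration question above tractable.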

Related work
High-level overview
Methodology
Impact of design choices
Modelling analytics
PE formalism
PE model abstraction
Goal-based planning
Type closure
Conditions
Code generation
DSL code generation
Native code generation
Integrating complex analytics
Case studies
Case study
Performance evaluation
PE Used
Runtime performance
Latency over time
Conclusions & further work