Abstract

In today’s modern world, we’re experiencing a substantial increase in the use of data in various fields, and this has necessitated the use of distributed systems to consume and process Big Data. Machine learning tends to benefit from the usage of Big Data, and the models generated from such techniques tend to be more effective. However, there is a steep learning curve to getting used to handling Big Data, as traditional data management tools fail to perform well. Distributed systems have become popular, where the task of data processing is split amongst various nodes in clusters. SQL, is a popular database management language popular to data scientists. It is often given second class support, where SQL can be embedded into a primary language of use (e.g. SQL in Scala for Spark), which allows for using SQL but one still needs to know the primary language of the platform (Scala, as per the example, or ECL in HPCC Systems). It may also be present as a supported language. In either case, using useful tooling such as Visualizing data and creating and using machine learning models become difficult, as the user needs to fall back to the primary language of the system. In the proposed work, a new SQL-like language, HSQL, an open source distributed systems solution, was developed for allowing new users to get used to its distributed architecture and the ECL language, with which it primarily operates with (which was chosen as a target). Additionally, a program that could translate HSQL-based programs to ECL for use was made. HSQL was made to be completely inter-compatible with ECL programs, and it was able to provide a compact and easy to comprehend SQL-like syntax for performing general data analysis, creation of Machine learning models and visualizations while allowing a modular structure to such programs.

Highlights

  • Data has become an essential resource in this age of computing, where a lot of the advancements and innovations we see are based upon models which require huge amounts of data to be built

  • A design for HSQL is shown which presents a concrete syntax and an implementation for a compiler that translates the specification to ECL, the language used in HPCC Systems

  • As HSQL is intended to be for data analysts, ECL was chosen as a target due to its highly optimizing compiler and that machine learning performs exceptionally fast in HPCC Systems, even outperforming configured Hadoop for the first iteration of many configured Machine Learning Algorithms

Read more

Summary

INTRODUCTION

Data has become an essential resource in this age of computing, where a lot of the advancements and innovations we see are based upon models which require huge amounts of data to be built. There are places where SQL have first-class support, but here access to valuable tools such as visualizations and working with Machine Learning Models as well as commonplace language features get restricted, which have become commonplace and important since the time SQL was developed. Targeted to work with ECL, HSQL compliments ECL very well by providing an abstraction that is easy to use by data scientists who are already familiar with SQL; where data scientists can still take their time to learn the more complex and powerful ECL language for any complex solution they may require. A design for HSQL is shown which presents a concrete syntax and an implementation for a compiler that translates the specification to ECL, the language used in HPCC Systems

HPCC SYSTEMS AND ECL
DESIGN AND IMPLEMENTATION
Defining the HSQL Syntax
Language Recognition
Transpiler
CONCLUSIONS
Limitations and Future
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call