Spark Job Research Articles

Geographic outliers at GBIF (Global Biodiversity Information Facility) are a known problem. Outliers can be errors, coordinates with high uncertainty, or simply occurrences from an undersampled region. In data cleaning pipelines, outliers are often removed (even if they are legitimate points) because the researcher does not have time to verify each record one by one. Outlier points are usually occurrences that need attention. Currently, there is no outlier detection implemented at GBIF, and it is up to users to flag outliers themselves.

DBSCAN (a density-based algorithm for discovering clusters in large spatial databases with noise) is a simple and popular clustering algorithm. It uses two parameters, (1) a distance and (2) a minimum number of points per cluster, to decide whether something is an outlier. Since occurrence data can be very patchy, non-clustering distance-based methods will often fail (Fig. 1). DBSCAN does not need to know the expected number of clusters in advance. It does well using only distance and does not require additional environmental variables such as Bioclim.

Advantages of DBSCAN: it is simple; it is easy to understand; there are only two parameters to set; it scales well; no additional data sources are needed; users would understand how their data was changed.

Drawbacks: it only uses distance; parameter settings must be chosen; it is sensitive to sparse global sampling; it does not include any other relevant environmental information; it can only flag outliers outside of a point blob.

Outlier detection and error detection are different. If your goal is to produce a system with no false positives, it will fail.
While more complex, environmentally informed outlier detection methods (like reverse jackknifing (Chapman 2005)) might perform better for certain examples or even in general, DBSCAN performs adequately on almost everything despite being very simple. Currently I am using DBSCAN to find errors and assess dataset quality. It is a Spark job written in Scala (GitHub). It does not run on species with many (>30K) unique latitude-longitude points, since the current implementation relies on an in-memory distance matrix. However, around 99% of species (plants, animals, fungi) on GBIF have fewer than 30K unique lat-long points (2,283 of 222,993 species keys exceed this limit, about 1%). There are other implementations (example) that might scale to many more points. There are no immediate plans to include DBSCAN outliers as a data quality flag on GBIF, but it could be done fairly easily, since this type of method does not rely on any external environmental data sources and already runs on the GBIF cluster.
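The two-parameter behaviour described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the author's Scala Spark job. It assumes great-circle (haversine) distance and a full in-memory pairwise distance matrix, which is the kind of structure that stops scaling at tens of thousands of points. The names `haversine_km`, `dbscan`, `eps_km`, `min_pts`, and the sample coordinates are hypothetical.

```python
import math

NOISE = -1  # label given to points that belong to no cluster (outliers)

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, h)))

def dbscan(points, eps_km, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or NOISE."""
    n = len(points)
    # Full pairwise matrix: O(n^2) memory, fine for small n only.
    dist = [[haversine_km(p, q) for q in points] for p in points]
    # Each neighbourhood includes the point itself (distance 0).
    neighbours = [[j for j in range(n) if dist[i][j] <= eps_km] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep growing the cluster
    return labels

# Hypothetical occurrence points: four close together, one far away.
points = [(52.0, 5.0), (52.1, 5.1), (52.2, 4.9), (51.9, 5.2), (0.0, 0.0)]
labels = dbscan(points, eps_km=100.0, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
print(labels)    # [0, 0, 0, 0, -1]
print(outliers)  # [(0.0, 0.0)]
```

Here the isolated point is flagged because fewer than `min_pts` points lie within `eps_km` of it. Whether it is an error or a legitimate record from an undersampled region still has to be checked separately.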

Apache Spark is a framework that provides fast computations on Big Data using the MapReduce model. Cloud environments make Big Data processing more flexible, since they allow virtual clusters to be created on demand. One of the most powerful open-source cloud environments is OpenStack. The main goal of this project is to provide the ability to create virtual clusters with Apache Spark and other Big Data tools in OpenStack. There are three approaches to doing so. The first is to use the OpenStack REST APIs to create instances and then deploy the environment. This approach is used by the Apache Spark core team to create clusters in the proprietary Amazon EC2 cloud. Almost the same method has been implemented for OpenStack environments. However, because the OpenStack API changes frequently, this solution has been deprecated since the Kilo release. The second approach is to integrate virtual cluster creation as a built-in service for OpenStack. ISP RAS has provided several patches implementing a universal Spark Job engine for OpenStack Sahara and OpenStack Swift integration with Apache Spark as a drop-in replacement for Apache Hadoop. This approach allows Spark clusters to be used as a service in the PaaS service model. Since OpenStack releases are less frequent than Apache Spark releases, this approach may be inconvenient for developers who use the latest releases. The third solution uses Ansible for orchestration. We implement the solution in a loosely coupled way, so that any auxiliary tool can be added or even another cloud environment can be used. It also makes it possible to choose which Apache Spark and Apache Hadoop versions to deploy in virtual clusters. All the listed approaches are available under the Apache 2.0 license.

Articles published on Spark Job

Energy-aware scheduling for spark job based on deep reinforcement learning in cloud

Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments

Research on the Allocation Method of Regional Science and Technology Resources from the Perspective of Rationality

Deadline-Aware Cost Optimization for Spark

Outlier Detection at GBIF Using DBSCAN

Balance resource allocation for spark jobs based on prediction of the optimal resource

A gray-box modeling methodology for runtime prediction of Apache Spark jobs

Erratic server behavior detection using machine learning on streams of monitoring data

A rewrite-based optimizer for Spark

Comparative Analysis of Energy-Efficient Scheduling Algorithms for Big Data Applications

Adding data provenance support to Apache Spark.

Creating Apache Spark Virtual Clusters in Cloud Environments Using Orchestration Systems

Titian

Complement or competition: Latino employment in a nontraditional settlement area
