Automating and utilising equal-distribution data classification

Gennady Andrienko, Natalia Andrienko, Ibad Kureshi, Kieran Lee, Ian Smith, Toni Staykova

doi:10.1080/23729333.2020.1863000

Open Access

Abstract

Data classification, i.e. organising data items into groups (classes), is a general technique widely used in data visualisation and cartography, in particular for creating choropleth maps. Conventionally, data are classified by dividing the data range into intervals and assigning the same symbol or colour to all data falling within an interval. For instance, the intervals may be of the same length or may include the same number of data items. We propose a method for defining intervals so that some quantity represented by values of another attribute is equally distributed among the classes. This kind of classification supports exploratory analysis of relationships between the attribute used for the classification and the distribution of the phenomenon whose quantity is represented by the additional attribute. The approach may be especially useful when the distribution of the phenomenon is very unequal, with many data items having zero or low quantities and only a few items having larger quantities. With such a distribution, standard statistical analysis of the relationships may be problematic. We demonstrate the potential of the approach by analysing data referring to a set of spatially distributed people (patients) in relation to characteristics of the areas in which the people live.
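
The idea of choosing intervals so that a second quantity is shared equally among the classes can be illustrated with a minimal Python sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes the items are sorted by the classification attribute, the additional quantity is accumulated, and a break is placed wherever the running total first reaches k/n of the overall total. The function names `equal_distribution_breaks` and `class_totals`, and the sample data, are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def equal_distribution_breaks(values, quantities, n_classes):
    """Return n_classes - 1 upper class bounds on `values`.

    Assumption (not taken from the paper): each break is the attribute
    value at which the cumulative sum of `quantities`, taken in order of
    increasing `values`, first reaches k/n_classes of the total.
    """
    pairs = sorted(zip(values, quantities))
    cumulative = list(accumulate(q for _, q in pairs))
    total = cumulative[-1]
    breaks = []
    for k in range(1, n_classes):
        target = total * k / n_classes
        i = bisect_left(cumulative, target)  # first item reaching the target
        breaks.append(pairs[i][0])
    return breaks


def class_totals(values, quantities, breaks):
    """Sum `quantities` per class; a value equal to a break falls below it."""
    totals = [0] * (len(breaks) + 1)
    for v, q in zip(values, quantities):
        totals[bisect_left(breaks, v)] += q
    return totals


# Hypothetical data: an area attribute and a very unequal patient count.
values = [1, 2, 3, 4, 5, 6, 7, 8]
quantities = [0, 0, 1, 0, 0, 9, 0, 10]

breaks = equal_distribution_breaks(values, quantities, 2)
print(breaks)                                    # [6]
print(class_totals(values, quantities, breaks))  # [10, 10]
```

With strongly skewed quantities and many classes, several breaks can coincide, which signals that an exactly equal split is not achievable for that number of classes.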

Full Text