The State of the Art in Symbolic Data Analysis: Overview and Future

Edwin Diday

doi:10.1002/9780470723562.ch1

Abstract

Databases are now ubiquitous in industrial companies and public administrations, and they often grow to an enormous size. They contain units described by variables that are often categorical or numerical (the latter can, of course, also be transformed into categories). It is then easy to construct categories and their Cartesian product. In symbolic data analysis these categories are considered to be the new statistical units, and the first step is to obtain these higher-level units and to describe them in a way that takes their internal variation into account. What do we mean by ‘internal variation’? For example, the age of one player in a football team is 32, but the age of the players in the team (considered as a category) varies between 22 and 34; the height of the mushroom that I have in my hand is 9 cm, but the height of the species (considered as a category) varies between 8 and 15 cm. A more general example is a clustering process applied to a huge database in order to summarize it. Each cluster obtained can be considered as a category, and each variable therefore takes varying values inside each category. Symbolic data, represented by structured variables, lists, intervals, distributions and the like, store the ‘internal variation’ of categories better than standard data, which they generalize. ‘Complex data’ are defined as structured data, mixtures of images, sounds, text, categorical data, numerical data, etc. Symbolic data can therefore be induced from categories of units described by complex data (see Section 1.4.1), and complex data describing units can be considered as a special case of symbolic data describing higher-level units.
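
The passage from individual units to higher-level units can be illustrated with a minimal Python sketch. It is not taken from the chapter: the player records, the variable names and the helper `to_symbolic` are all hypothetical, and the choice of an interval for a numerical variable and a frequency distribution for a categorical one is only one simple way of recording internal variation.

```python
from collections import Counter, defaultdict

# Hypothetical first-level units: individual players (values invented for illustration).
players = [
    {"team": "A", "age": 32, "position": "defender"},
    {"team": "A", "age": 22, "position": "forward"},
    {"team": "A", "age": 34, "position": "defender"},
    {"team": "B", "age": 25, "position": "goalkeeper"},
    {"team": "B", "age": 29, "position": "forward"},
]


def to_symbolic(units, category_key, numeric_vars, categorical_vars):
    """Aggregate individual units into one symbolic description per category.

    Numerical variables become (min, max) intervals; categorical variables
    become relative-frequency distributions. Both choices are assumptions
    made for this sketch.
    """
    # Group the individual units by the category they belong to.
    groups = defaultdict(list)
    for unit in units:
        groups[unit[category_key]].append(unit)

    symbolic = {}
    for category, members in groups.items():
        description = {}
        for var in numeric_vars:
            values = [m[var] for m in members]
            description[var] = (min(values), max(values))
        for var in categorical_vars:
            counts = Counter(m[var] for m in members)
            description[var] = {k: c / len(members) for k, c in counts.items()}
        symbolic[category] = description
    return symbolic


# Each team is now a statistical unit described by symbolic values.
teams = to_symbolic(players, "team", ["age"], ["position"])
print(teams["A"]["age"])       # interval of ages observed in team A
print(teams["A"]["position"])  # distribution of positions in team A
```

With these invented records, team A is described by the age interval from 22 to 34, matching the football example above, while a single player inside it still has one age such as 32.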

Full Text