Abstract: The authors introduce a novel statistical modeling technique for cluster analysis and apply it to financial data. Their two main goals are to handle missing data and to find homogeneous groups within the data. Their approach is flexible and handles large and complex data structures with missing observations and with both quantitative and qualitative measurements. The authors achieve this result by mapping the data to a new structure that is free of distributional assumptions in choosing homogeneous groups of observations. Their new method also provides insight into the number of different categories needed for classifying the data. The authors use this approach to partition a matched sample of stocks, one group of which offers dividend reinvestment plans while the other does not. Their method partitions this sample with almost 97 percent accuracy even when using only easily available financial variables. One interpretation of their result is that the misclassified companies are the best candidates either to adopt a dividend reinvestment plan (if they have none) or to abandon one (if they currently offer one). The authors offer other suggestions for applications in the field of finance.

JEL classification: G20, G29, G35

Key words: dividend reinvestment, Bayesian analysis, Gibbs sampler, clustering

Analyzing Imputed Financial Data: A New Approach to Cluster Analysis

1. Introduction

We introduce and apply a novel statistical approach to cluster analysis for financial data in this paper. We have two main goals. First, we wish to handle cases in which a subset of variables is missing for some observations. Second, we wish to find homogeneous groups within the data. Put differently, we want to determine the most likely number of categories that make up the data, and to assign observations to those categories optimally. Our approach is flexible in that it handles large and complex data structures with missing observations and with both quantitative and qualitative measurements.
We achieve this by mapping the data to a new structure that is free of distributional assumptions in choosing homogeneous groups of observations. For example, when processing customers' credit card transactions, a company may want to explore the possibility of encouraging different or additional transactions by those customers. In this case, the task is to find homogeneous transactions and to forecast the willingness of a new customer to use the credit card for a different or additional transaction, even if the data are not continuous and even if there are missing data. Our new method also provides the researcher with insight into the number of different categories needed for classifying the data.

Classification methods have a long history of productive uses in business and finance. Perhaps the most common are discrete choice models. Among these, the multinomial logit approach has been used at least as far back as Holman and Marley (in Luce and Suppes, 1965). McFadden (1978) introduced the Generalized Extreme Value model in his study of residential location, and Koppelman and Wen (1997) have recently developed newer variations. The nested logit model of Ben-Akiva (1973) is designed to handle correlations among alternatives. Yet another variation of the multinomial logit has been developed and used by Bierlaire, Lotan and Toint (1997). More recently, Calhoun and Deng (2000) use multinomial logit models to study loan terminations.

Another form of discrete choice model is cluster analysis. Shaffer (1991) offers one example. He studies federal deposit insurance funding and considers its influence on taxpayers. Dahlstedt, Salmi, Luoma, and Laakkonen (1994) use cluster analysis to demonstrate that comparing financial ratios across firms is problematic. They argue that care is necessary even when the firms belong to the same official International Standard Industrial Classification category.
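The paper's specific model is not reproduced in this excerpt, so the following is only a rough illustration of the general idea it describes: clustering observations into homogeneous groups while imputing missing values inside a Bayesian sampler. The sketch below uses a deliberately simple two-component Gaussian mixture with unit variances and simulated one-dimensional data; the data, function name, and priors are all illustrative assumptions, not the authors' method.

```python
import numpy as np

# Simulated example: two latent groups in one financial variable, with
# roughly 10% of observations missing at random (all values hypothetical).
rng = np.random.default_rng(0)
n = 200
z_true = rng.integers(0, 2, n)
x_full = np.where(z_true == 0,
                  rng.normal(-2.0, 1.0, n),
                  rng.normal(2.0, 1.0, n))
x_obs = x_full.copy()
x_obs[rng.random(n) < 0.10] = np.nan

def gibbs_mixture(x_obs, n_iter=400, seed=1):
    """Gibbs sampler for a two-component Gaussian mixture with unit
    variances; missing entries are re-imputed from their currently
    assigned component at every sweep."""
    rng = np.random.default_rng(seed)
    x = x_obs.copy()
    miss = np.isnan(x)
    x[miss] = 0.0                      # crude starting imputation
    n = len(x)
    mu = np.array([-1.0, 1.0])         # starting component means
    pi = 0.5                           # mixing weight for component 0
    for _ in range(n_iter):
        # 1. Sample group assignments given means and mixing weight.
        log_p0 = np.log(pi) - 0.5 * (x - mu[0]) ** 2
        log_p1 = np.log(1.0 - pi) - 0.5 * (x - mu[1]) ** 2
        p0 = 1.0 / (1.0 + np.exp(log_p1 - log_p0))
        z = (rng.random(n) > p0).astype(int)
        # 2. Sample component means (flat prior, unit variance).
        for k in (0, 1):
            members = x[z == k]
            m = members.mean() if len(members) else 0.0
            mu[k] = rng.normal(m, 1.0 / np.sqrt(max(len(members), 1)))
        # 3. Sample the mixing weight from its Beta posterior.
        pi = rng.beta(1 + (z == 0).sum(), 1 + (z == 1).sum())
        # 4. Impute missing values from their current component.
        x[miss] = rng.normal(mu[z[miss]], 1.0)
    # Final hard assignment by posterior mode.
    z = (np.log(1.0 - pi) - 0.5 * (x - mu[1]) ** 2
         > np.log(pi) - 0.5 * (x - mu[0]) ** 2).astype(int)
    return z, mu
```

On well-separated simulated data like this, the sampler recovers the two group means and classifies most observed points correctly; the extra credit of the approach the paper describes is that imputation and clustering inform each other at every iteration rather than imputation being a one-time preprocessing step.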
von Altrock (1995) explains how fuzzy logic, a variation of cluster analysis, can be useful in practical business applications. …