Chapter 3 - Histograms: Looking at the Distribution of Data

Andrew F Siegel, Michael R Wagner

doi:10.1016/b978-0-12-820025-4.00003-8

Abstract

In this chapter, you will learn how to make sense of a list of numbers by visually interpreting its histogram, a picture whose bars rise above the number line so that tall bars show where the data are concentrated. A histogram helps you answer the following kinds of questions:

One: What values are typical in this data set? Look at the numbers below the tall histogram bars, which indicate where there are many data values.

Two: How different are the numbers from one another? Look at how spread out the histogram bars are.

Three: Are the data values strongly concentrated near some typical value? Look to see if the tall bars are close together.

Four: What is the pattern of concentration? In particular, do data values “trail off” at the same rate at lower values as they do at higher values? Look to see if you have a symmetric, bell-shaped “normal” distribution or, instead, a skewed distribution with histogram bars trailing off differently on the left and right. You will learn how to ignore ordinary randomness when making this judgment. Skewness is common with business data that have many small-to-moderate data values and fewer very large values (think sizes of companies, with lots of small-to-medium-sized companies and then a couple of very large ones like Google, Microsoft, and Apple). If you find skewness, you might consider transforming the data, perhaps by replacing data values with their logarithms, to make the distribution more normal-shaped. This helps with the validity of statistical methods we will learn in later chapters, although transformation adds complexity to the interpretation of the results.

Five: Do you have two groups of data (a bimodal distribution) in your histogram? Look to see if there is a separation between two groups of histogram bars. You might choose to analyze these groups separately and explore the reason for their differences. You might even find three or more groups.

Six: Are there special data values (outliers) very different from the rest that might require special treatment? Look for a short histogram bar, separated from the rest of the data, representing each outlier. Outliers can cause trouble: one outlier can change a statistical summary so much that the summary no longer describes the rest of the data. You will therefore want to identify outliers and fix them if they are simply copying errors. If they are not errors, you might delete them (but only if they are not part of what you wish to analyze), and you might analyze the data both with and without the outlier(s) to see the extent of their effects.
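As a rough illustration of the questions about skewness, logarithms, and outliers, the following Python sketch bins a small data set before and after a log transformation and compares summaries with and without the largest value. The company sizes are hypothetical numbers made up for this example, and the helper `histogram_counts`, the choice of five equal-width bins, and the base-10 logarithm are assumptions of the sketch rather than anything prescribed by the chapter.

```python
import math
from statistics import mean, median


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def print_bars(label, counts):
    """Print a crude text histogram, one row of '#' characters per bin."""
    print(label)
    for i, c in enumerate(counts):
        print(f"  bin {i}: {'#' * c}")


# Hypothetical company sizes (illustrative only): many small values, one huge one.
sizes = [12, 15, 18, 20, 22, 25, 30, 35, 40, 55, 70, 90, 150, 400, 5000]

# On the raw scale almost everything piles into the first bin (skewed data).
raw_counts = histogram_counts(sizes, 5)
print_bars("Raw values", raw_counts)

# Replacing each value with its logarithm spreads the data across more bins.
log_counts = histogram_counts([math.log10(s) for s in sizes], 5)
print_bars("log10 values", log_counts)

# Compare summaries with and without the largest value to gauge its effect.
without_largest = sorted(sizes)[:-1]
mean_with, mean_without = mean(sizes), mean(without_largest)
median_with, median_without = median(sizes), median(without_largest)
print(f"mean:   with = {mean_with:.1f}, without = {mean_without:.1f}")
print(f"median: with = {median_with}, without = {median_without}")
```

In this made-up data set the mean shifts substantially when the largest value is dropped while the median barely moves, which is the kind of with-and-without comparison the abstract suggests for judging an outlier's effect.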

Full Text