We propose a new method of constructing a variable-bin-width histogram that accommodates unbalanced sample distributions while retaining, on the whole, the advantages of both equal-width (EW) and equal-area (EA) histograms, which are widely used for data visualization and analysis. We formulate this as an optimal change point detection problem in which the bin boundaries are determined by minimizing the sum of either the absolute error or the squared error in each bin. The former, based on Distance Minimization (DM), is new; the latter, based on Variance Minimization (VM), is considered the state of the art. The constructed histograms can be used effectively to detect and visualize hidden outliers and anomalies by applying the interquartile range method in each bin. The final histograms are obtained by adjusting the bin boundaries and heights after removing the detected outliers and anomalies. We further propose a method to annotate the constructed bins when annotation data are given for each sample as a set of nominal variables, using z-scores computed with respect to their distribution within each bin. We applied our method to real vinyl greenhouse datasets and to two different sets of three synthetic datasets, and confirmed that both the DM and VM methods work as intended, that both can represent the sample distribution with fewer bins than the EW and EA methods, that the interquartile range method can detect anomalies as well as outliers, and that the terms selected for annotation are interpretable and reasonable. The EW and EA methods have contrasting properties. The DM and VM methods lie in between, with DM closer to EA and VM closer to EW. The DM method runs substantially faster than the VM method and performs slightly better in the outlier detection and annotation tasks.
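To make the formulation above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the samples are one-dimensional and sorted, that the number of bins k is given, that the VM cost of a bin is the sum of squared deviations from the bin mean, and that the DM cost is the sum of absolute deviations from the bin median. The function names (bin_cost, optimal_bins, iqr_outliers) and the exhaustive dynamic program are illustrative choices made here; the paper's actual optimization procedure, and in particular whatever makes DM faster than VM, is not reproduced.

```python
import numpy as np


def bin_cost(x, kind):
    """cost[i, j] = within-bin error of the sorted samples x[i:j] (j exclusive).

    Assumption: "VM" uses squared deviation from the bin mean and
    "DM" uses absolute deviation from the bin median.
    """
    n = len(x)
    cost = np.zeros((n + 1, n + 1))
    for i in range(n):
        for j in range(i + 1, n + 1):
            seg = x[i:j]
            if kind == "VM":
                cost[i, j] = np.sum((seg - seg.mean()) ** 2)
            else:
                cost[i, j] = np.sum(np.abs(seg - np.median(seg)))
    return cost


def optimal_bins(samples, k, kind="VM"):
    """Split the sorted samples into k contiguous bins with minimal total cost.

    Returns the sorted samples and the cut indices [0, ..., n];
    bin b holds x[cuts[b]:cuts[b + 1]].
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    cost = bin_cost(x, kind)
    # best[b, j]: minimal cost of covering the first j samples with b bins.
    best = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = best[b - 1, i] + cost[i, j]
                if c < best[b, j]:
                    best[b, j] = c
                    back[b, j] = i
    # Walk the back-pointers from the last bin to recover the change points.
    cuts = [n]
    j = n
    for b in range(k, 0, -1):
        j = int(back[b, j])
        cuts.append(j)
    cuts.reverse()
    return x, cuts


def iqr_outliers(seg, factor=1.5):
    """Values of one bin lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    seg = np.asarray(seg, dtype=float)
    q1, q3 = np.percentile(seg, [25, 75])
    iqr = q3 - q1
    mask = (seg < q1 - factor * iqr) | (seg > q3 + factor * iqr)
    return seg[mask]
```

Under these assumptions, calling optimal_bins on three well-separated clusters of values with k=3 recovers the cluster boundaries under both criteria, and iqr_outliers can then be applied to each x[cuts[b]:cuts[b + 1]] in turn. The factor of 1.5 is the conventional interquartile-range fence and is assumed here, not taken from the abstract. The sketch is cubic in the number of samples and is meant only to illustrate the objective.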