Abstract
We propose a new method for constructing a variable-bin-width histogram that accommodates unbalanced sample distributions while retaining, as a whole, the good aspects of both the equal-width (EW) and equal-area (EA) histograms that are widely used for data visualization and analysis. We formulate histogram construction as an optimal change point detection problem in which the bin boundaries are determined by minimizing the sum of the absolute errors or the squared errors within each bin. The former, based on Distance Minimization (DM), is new; the latter, based on Variance Minimization (VM), is considered the state of the art. The constructed histograms can be used to detect and visualize hidden outliers and anomalies by applying the interquartile range method within each bin. The final histograms are obtained by adjusting the bin boundaries and heights after the detected outliers and anomalies are removed. We further propose a method for annotating the constructed bins when annotation data is given for each sample as a set of nominal variables, using z-scores computed with respect to their distribution within each bin. We applied our method to both real vinyl greenhouse datasets and two different sets of three synthetic datasets, and confirmed that the DM and VM methods work as intended: both represent the sample distribution with fewer bins than the EW and EA methods, the interquartile range method detects anomalies as well as outliers, and the terms selected for annotation are interpretable and reasonable. The EW and EA methods have contrasting properties. The DM and VM methods lie in between, with DM closer to EA and VM closer to EW. The DM method runs substantially faster than the VM method and performs slightly better in the outlier detection and annotation tasks.
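To make the squared-error formulation concrete, the following is a minimal sketch of variance-minimizing bin boundaries followed by a per-bin interquartile range check. It is not the authors' algorithm: it assumes a generic dynamic-programming search over sorted samples, a fixed number of bins, and the conventional 1.5 × IQR fence. The names `vm_boundaries` and `iqr_outliers` and the parameter `k` are hypothetical, chosen only for illustration.

```python
import numpy as np


def vm_boundaries(values, n_bins):
    """Split sorted values into n_bins contiguous bins that minimise the
    total within-bin squared error about each bin's mean (assumed VM cost)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    if not 1 <= n_bins <= n:
        raise ValueError("n_bins must be between 1 and the number of samples")

    # Prefix sums let us evaluate the squared error of any slice in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(i, j):
        # Squared error of x[i:j] about its mean, with j > i.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # cost[k, j]: best total error when the first j samples form k bins.
    cost = np.full((n_bins + 1, n + 1), np.inf)
    back = np.zeros((n_bins + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, n_bins + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1, i] + sse(i, j)
                if c < cost[k, j]:
                    cost[k, j] = c
                    back[k, j] = i

    # Walk the back-pointers to recover the cut indices 0 = c0 < ... < cK = n.
    cuts = [n]
    j = n
    for k in range(n_bins, 0, -1):
        j = int(back[k, j])
        cuts.append(j)
    cuts.reverse()
    return x, cuts


def iqr_outliers(bin_values, k=1.5):
    """Return the values of one bin lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    v = np.asarray(bin_values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return v[(v < q1 - k * iqr) | (v > q3 + k * iqr)]


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8, 20.0, 20.1, 19.9]
    x, cuts = vm_boundaries(data, 3)
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        print(x[lo:hi], iqr_outliers(x[lo:hi]))
```

Swapping `sse` for a sum of absolute deviations would give an absolute-error variant in the spirit of DM, although the exact cost used by the DM method is not specified here. The triple loop is cubic in the sample count for a fixed number of bins, so this sketch is only suitable for small inputs.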