Abstract
The purpose of this paper was to choose an appropriate information dissimilarity measure for hierarchical clustering of daily streamflow discharge data from twelve gauging stations on the Brazos River in Texas (USA) for the period 1989–2016. To that end, we compared the average-linkage hierarchical clustering algorithm based on three measures: the compression-based dissimilarity measure (NCD), the permutation distribution dissimilarity measure (PDDM), and the Kolmogorov distance (KD). For each measure, a dissimilarity matrix was computed from the daily streamflow series and the agglomerative average-linkage hierarchical algorithm was applied to it. The algorithm was also compared with K-means clustering based on Kolmogorov complexity (KC), the highest value of the Kolmogorov complexity spectrum (KCM), and the largest Lyapunov exponent (LLE). The key findings of this study are that: (i) clustering based on KD is the most suitable of the approaches compared; (ii) ANOVA shows highly significant differences between the mean values of the four clusters, confirming that the number of clusters was chosen suitably; and (iii) from the clustering, the predictability of the Brazos River streamflow data given by the Lyapunov time (LT), corrected for randomness by the Kolmogorov time (KT), lies in the interval from two to five days.
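To make the NCD-based workflow concrete, the following is a minimal sketch of a compression-based dissimilarity matrix fed into agglomerative average-linkage clustering. It uses the standard normalized compression distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Several things here are assumptions rather than the paper's procedure: zlib as the compressor, the equal-width 16-symbol discretization in `encode`, the synthetic series standing in for the streamflow records, and all function names.

```python
import math
import random
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def encode(series, n_bins=16):
    """Hypothetical discretization: map each value to one of n_bins
    equal-width symbols so the series can be compressed as bytes."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant series
    return bytes(min(int((v - lo) / width), n_bins - 1) for v in series)


def ncd_matrix(encoded):
    """Symmetric dissimilarity matrix; NCD is computed once per pair."""
    n = len(encoded)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = ncd(encoded[i], encoded[j])
    return dist


def average_linkage(dist, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise dissimilarity until n_clusters remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
            d = sum(pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Synthetic stand-ins for station records (not Brazos River data).
rng = random.Random(42)


def seasonal(phase):
    return [math.sin(2 * math.pi * (t + phase) / 365.0) for t in range(730)]


def erratic():
    return [rng.random() for _ in range(730)]


names = ["smooth_1", "smooth_2", "erratic_1", "erratic_2"]
series = [seasonal(0), seasonal(30), erratic(), erratic()]
dist = ncd_matrix([encode(s) for s in series])
clusters = average_linkage(dist, n_clusters=2)
print([[names[i] for i in sorted(c)] for c in clusters])
```

The clustering step only needs a dissimilarity matrix, so the same `average_linkage` function could in principle be reused with a matrix built from another measure such as PDDM or KD.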
Highlights
Cluster analysis is employed to identify the set of objects with similar characteristics or identify groups, and has a broad range of applications in science
In this section, we describe the selected dissimilarity measures (compression-based measure, permutation distribution dissimilarity measure, and Kolmogorov distance) used in the average-linkage hierarchical clustering algorithm, which was applied to streamflow data measured from 12
General Features (i) Along the Brazos River course, mean daily streamflow values span a wide range, from 223.5 m³/s [Seymour station (1_08082500)] to 8851.4 m³/s [Rosharon station (12_08116650)], as seen in
Summary
Cluster analysis (also called clustering) is employed to identify sets of objects with similar characteristics, that is, to identify groups, and has a broad range of applications in science (e.g., biology, computational biology and bioinformatics, medicine, hydrology, geosciences, business and marketing, computer science, social science, and others). Its purpose can be stated as to: (i) identify the underlying structures in data; (ii) summarize behaviors or characteristics; (iii) assign new individuals to groups; and (iv) identify totally atypical objects [1,2,3]. The active variables are often (but not always) numeric, while the illustrative variables are used for understanding the characteristics on which the clusters are based and for interpreting them. The closeness of objects can be measured by the degree of distance (a dissimilarity measure) or by the degree of association (a similarity measure). The more alike two objects are, the lower the dissimilarity measure and the higher the similarity measure [4]. There are different methods for quantifying similarity or dissimilarity and for clustering, such as partitioning, hierarchical, fuzzy, density-based, and model-based
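As one concrete instance of a dissimilarity measure between time series, the sketch below compares the distributions of ordinal (permutation) patterns of two series. It is only an illustration of the idea behind a permutation distribution dissimilarity: the embedding dimension `m`, the `delay`, the use of the squared Hellinger distance as the divergence, and the function names are all assumptions, and the paper's exact PDDM definition may differ.

```python
import math
from collections import Counter
from itertools import permutations


def permutation_distribution(series, m=3, delay=1):
    """Relative frequency of each ordinal pattern of length m.

    A pattern is the tuple of window positions sorted by value
    (ties are broken by position because sorted() is stable).
    """
    span = (m - 1) * delay
    if len(series) <= span:
        raise ValueError("series too short for the chosen m and delay")
    counts = Counter()
    for start in range(len(series) - span):
        window = [series[start + k * delay] for k in range(m)]
        pattern = tuple(sorted(range(m), key=lambda k: window[k]))
        counts[pattern] += 1
    total = sum(counts.values())
    return {p: counts[p] / total for p in permutations(range(m))}


def squared_hellinger(p, q):
    """Squared Hellinger distance, 1 - sum(sqrt(p_i * q_i)), in [0, 1]."""
    return 1.0 - sum(math.sqrt(p[k] * q[k]) for k in p)


# Toy series: one strictly rising, one strictly falling.
rising = list(range(50))
falling = rising[::-1]

p_rising = permutation_distribution(rising)
p_falling = permutation_distribution(falling)
print(squared_hellinger(p_rising, p_rising))
print(squared_hellinger(p_rising, p_falling))
```

In this toy case the two series share no ordinal pattern, so the dissimilarity takes its maximum value, while a series compared with itself gives the minimum; this matches the statement that dissimilarity decreases as objects become more alike.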