Multidimensional Data Space Research Articles

Continuous monitoring of water surfaces is essential for water resource management. This study presents a nonparametric, unsupervised, automatic algorithm for identifying inland water pixels in multispectral satellite data, using multidimensional clustering and a high-performance subsampling approach for large scenes. Clustering analysis identifies similar samples in a multidimensional data space. Spectral information and derived indices were used to characterize each scene pixel individually. A machine learning approach with random subsampling and generalization through a Naïve Bayes classifier is also proposed to make the application of complex algorithms to large scenes feasible. Accuracy was evaluated against an independent dataset of water bodies in 15 Sentinel-2 images over France, acquired in different seasons and covering a wide range of water bodies and water colour types. The validation dataset covers a water surface of more than 1200 km² (approximately 12 million pixels) and includes over 80,000 water bodies outlined with a semiautomatic active learning method and then manually revised. The classification results were compared with the water pixel classifications of three major Level 2A processors (MAJA, Sen2Cor and FMask) and two of the most common thresholding techniques, Otsu and Canny-edge. An input mask was used to remove coastal waters, clouds, shadows and snow pixels. Water pixels were identified automatically from the clustering process without the need for ancillary or pretrained data. Combinations of up to three water indices (Modified Normalized Difference Water Index, MNDWI; Normalized Difference Water Index, NDWI; and Multiband Water Index, MBWI) and two reflectance bands (B8 and B12) were tested in the algorithm, and the best combination was NDWI-B12.
Of all the methods, our method achieved the highest mean kappa score, 0.874, across all tested scenes, with a per-scene kappa ranging from 0.608 to 0.980, and the lowest mean standard deviation, 0.091. Standard Otsu thresholding performed worst because the histograms were not bimodal, and the Canny-edge variation achieved an overall kappa of 0.718 when used with the MNDWI. Among the water masks provided by generic processors, FMask outperformed MAJA and Sen2Cor, with an overall kappa of 0.764. In-depth analysis shows a rapid drop in performance for all methods when identifying water bodies with a surface area below 0.5 ha, but the proposed approach outperformed the second-best method by 34% in this size class.
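The agreement figures above are kappa scores. As a rough illustration of how such a score is obtained from a predicted water mask and a reference mask, here is a minimal sketch that assumes the abstract's "kappa" is Cohen's kappa for a binary water / non-water labelling. The function name and the toy label lists are hypothetical and are not taken from the study.

```python
def cohen_kappa(reference, predicted):
    """Cohen's kappa for two equal-length sequences of binary labels (1 = water, 0 = other).

    Assumption: the kappa reported in the abstract is Cohen's kappa; this helper
    is an illustrative sketch, not the authors' evaluation code.
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(reference)
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # water correctly detected
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # non-water correctly rejected
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false water detections
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed water pixels

    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of both maps.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy example: 100 pixels with 40 TP, 5 FP, 10 FN and 45 TN.
    reference = [1] * 40 + [0] * 5 + [1] * 10 + [0] * 45
    predicted = [1] * 40 + [1] * 5 + [0] * 10 + [0] * 45
    print(f"kappa = {cohen_kappa(reference, predicted):.3f}")
```

Unlike raw accuracy, kappa discounts the agreement expected by chance, which matters when water covers only a small fraction of a scene.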

Many fundamental multivariate methods use the F distribution and its associated tests and critical values: it is the basis of many common statistical tests in chemometrics, for example, for detecting outliers or deciding whether an observation belongs to a predefined class. We have seen that the t distribution is appropriate for estimating critical values or confidence limits when a population has an underlying normal distribution but the sample size is small. This is primarily a consequence of the difficulty of determining a population standard deviation: because the usual estimate more often than not underestimates it, the apparent distribution about the mean is distorted. When we discussed the chi squared distribution [1], we noted that it represents the distribution of squared Mahalanobis distances from the mean, and in particular that if more than one variable is measured there is no specific positive or negative direction, so using squared distances (which are independent of direction) is essential. Hence, the chi squared distribution extends naturally from univariate to multivariate data. The F distribution can be regarded as the equivalent extension of the t distribution when there is more than one variable but the sample size is small. This distribution is widely employed in many diverse areas, and there are numerous ways of introducing it in the literature. In this and the next article, we focus primarily on the distribution of data in multidimensional space, although the F distribution is often introduced in the context of analysis of variance. We will come across this distribution and its associated statistic in other contexts in later articles. The F distribution is named after R. A. Fisher, a pioneering statistician most active in the 1920s and 1930s who worked in agricultural science in the UK.
Many of the fundamental multivariate methods, such as several approaches for one-class classification or class modelling, including SIMCA and multivariate statistical process control, use the F distribution and its associated tests and critical values. It is the basis of many common statistical tests in chemometrics, for example, for detecting outliers or deciding whether an observation belongs to a predefined class. However, in order to understand it, it is also necessary to understand its relationship to other distributions. If there are a large number of observations (i.e. ν2 is large), then the shape of the F distribution is very similar to that of the chi squared distribution with ν1 degrees of freedom, as illustrated in Figure 2, although there is a shift in position (in fact, the chi squared value equals ν1 F, so for 1 degree of freedom, ν1 = 1, the two coincide). Note that if both ν1 and ν2 are large, the F distribution also resembles the normal distribution, with a mean of 1. Figure 3 illustrates several different F distributions. We can note several things. If ν1 and ν2 are large, then we can see from the equations earlier that the mean is approximately equal to 1 and the variance to 4/ν2, so the larger ν2, the sharper (and more symmetric) the Gaussian. This is illustrated in Figure 4 for the case where ν1 = 1000. Such situations, whilst rare in traditional statistical applications, may often be encountered in chemometrics, where there may be a large number of variables, although they would still require large sample sizes. Note that the mode changes position as ν2 increases and the distribution becomes more symmetric. In traditional statistical texts, it is usual to present F distribution tables. Because there are a great many possible F distributions, these are usually presented as critical values. The critical value at p = 0.01 is the value of the F statistic that is expected to be exceeded by only 1% of the data; in some contexts, this is called the 99% confidence limit.
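The large-ν behaviour described above can be checked numerically. The sketch below uses the standard closed-form mean and variance of the F distribution, which are textbook results and are not reproduced in this abstract. The function names are illustrative only.

```python
def f_mean(nu2):
    """Mean of an F(nu1, nu2) distribution; it depends only on nu2 and needs nu2 > 2."""
    if nu2 <= 2:
        raise ValueError("the mean is undefined for nu2 <= 2")
    return nu2 / (nu2 - 2)


def f_variance(nu1, nu2):
    """Variance of an F(nu1, nu2) distribution; needs nu2 > 4."""
    if nu2 <= 4:
        raise ValueError("the variance is undefined for nu2 <= 4")
    return 2 * nu2**2 * (nu1 + nu2 - 2) / (nu1 * (nu2 - 2) ** 2 * (nu2 - 4))


if __name__ == "__main__":
    # As both degrees of freedom grow, the mean tends to 1 and the variance shrinks.
    for nu1, nu2 in [(10, 10), (100, 100), (1000, 1000)]:
        print(
            f"nu1={nu1:5d} nu2={nu2:5d} "
            f"mean={f_mean(nu2):.4f} "
            f"variance={f_variance(nu1, nu2):.5f} "
            f"4/nu2={4 / nu2:.5f}"
        )
```

For ν1 = ν2 = 1000 the exact variance is within about 1% of 4/ν2. When both degrees of freedom are large but unequal, the leading-order approximation is 2/ν1 + 2/ν2, which reduces to 4/ν2 when the two are equal.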
The F tables are self-explanatory and are given in Tables 1 and 2 for two critical values. Note that F tables can be presented for different critical values. Several more comprehensive tables are available on the web [3, 4], although it is recommended that p values are calculated in Excel or any other common environment. Note that the tables in this article are presented for the one-tailed F cdf. In some contexts, it is appropriate to look at two-tailed F tests, but we will not be concerned with these at this stage.
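As an alternative to printed tables, one-tailed critical values can be computed directly. The following is a minimal sketch using only the Python standard library: it evaluates the F cdf through the regularized incomplete beta function (a standard continued-fraction method) and inverts it by bisection. All function names are illustrative, and an established statistics package would normally be preferred for real work.

```python
import math


def _beta_cf(a, b, x, max_iter=500, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_cdf(x, nu1, nu2):
    """P(F <= x) for an F distribution with (nu1, nu2) degrees of freedom."""
    if x <= 0.0:
        return 0.0
    return reg_inc_beta(nu1 / 2.0, nu2 / 2.0, nu1 * x / (nu1 * x + nu2))


def f_critical(p, nu1, nu2):
    """One-tailed critical value: the x exceeded with probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    target = 1.0 - p
    lo, hi = 0.0, 1.0
    while f_cdf(hi, nu1, nu2) < target:  # widen the bracket until it contains x
        hi *= 2.0
    for _ in range(100):  # bisection
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, nu1, nu2) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for p in (0.05, 0.01):
        print(f"p={p}: F(1, 10) critical value = {f_critical(p, 1, 10):.3f}")
```

With ν1 = 1 and a large ν2, multiplying the result by ν1 gives a value close to the corresponding chi squared critical value, in line with the relationship between the two distributions described earlier.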

Articles published on Multidimensional Data Space

DKPE-GraphSYN: a drug synergy prediction model based on joint dual kernel density estimation and positional encoding for graph representation.

Application of an agent-based model of interactive visualization to create tools for visual management of ontological data

Robust reservoir identification by multi-well cluster analysis of wireline logging data

Neural network analysis of energy efficiency of the regional economy as a factor of Russia's sustainable development under conditions of big challenges

Analysis of human capital development in Russia by means of physical culture and sports using neural network modeling

Granular data representation under privacy protection: Tradeoff between data utility and privacy via information granularity

Pattern Labelling of Business Communication Data

Triboinformatics: machine learning algorithms and data topology methods for tribology

The benefits and dangers of using artificial intelligence in petrophysics

Element Abundance Analysis of the Metal-rich Stellar Halo and High-velocity Thick Disk in the Galaxy

Automatic water detection from multidimensional hierarchical clustering for Sentinel-2 images and a comparison with Level 2A processors

Gas Turbine Engine Condition Monitoring Using Gaussian Mixture and Hidden Markov Models

Multidimensional data analysis based on the results of nuclear transmutation calculations in zirconium alloys

Detection of Power Quality Disturbance using a Multidimensional Approach in an Embedded System

A Novel Deep Recurrent Belief Network Model for Trend Prediction of Transformer DGA Data

Principal component analysis and cluster analysis for evaluating the natural and anthropogenic territory safety

A collaborative recommender system enhanced with particle swarm optimization technique

An all-sky support vector machine selection of WISE YSO candidates

Image Understanding Applications of Lattice Autoassociative Memories.

The F distribution and its relationship to the chi squared and t distributions
