Kernel Principal Component Analysis for Uncertain Data Objects and Its Application in Classification

Abstract

Uncertain data mining has been a growing field of research in recent years. Numerous data mining techniques for tasks such as clustering, classification, and anomaly detection have been developed for uncertain data. Principal component analysis (PCA) and its extension, kernel principal component analysis (KPCA), are two well-known techniques widely used for dimensionality reduction and feature extraction on traditional certain data. To the best of our knowledge, however, these techniques have not been extended to uncertain data. In this paper, uncertain principal component analysis (UPCA) and uncertain kernel principal component analysis (UKPCA) are developed. Unlike the traditional techniques, which ignore uncertainty, the proposed techniques account for the inherent uncertainty of the data. In addition, we propose a classification model that combines the decision tree algorithm with the developed UKPCA technique. The proposed model achieves high classification accuracy on both real-world and synthetic data, especially for classes with nonlinear or arbitrary shapes.
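
A minimal sketch, assuming scikit-learn, of the certain-data analogue of the proposed pipeline (KPCA feature extraction followed by a decision tree) may help fix ideas; it is not the authors' UKPCA, which additionally models per-object uncertainty, and all parameters below are illustrative.

```python
# Certain-data analogue of the paper's pipeline: KPCA features + decision tree.
# Plain KernelPCA ignores uncertainty; the paper's UKPCA does not.
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two classes with nonlinear/arbitrary shapes, as in the paper's setting.
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF KPCA can unfold nonlinear class structure that linear PCA cannot.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit(X_tr)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(kpca.transform(X_tr), y_tr)
print("test accuracy:", tree.score(kpca.transform(X_te), y_te))
```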

Similar Papers
  • Conference Instance
  • 10.1145/1610555
Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data
  • Jun 28, 2009

The importance of uncertain data is growing quickly in many essential applications such as environmental surveillance, mobile object tracking and data integration. Recently, storing, collecting, processing, and analyzing uncertain data has attracted increasing attention from both academia and industry. Analyzing and mining uncertain data needs collaboration and joint effort from multiple research communities including reasoning under uncertainty, uncertain databases and mining uncertain data. For example, statistics and probabilistic reasoning can provide support with models for representing uncertainty. The uncertain database community can provide methods for storing and managing uncertain data, while research in mining uncertain data can provide data analysis tasks and methods. It is important to build connections among those communities to tackle the overall problem of analyzing and mining uncertain data. There are many common challenges among the communities. One is to understand the different modeling assumptions made, and how they impact the methods, both in terms of accuracy and efficiency. Different researchers hold different assumptions, and this is one of the major obstacles in the research of mining uncertain data. Another is the scalability of proposed management and analysis methods. Finally, to make analysis and mining useful and practical, we need real data sets for testing. Unfortunately, uncertain data sets are often hard to get. The goal of the First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U'09) is to discuss in depth the challenges, opportunities and techniques on the topic of analyzing and mining uncertain data. The theme of this workshop is to make connections among the research areas of uncertain databases, probabilistic reasoning, and data mining, as well as to build bridges among the aspects of models, data, applications, novel mining tasks and effective solutions. By making connections among different communities, we aim at understanding each other in terms of scientific foundation as well as commonality and differences in research methodology. The workshop program is very stimulating and exciting. We are pleased to feature two invited talks by pioneers in mining uncertain data. Christopher Jermaine will give an invited talk titled "Managing and Mining Uncertain Data: What Might We Do Better?" Matthias Renz will address the topic "Querying and Mining Uncertain Data: Methods, Applications, and Challenges". Moreover, eight accepted papers, presented in four full and four concise presentations, will cover a range of interesting topics and ongoing research projects in uncertain data mining.

  • Conference Article
  • Cited by 1
  • 10.1109/wcica.2012.6359194
A soft sensor method based on Integrated PCA
  • Jul 1, 2012
  • Weiming Shao + 1 more

Feature extraction methods such as Kernel Principal Component Analysis (KPCA) and Principal Component Analysis (PCA) are often used for soft sensor modeling in industrial processes with high-dimensional data. A soft sensor method based on Integrated Principal Component Analysis (Integrated PCA) is proposed to address weaknesses of both KPCA and PCA. This approach combines nonlinear information extracted by KPCA with linear information extracted by PCA; it not only reduces the dimensionality of the input data but also makes full use of both linear and nonlinear information. Partial Least Squares (PLS) is used to obtain the final soft sensor model, and Particle Swarm Optimization (PSO) is applied to find the optimal parameters of Integrated PCA and of KPCA. Finally, the proposed method is applied to build soft sensor models of diesel oil boiling point and other industrial targets and is shown to generalize better than other feature extraction methods.
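
A hedged sketch of the "integrated" idea, assuming scikit-learn: concatenate linear PCA scores with nonlinear KPCA scores and regress the soft-sensor target with PLS. Component counts and the RBF gamma are illustrative stand-ins for the PSO-tuned values of the paper, and the synthetic data stand in for real process measurements.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))                 # high-dimensional process inputs
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

T_lin = PCA(n_components=4).fit_transform(X)   # linear information (PCA)
T_non = KernelPCA(n_components=4, kernel="rbf",
                  gamma=0.1).fit_transform(X)  # nonlinear information (KPCA)
T = np.hstack([T_lin, T_non])                  # the "integrated" feature block

pls = PLSRegression(n_components=3).fit(T, y)  # final soft-sensor model
print("training R^2:", pls.score(T, y))
```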

  • Research Article
  • 10.1145/1809400.1809419
Summary of the first ACM SIGKDD workshop on knowledge discovery from uncertain data (U'09)
  • May 27, 2010
  • ACM SIGKDD Explorations Newsletter
  • Jian Pei + 2 more

The importance of uncertain data is growing quickly in many essential applications such as environmental monitoring, mobile object tracking and data integration. Recently, storing, collecting, processing, and analyzing uncertain data has attracted increasing attention from both academia and industry. Analyzing and mining uncertain data needs collaboration and joint effort from multiple research communities. Based on this motivation, we ran the First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U'09) in conjunction with the 2009 SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'09) in Paris. The focus of this workshop was to bring together and bridge research in reasoning under uncertainty, probabilistic databases and mining uncertain data. Work in statistics and probabilistic reasoning can provide support with models for representing uncertainty, work in the probabilistic database community can provide methods for storing and managing uncertain data, while work in mining uncertain data can define data analysis tasks and methods. It is important to build connections among those communities to tackle the overall problem of analyzing and mining uncertain data. There are many common challenges among the communities. One is understanding the different modeling assumptions made, and how they impact the methods, both in terms of accuracy and efficiency. Different researchers hold different assumptions about the semantics of probabilistic models and uncertainty, and this is one of the major obstacles in the research of mining uncertain data. Another challenge is the scalability of proposed management and analysis methods. Finally, to make analysis and mining useful and practical, we need real data sets for testing. Unfortunately, uncertain data sets are often hard to get and hard to share. The theme of this workshop was to make connections among the research areas of probabilistic databases, probabilistic reasoning, and data mining, as well as to build bridges among the aspects of models, data, applications, novel mining tasks and effective solutions. By making connections among different communities, we aim at understanding each other in terms of scientific foundation as well as commonality and differences in research methodology. Although the workshop was allocated only half a day, we had a very dynamic and exciting program. The workshop was among the best attended in conjunction with the conference, with about 40 attendees when it started. We were lucky to have two excellent invited talks in the workshop. Professor Christopher Jermaine of Rice University gave a talk on “Managing and Mining Uncertain Data: What Might We Do Better?”. In this talk, he expressed a few of his strongly held opinions on the management and mining of uncertain data. He argued that those who work in the field should listen very carefully to complaints from machine learning experts, who often say, “but all of our methods were already designed to work with uncertain data, so you are wasting your time!” Furthermore, he contended that too much work aimed at managing uncertainty is tightly coupled to first-order logic and related ideas. He also argued that Bayesian approaches and Monte Carlo methods should be much more widely employed in this area. Finally, he argued that too much work in this area neglects the application domains where uncertainty is most important: “what if” analysis, risk assessment, and prediction.
In his invited talk titled “Querying and Mining Uncertain Data: Methods, Applications, and Challenges”, Dr. Matthias Renz of Ludwig-Maximilians-Universität (LMU) München summarized several very interesting projects in his group exploring various aspects of mining uncertain data, particularly from the point of view of efficiency. The efficiency concern is particularly important for modern databases, since they allow users to incorporate the uncertainty of data in the hope of increasing the quality of query results. Dr. Renz gave an overview of modeling uncertain data in feature spaces and illustrated diverse probabilistic similarity search methods, which are important tools for many mining applications. In this context, he discussed some current methods as well as the challenges in clustering uncertain data and mining probabilistic rules. The two invited talks were very successful; they led to interesting discussions among the audience and the invited speakers, and they helped to highlight the interdisciplinary nature of the workshop. The program committee accepted eight papers: four were given 15-minute presentations and the other four 10-minute presentations. In the paper titled “Efficient Algorithms for Mining Constrained Frequent Patterns from Uncertain Data”, Leung and Brajczuk argue that constrained frequent pattern mining from uncertain data is important, since constrained frequent pattern mining and mining frequent patterns from uncertain data often arise in common applications such as analyzing medical laboratory data. They developed ...

  • Conference Article
  • 10.1117/12.833212
Classification of multispectral remote sensing image using Kernel Principal Component Analysis and neural network
  • Oct 30, 2009
  • Jie Yu + 4 more

A method combining Kernel Principal Component Analysis (KPCA) with a BP neural network is proposed for multispectral remote sensing image classification in this paper. First, the KPCA transformation, including Gaussian KPCA and polynomial KPCA, is carried out to obtain the first three uncorrelated bands, which contain most of the information of the seven-band TM images. Second, BP neural network classification is performed on the three bands obtained from the KPCA transformation. For validation, both classical PCA and KPCA are applied to the multispectral Landsat TM data for feature extraction. The results demonstrate that the proposed method improves classification accuracy compared with principal component analysis (PCA) combined with a BP neural network. Keywords: Kernel principal component analysis, BP neural network, Multispectral remote sensing, Classification. 1. INTRODUCTION Feature reduction in a remote sensing dataset is often desirable to decrease the processing time required to perform a classification. Principal component analysis (PCA) is a common method for image enhancement and compression. PCA maximizes the projection variance onto the first r eigenvectors (where r is the reduced dimensionality), i.e., the r orthogonal eigenvectors corresponding to the r largest eigenvalues [1]. However, PCA is a linear mapping in nature; it extracts only linear features and loses nonlinear ones. Therefore, kernel principal component analysis (KPCA) has been put forward to deal with nonlinear problems in several references [2, 3]. Recently, kernel-based learning algorithms, which have proved to be a promising approach for tackling nonlinear systems, have attracted much attention from researchers in the field of machine learning. KPCA has been applied in many fields, including failure detection in wastewater treatment plants [4-6], data denoising [7], recognition of handwritten digits [8], and classification of genetic data [9], among others. In recent years, several papers have used the KPCA method for feature extraction in image processing. A kernel machine-based discriminant analysis method presented by Juwei Lu et al. deals with the nonlinearity of the distribution of face patterns [11]. The application of KPCA for dimension reduction on remote sensing datasets with inherent nonlinear structure was presented by John Tan et al. [12]. A method combining KPCA and SAM provided by Zhang Youjing et al. has been shown to yield high classification accuracy for a city's vegetation [13]. KPCA feature extraction based on a Mahalanobis-distance Fuzzy C-Means genetic algorithm provided by Chang Ruichun has been shown to yield high classification accuracy in extracting nonlinear desertification features [14].
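
The two-stage recipe (KPCA down to three components, then a backpropagation classifier) might look roughly like the sketch below, with scikit-learn's MLPClassifier standing in for the BP network and synthetic data standing in for seven-band TM pixels; all hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for seven-band multispectral pixels with four land classes.
X, y = make_classification(n_samples=500, n_features=7, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

pipe = make_pipeline(
    KernelPCA(n_components=3, kernel="rbf", gamma=0.5),  # first three components
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
pipe.fit(X, y)
print("training accuracy:", pipe.score(X, y))
```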

  • Book Chapter
  • Cited by 3
  • 10.1007/978-981-10-6571-2_271
Anomaly Detection Based on Kernel Principal Component and Principal Component Analysis
  • Jun 7, 2018
  • Wei Wang + 5 more

Nowadays, behind-wall human detection based on UWB radar signals, which offer strong anti-jamming performance, is an important problem. In this setting, principal component analysis (PCA) has been used as an anomaly detection method, but PCA can only deal with linear data. Thus, we introduce kernel principal component analysis (KPCA) to perform a nonlinear form of PCA. We obtained data for different states from UWB radar signals for behind-wall human detection. These data were used as training and testing data to compute the squared prediction error (SPE) values that detect anomalies. The experimental results show that the introduced KPCA approach effectively captures the nonlinear relationships in the process data and shows superior monitoring performance compared with linear PCA.
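
A minimal sketch of SPE-based monitoring with KPCA, assuming scikit-learn and synthetic data in place of UWB radar measurements: project samples, reconstruct them back to input space, and flag those whose squared reconstruction error exceeds a control limit estimated on normal training data.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                     # "normal state" data
X_test = np.vstack([rng.normal(size=(20, 5)),
                    rng.normal(loc=4.0, size=(5, 5))])  # last 5 rows anomalous

kpca = KernelPCA(n_components=3, kernel="rbf", gamma=0.2,
                 fit_inverse_transform=True).fit(X_train)

def spe(model, X):
    # Squared prediction error: distance between each sample and its
    # reconstruction from the retained kernel principal components.
    X_hat = model.inverse_transform(model.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

threshold = np.quantile(spe(kpca, X_train), 0.99)  # control limit, normal data
print("flagged anomalies:", np.where(spe(kpca, X_test) > threshold)[0])
```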

  • Conference Article
  • 10.1109/fskd.2008.162
A KPCA RNN Based Model for the Area Flowing of Graduate Employment Forecasting
  • Oct 1, 2008
  • Cheng-Ssong Qing + 1 more

Searching for influence variables and forecasting the flow of graduate employment is an ongoing activity of considerable significance, but the forecasting is complex due to the time-series nature of the data and the complex factor inputs. The neural network method has been successfully employed to solve multi-factor problems; however, the forecasting results are not ideal due to nonlinearity and noise. In this work, a neural network model is presented by combining a Recurrent Neural Network (RNN) with Kernel Principal Component Analysis (KPCA), and the model is then used to forecast the area flow of graduate employment. In the model, RNN with KPCA and with Principal Component Analysis (PCA) as the feature extraction step are introduced for comparison. An empirical study with actual data from a school in China shows that both proposed methods achieve good forecasting performance compared with the plain NN method, and that the KPCA method performs better than the PCA method.
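
The PCA-versus-KPCA feature extraction comparison that recurs in this line of work can be sketched as below, assuming scikit-learn; an MLPRegressor stands in for the recurrent network and the data are synthetic stand-ins for the employment factor inputs, so only the comparison pattern is illustrated.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                 # complex factor inputs
y = np.sin(2 * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=300)

for name, reducer in [("PCA", PCA(n_components=4)),
                      ("KPCA", KernelPCA(n_components=4, kernel="rbf",
                                         gamma=0.2))]:
    # Same downstream network for both reducers, fit inside cross-validation.
    pipe = make_pipeline(reducer,
                         MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                                      random_state=0))
    print(name, "mean CV R^2:", cross_val_score(pipe, X, y, cv=5).mean())
```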

  • Conference Article
  • Cited by 10
  • 10.1109/icmlc.2007.4370750
An Iterative Algorithm for Robust Kernel Principal Component Analysis
  • Jan 1, 2007
  • Lei Wang + 3 more

Principal component analysis (PCA) has been proven to be an efficient method for dimensionality reduction, feature extraction, and pattern recognition. Kernel principal component analysis (KPCA) can be considered a natural nonlinear generalization of PCA that performs linear PCA in a high-dimensional space implicitly by using the kernel trick. However, both conventional PCA and KPCA suffer from being sensitive to outliers. Existing robust KPCA must eigendecompose the Gram matrix directly in each step and becomes computationally infeasible when the number of training samples is large, due to the size of the matrix. By extending an existing robust PCA algorithm with kernel methods, we present a novel robust adaptive algorithm for calculating the kernel principal components. The proposed method not only preserves KPCA's ability to capture underlying nonlinear structure but is also robust against outliers, restraining the effect of outlying samples. Compared with existing robust KPCA methods, our method does not have to store the kernel matrix, which significantly reduces the storage burden. In addition, our method shows potential for extension to an incremental learning version. Experimental results on synthetic data indicate that the improved algorithm is effective and promising.
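
The reweighting idea that robust (K)PCA schemes of this kind build on can be conveyed by a hedged numpy sketch of its linear skeleton: alternate PCA on weighted samples with downweighting of samples that reconstruct poorly, so outliers lose influence. The paper kernelizes such a scheme and makes it adaptive and incremental; this simplified batch version only illustrates the restraining principle.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.3])  # anisotropic inliers
X[:5] = rng.normal(loc=8.0, size=(5, 3))                   # gross outliers

w = np.ones(len(X))                        # sample weights, all equal at first
for _ in range(10):
    mu = np.average(X, axis=0, weights=w)  # weighted mean
    Xc = X - mu
    C = (Xc * w[:, None]).T @ Xc / w.sum() # weighted covariance
    lam, V = np.linalg.eigh(C)
    U = V[:, ::-1][:, :1]                  # leading principal direction
    resid = Xc - (Xc @ U) @ U.T            # part unexplained by the subspace
    err = np.sum(resid ** 2, axis=1)
    w = 1.0 / (1.0 + err)                  # restrain badly-fit (outlying) samples

print("lowest-weight samples (flagged outliers):", np.sort(np.argsort(w)[:5]))
```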

  • Conference Article
  • 10.1109/wicom.2008.1716
Forecasting the Area Flowing of Graduate Employment Based on KRNN method
  • Oct 1, 2008
  • Chengsong Qing + 1 more

It is hard to identify the influence variables and to forecast the flow of graduate employment because of the time-series nature of the data and the complex factor inputs. Recently, the neural network method has been successfully employed to address this problem; however, the forecasting results are not ideal due to nonlinearity and noise. In this work, a KRNN model is presented by combining a Recurrent Neural Network (RNN) with Kernel Principal Component Analysis (KPCA); based on this model, the area flow of graduate employment is forecasted, handling both the complex-factor and the time-series aspects of the problem. In the model, RNN with KPCA and with Principal Component Analysis (PCA) as the feature extraction step are introduced for comparison. An empirical study with actual data from a school in China shows that both proposed methods achieve good forecasting performance compared with the plain NN method, and that the KPCA method performs better than the PCA method.

  • Research Article
  • Cited by 4
  • 10.1109/embc.2014.6944344
Sparse kernel entropy component analysis for dimensionality reduction of neuroimaging data.
  • Aug 1, 2014
  • Annual International Conference of the IEEE Engineering in Medicine and Biology Society
  • Qikun Jiang + 1 more

Neuroimaging data typically have extremely high dimensionality, so dimensionality reduction is commonly used to extract discriminative features. Kernel entropy component analysis (KECA) is a newly developed data transformation method whose key idea is to preserve as much as possible of the estimated Renyi entropy of the input data set via a kernel-based estimator. Despite its good performance, KECA still suffers from low computational efficiency on large-scale data. In this paper, we propose a sparse KECA (SKECA) algorithm with a recursive divide-and-conquer solution and apply it to dimensionality reduction of neuroimaging data for classification of Alzheimer's disease (AD). We compare SKECA with KECA, principal component analysis (PCA), kernel PCA (KPCA), and sparse KPCA. The experimental results indicate that the proposed SKECA outperforms all the other algorithms when extracting discriminative features from neuroimaging data for AD classification.
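
The KECA selection rule that SKECA builds on is easy to state in code: instead of keeping the top-eigenvalue components as KPCA does, keep the eigenpairs that contribute most to the Renyi entropy estimate, i.e. the largest values of lambda_i * (1^T e_i)^2. The numpy sketch below implements plain KECA only, under an assumed RBF kernel width; the paper's sparse, divide-and-conquer solver is out of scope.

```python
import numpy as np

def keca(X, n_components=2, gamma=0.5):
    # Uncentered RBF kernel matrix (KECA works on the uncentered kernel).
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    lam, E = np.linalg.eigh(K)
    # Entropy contribution of each eigenpair, up to a constant 1/N^2 factor.
    contrib = lam * (E.sum(axis=0) ** 2)
    idx = np.argsort(contrib)[::-1][:n_components]  # most entropy-preserving
    return E[:, idx] * np.sqrt(np.maximum(lam[idx], 0.0))  # projected data

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))   # stand-in for high-dimensional features
print("KECA embedding shape:", keca(X).shape)
```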

  • Research Article
  • Cited by 3
  • 10.1155/2016/7263285
Fault Diagnosis Method on Polyvinyl Chloride Polymerization Process Based on Dynamic Kernel Principal Component and Fisher Discriminant Analysis Method
  • Jan 1, 2016
  • Mathematical Problems in Engineering
  • Shu-Zhi Gao + 4 more

In view of the fact that the polyvinyl chloride (PVC) polymerization process exhibits many fault types of a complex nature, a fault diagnosis algorithm based on the hybrid Dynamic Kernel Principal Component Analysis-Fisher Discriminant Analysis (DKPCA-FDA) method is proposed in this paper. Kernel principal component analysis and dynamic kernel principal component analysis are used for fault diagnosis of the PVC polymerization process, while the Fisher Discriminant Analysis (FDA) method is adopted to further separate the failure data. The simulation results show that dynamic kernel principal component analysis achieves good diagnostic accuracy for the PVC polymerization process, that FDA can further realize fault isolation, and that actual faults in PVC polymerization production can be monitored by dynamic kernel principal component analysis.
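
The DKPCA-FDA pattern might be sketched as follows, assuming scikit-learn: lag-augment each sample to capture process dynamics, extract nonlinear features with KPCA, then separate the classes with Fisher discriminant analysis (LDA here). The lag count, kernel width, and synthetic data are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lag_embed(X, lags=2):
    # Stack each row with its `lags` predecessors: the "dynamic" part of DKPCA.
    rows = [X[lags - i: len(X) - i] for i in range(lags + 1)]
    return np.hstack(rows)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))       # process measurements over time
y = (np.arange(400) // 100) % 2     # alternating normal/fault periods

Xd = lag_embed(X, lags=2)
yd = y[2:]                          # labels aligned with the embedded rows
Z = KernelPCA(n_components=5, kernel="rbf", gamma=0.1).fit_transform(Xd)

fda = LinearDiscriminantAnalysis().fit(Z, yd)   # Fisher discriminant step
print("training separation accuracy:", fda.score(Z, yd))
```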

  • Research Article
  • 10.4028/www.scientific.net/amr.255-260.2855
Classification of Area Flowing Based on KRNN Method
  • May 1, 2011
  • Advanced Materials Research
  • Xiang Sun

It is hard to identify the influence variables and to classify the flow areas of graduate employment because of the complex factor inputs. Recently, the neural network method has been successfully employed to address this problem; however, the classification results are not ideal due to nonlinearity and noise. In this work, a KRNN model is presented by combining a Recurrent Neural Network (RNN) with Kernel Principal Component Analysis (KPCA); based on this model, the flow areas of graduate employment are classified, and the complex-factor problem is handled well. In the model, RNN with KPCA and with Principal Component Analysis (PCA) as the feature extraction step are introduced for comparison. An empirical study with actual data shows that both proposed methods achieve good classification performance compared with the plain NN method, and that the Kernel Principal Component Analysis method performs better than the Principal Component Analysis method.

  • Conference Article
  • Cited by 1
  • 10.1109/cis.2007.55
An Adaptive Join Strategy in Distributed Data Stream Management System
  • Dec 1, 2007
  • Xiaojing Li + 3 more

As automation systems grow more complex, sensors need to become more intelligent. Recently, neural networks have been widely used to make sensors intelligent because of their good performance in capturing the information in data. However, due to their intrinsically linear character, they do not perform well in nonlinear data processing. In this paper, RNN with Kernel Principal Component Analysis (KPCA) and with Principal Component Analysis (PCA) as the feature extraction step are introduced for comparison. An experimental system is then set up with a pressure sensor. Examination of the example data shows that both proposed methods achieve good performance compared with the plain NN method, and that the KPCA method performs better than the PCA method.

  • Book Chapter
  • Cited by 4
  • 10.5772/9367
Non-Linear Feature Extraction by Linear Principal Component Analysis Using Local Kernel
  • Feb 1, 2010
  • Kazuhiro Hotta

In the last decade, the effectiveness of kernel-based methods for object detection and recognition has been reported (Fukui et al., 2006; Hotta, 2008c; Kim et al., 2002; Pontil & Verri, 1998; Shawe-Taylor & Cristianini, 2004; Yang, 2002). In particular, Kernel Principal Component Analysis (KPCA) took the place of traditional linear PCA as the first feature extraction step in various studies and applications. KPCA copes well with nonlinear variations. However, KPCA must solve an eigenvalue problem of size (number of samples) × (number of samples). In addition, kernel functions must be computed against all training samples in order to map a test sample to the subspace obtained by KPCA. The computational cost is therefore the main drawback. To reduce the computational cost of KPCA, sparse KPCA (Tipping, 2001) and the use of clustering (Ichino et al., 2007, in Japanese) were proposed. Ichino et al. (2007, in Japanese) reported that KPCA of cluster centers is more effective than sparse KPCA. However, the computational cost becomes a serious problem again when the number of classes is large and each class has its own subspace. For example, KPCA of visual words (cluster centers of local features) (Hotta, 2008b) was effective for object categorization, but its computational cost is high: each of 101 categories has one subspace constructed from 400 visual words, so 40,400 (= 101 categories × 400 visual words) kernel computations are required to map a local feature to all subspaces. On the other hand, traditional linear PCA is independent of the number of samples when the dimension of a feature is smaller than the number of samples, because the size of the eigenvalue problem depends on the minimum of the feature dimension and the number of samples. To map a test sample to a subspace, only inner products between basis vectors and the test sample are required. Therefore, in general, the computational cost of linear PCA is much lower than that of KPCA. In this paper, we propose how to exploit the nonlinearity of KPCA and the low computational cost of linear PCA simultaneously (Hotta, 2008a). Kernel-based methods map training samples to a high-dimensional space as x → φ(x); nonlinearity is realized by a linear method in that high-dimensional space. The dimension of the mapped feature space of the Radial Basis Function (RBF) kernel is infinite, so the mapped feature cannot be described explicitly. However, the mapped feature φ(x) of the polynomial kernel can be described explicitly, which means that KPCA with the polynomial kernel can be solved directly by linear PCA of the mapped features. Unfortunately, in general, the dimension of the mapped features is too high to solve by linear PCA even if the polynomial kernel of degree 2, K(x, y) = (1 + x^T y)^2, is used. The dimension of the mapped features of the polynomial ...
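
The chapter's premise is easy to verify numerically: for the degree-2 polynomial kernel K(x, y) = (1 + x^T y)^2 the feature map is explicit, so the kernel value equals a plain dot product of mapped vectors, and KPCA reduces to linear PCA on the mapped features. A small numpy check, assuming nothing beyond the kernel definition above:

```python
import numpy as np

def phi(x):
    # Explicit feature map for (1 + x.y)^2: a constant 1, sqrt(2)*x_i terms,
    # squared terms x_i^2, and sqrt(2)*x_i*x_j cross terms for i < j.
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
print((1 + x @ y) ** 2)   # kernel evaluation
print(phi(x) @ phi(y))    # identical inner product in the mapped space
```

Since (1 + x·y)^2 = 1 + 2 x·y + (x·y)^2, the two printed numbers agree, confirming that linear PCA on phi(X) is exactly KPCA with this kernel.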

  • Book Chapter
  • Cited by 7
  • 10.5772/9353
Bi-2DPCA: A Fast Face Coding Method for Recognition
  • Feb 1, 2010
  • Jian Yang + 2 more

Face recognition has received significant attention in the past decades due to its potential applications in biometrics, information security, law enforcement, etc. Numerous methods have been suggested to address this problem [1]. Among appearance-based holistic approaches, principal component analysis (PCA) has turned out to be very effective. As a classical unsupervised learning and data analysis technique, PCA was first used to represent images of human faces by Sirovich and Kirby in 1987 [2, 3]. Subsequently, Turk and Pentland [4, 5] applied PCA to face recognition and presented the well-known Eigenfaces method in 1991. Since then, PCA has been widely investigated and has become one of the most successful approaches to face recognition [6-15]. PCA-based image representation and analysis is based on image vectors: before applying PCA, the given 2D image matrices must be mapped into 1D image vectors by stacking their columns (or rows). The resulting image vectors generally lead to a high-dimensional image vector space, in which calculating the eigenvectors of the covariance matrix is a critical problem deserving consideration. When the number of training samples is smaller than the dimension of the images, the singular value decomposition (SVD) technique is useful for reducing the computational complexity [1-4]. However, when the training sample size becomes large, the SVD technique no longer helps. To deal with this problem, an incremental principal component analysis (IPCA) technique has been proposed recently [16], but the efficiency of this algorithm still depends on the distribution of the data. Over the last few years, two PCA-related methods, independent component analysis (ICA) [17] and kernel principal component analysis (KPCA) [18, 19], have attracted wide attention. Bartlett [20], Yuen [21], Liu [22], and Draper [23] proposed using ICA for face representation and found that it was better than PCA when cosine was used as the similarity measure (the performance difference between ICA and PCA was not significant when the Euclidean distance was used [23]). Yang [24] and Liu [25] used KPCA for face feature extraction and recognition and showed that KPCA outperforms classical PCA. Like PCA, ICA and KPCA both follow the matrix-to-vector mapping strategy when used for image analysis, and their algorithms are more complex than PCA's, so ICA and KPCA are considered computationally more expensive than PCA. The experimental results in ...
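
For contrast with the vector-based methods above, the 2DPCA idea underlying Bi-2DPCA can be sketched directly on image matrices: the eigenproblem is only as large as the image width (24 × 24 here) rather than the full pixel count as in vectorized PCA. A minimal numpy sketch; the "images" are synthetic stand-ins for face data, and only the row-direction projection of the bidirectional scheme is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 32, 24))   # 100 images of 32 x 24 pixels
mean = A.mean(axis=0)

# Row-direction image covariance: 24 x 24, vs. 768 x 768 for vectorized PCA.
G = np.mean([(img - mean).T @ (img - mean) for img in A], axis=0)
lam, V = np.linalg.eigh(G)
W = V[:, ::-1][:, :4]                # top 4 projection axes

features = A @ W                     # each image -> a 32 x 4 feature matrix
print("feature matrix shape per image:", features.shape[1:])
```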

  • Conference Article
  • Cited by 4
  • 10.1145/2442985.2442987
Temporal data mining of uncertain water reservoir data
  • Nov 6, 2012
  • Abhinaya Mohan + 1 more

This paper describes the challenges of mining uncertain water reservoir data recording past human operations, in order to learn from them reservoir policies that can be automated for the future operation of the water reservoirs. Records of human operations of water reservoirs often contain uncertain data. For example, the recorded amounts of water released and retained in the reservoirs are typically uncertain, i.e., they are bounded by some minimum and maximum values. Moreover, the time of release is also uncertain, i.e., typically only monthly or weekly amounts are recorded. To increase the effectiveness of data mining on uncertain water reservoir data, temporal data mining with inflow and rainfall data from several prior months was used. The experiments also compared several data classification methods for robustness in the presence of uncertain data.
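
One simple, hedged way to expose interval-valued (uncertain) records of this kind to standard classifiers is to encode each uncertain attribute as its interval midpoint plus its width, so the classifier sees both the estimate and its uncertainty. This is an illustrative encoding with synthetic stand-in data, not the paper's specific method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
lo = rng.uniform(0, 50, size=(n, 2))    # recorded minimum release/retention
hi = lo + rng.uniform(0, 10, size=(n, 2))   # recorded maximum values
y = (lo[:, 0] + hi[:, 0] > 55).astype(int)  # stand-in operation label

# Midpoints carry the estimate; widths carry the per-record uncertainty.
X = np.hstack([(lo + hi) / 2, hi - lo])
clf = RandomForestClassifier(random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```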
