Large High-dimensional Datasets Research Articles

Online financial transactions bring convenience to people’s lives, but they also give criminals opportunities to hijack users’ accounts and trick users into credit card fraud. Although machine learning methods have been adopted to detect anomalous transactions, it is hard for a single method to achieve satisfactory results as financial datasets grow in scale and dimensionality. In addition, financial anomaly detection faces an obvious imbalance between normal and abnormal records. In this situation, experimental results cannot be evaluated objectively using only the traditional metrics of precision, recall, and accuracy. This paper proposes an AutoEncoder enhanced LightGBM method for credit card fraud detection. The method inherits the advantages of each component: it uses an AutoEncoder for feature reconstruction on the dataset and integrates LightGBM, an improved GBDT (Gradient Boosting Decision Tree) algorithm, to detect abnormal data more accurately and efficiently. Besides the traditional evaluation metrics, F-measure, area under the curve (AUC), Matthews correlation coefficient (MCC), and balanced classification rate (BCR) are also adopted. Two financial datasets were used to validate the performance and robustness of the proposed model. Results on the credit card fraud dataset, which contains 31 features, indicate that our model significantly outperforms other models with a recall of 94.85%, a 10.70% improvement over the best-performing comparison model, whose recall is only 86%. Our model’s BCR score of 97% is also significantly better than that of the best-performing comparison model (92%), a 5% improvement. Various sampling methods and model combinations were considered in this study. The SMOTE algorithm combined with the proposed model produced the best results, with an AUC of 96.83% and an F-measure of 80.27%. The Santander bank transaction record dataset is a large high-dimensional dataset containing 200 features. Experimental results on this dataset reveal that, compared to other models, our model significantly improved recall and F-measure, raising the recall to 94.14% and the F-measure by 11.51% over the second-best-performing model. Overall, these findings demonstrate the robustness and superiority of our model in detecting fraudulent transactions and highlight the effectiveness of the SMOTE algorithm in combination with the proposed model.
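
The abstract above argues that precision, recall, and accuracy alone are misleading on imbalanced data. The following is a minimal sketch, not the authors' evaluation code, of how F-measure, MCC, and BCR can be computed from confusion-matrix counts. It assumes BCR is the mean of sensitivity and specificity, as it is commonly defined; the function name `imbalance_metrics` and the example counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute imbalance-aware metrics from confusion-matrix counts.

    Assumption: BCR (balanced classification rate) is taken to be the
    mean of recall (sensitivity) and specificity.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)

    # F-measure: harmonic mean of precision and recall.
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "bcr": (recall + specificity) / 2,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 fraudulent records among 10,000 transactions.
    for name, value in imbalance_metrics(tp=90, fp=50, tn=9850, fn=10).items():
        print(f"{name}: {value:.4f}")
```

With these hypothetical counts the accuracy is very high even though more than a third of the flagged transactions are false alarms, which is the kind of gap that F-measure and MCC expose.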


INTRODUCTION: Gene expression data analysis is a critical aspect of disease prediction and classification, playing a pivotal role in bioinformatics and biomedical research. High-dimensional gene expression datasets hold a wealth of information, but their effective use is hindered by irrelevant dimensions and noise. The challenge lies in extracting meaningful features from these datasets to improve the accuracy of disease prediction and classification while maintaining computational efficiency. Feature selection is a crucial step in addressing these challenges, as it aims to identify and retain only the most informative characteristics of large high-dimensional microarray datasets. Because microarray gene expression data have very high dimensionality, selecting relevant features is essential for efficient nearest neighbor search, a fundamental component of many analytical tasks in bioinformatics and data mining. Existing feature selection methods for high-dimensional data often face a trade-off between search accuracy and computational efficiency. This paper introduces a novel approach, the Nearest Neighbor Feature Selection with Symmetrical Uncertainty-based Redundancy Removal (NNFSRR) method, designed to improve the classification of microarray gene expression data through feature selection. The NNFSRR method reduces the dimensionality of the dataset by identifying and removing redundant features, allowing subsequent searches to operate solely on relevant dimensions.

OBJECTIVES: The primary goal is to evaluate the NNFSRR method's effectiveness in improving nearest neighbor search in microarray gene expression datasets by reducing dimensionality. The method uses Symmetrical Uncertainty-based correlation between dimensions for feature selection and aims to improve accuracy and efficiency compared to existing methods.

METHODS: The NNFSRR method uses Symmetrical Uncertainty to identify and remove redundant features from microarray gene expression datasets. The reduced datasets are then used for nearest neighbor search, improving accuracy and efficiency. Experiments are conducted on real-world datasets, and the method is compared with existing methods in terms of search time and accuracy.

RESULTS: The NNFSRR method demonstrates improved nearest neighbor search performance, outperforming basic brute-force methods and existing feature selection techniques. The selected feature sets exhibit strong class associations while minimizing correlations between features, improving classification precision.

CONCLUSION: The NNFSRR method presents a promising approach to the challenges posed by high-dimensional gene expression data. It effectively reduces dimensionality, improves search accuracy, and makes nearest neighbor search more efficient. Our experimental results demonstrate that this method outperforms existing techniques in terms of search time and accuracy, making it a valuable tool for applications in bioinformatics, data mining, pattern recognition, and biological information retrieval. The NNFSRR method holds the potential to advance our understanding of complex biological processes and support more accurate disease prediction and classification.
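
To make the Symmetrical Uncertainty idea in the abstract above concrete, here is a minimal sketch using the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) for discrete variables. It is not the NNFSRR implementation: the abstract does not give the redundancy rule, so the sketch assumes an FCBF-style rule (drop a feature when it is at least as correlated with an already kept feature as with the class), assumes the expression values have already been discretized, and uses hypothetical names (`select_features`, `g1`, `g2`, `g3`).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(x, y)))
    mutual_info = max(0.0, hx + hy - joint)  # clamp floating-point noise
    return 2 * mutual_info / (hx + hy)


def select_features(columns, labels, threshold=0.0):
    """Keep relevant, non-redundant features (assumed FCBF-style rule).

    columns: dict mapping feature name -> list of discretized values.
    """
    relevance = {name: symmetrical_uncertainty(col, labels)
                 for name, col in columns.items()}
    ranked = sorted((n for n in relevance if relevance[n] > threshold),
                    key=lambda n: relevance[n], reverse=True)
    selected = []
    for name in ranked:
        # Redundant if some kept feature explains it as well as the class does.
        if all(symmetrical_uncertainty(columns[name], columns[kept]) < relevance[name]
               for kept in selected):
            selected.append(name)
    return selected


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    columns = {
        "g1": [0, 0, 1, 1],  # informative
        "g2": [1, 1, 0, 0],  # mirrors g1, so redundant once g1 is kept
        "g3": [0, 1, 0, 1],  # independent of the class
    }
    print(select_features(columns, labels))
```

A nearest neighbor search would then run only on the selected columns, which is where the reduction in search time described in the abstract would come from.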

Articles published on Large High-dimensional Datasets

An AutoEncoder enhanced light gradient boosting machine method for credit card fraud detection

Robust Multiple Linear Backward Elimination Regression

Co-clustering contaminated data: a robust model-based approach

NNFSRR: Nearest Neighbor Feature Selection and Redundancy Removal Method for Nearest Neighbor Search in Microarray Gene Expression Data

Addressing the class-imbalance and class-overlap problems by a metaheuristic-based under-sampling approach

A Self-Supervised Fault Detection for UAV Based on Unbalanced Flight Data Representation Learning and Wavelet Analysis

On randomized sketching algorithms and the Tracy–Widom law

Estimating the Number of Clusters in High-Dimensional Large Datasets

A review on recent machine learning applications for imaging mass spectrometry studies

A Novel Hybrid Grey Wolf Optimization Algorithm using Two-Phase Crossover Approach for Feature Selection and Classification

LANNS

Fast spectral clustering method based on graph similarity matrix completion

Integration, exploration, and analysis of high-dimensional single-cell cytometry data using Spectre.

The CIPCA-BPNN Failure Prediction Method Based on Interval Data Compression and Dimension Reduction

Pairwise-Covariance Multi-view Discriminant Analysis for Robust Cross-View Human Action Recognition

Prediction of Hyperuricemia Risk Based on Medical Examination Report Analysis

Visualization of very large high-dimensional data sets as minimum spanning trees

CFOF

LShape Partitioning: Parallel Skyline Query Processing using MapReduce

Urban green economic development indicators based on spatial clustering algorithm and blockchain
