Feature engineering solution with structured query language analytic functions in detecting electricity frauds using machine learning

Simona-Vasilica Oprea, Adela Bâra

doi:10.1038/s41598-022-07337-7

Open Access

Journal: Scientific Reports	Publication Date: Feb 28, 2022
Citations: 10	License type: open-access

Affiliation: Bucharest University of Economic Studies

Abstract

Detecting fraud related to electricity consumption is usually difficult because the input datasets are sometimes unreliable due to missing and inconsistent records, faults, misinterpretation of meter reading remarks, status, etc. In this paper, we obtain meaningful insights from fraud detection using real datasets of Tunisian electricity consumption metered by conventional meters. We propose an extensive feature engineering approach using structured query language (SQL) analytic functions. Furthermore, double merging of datasets reveals more dimensions of the data, allowing better detection of irregularities in consumption. We analyze the results of several machine learning (ML) algorithms that handle weakly correlated features and highly unbalanced datasets. The skewness of the target is treated as a regular characteristic of the input data because most consumers are honest and only a small portion attempt to mislead the utility companies by tampering with metering devices. Our fraud detection solution combines classifiers with an anomaly detection feature obtained with an unsupervised ML algorithm, Isolation Forest, and extensive feature engineering using SQL analytic functions on large datasets. Several feature processing techniques raised the Area Under the Curve (AUC) score of the Decision Tree algorithm from 0.68 to 0.99.
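
To illustrate what feature engineering with SQL analytic functions can look like, the following is a minimal sketch, not the authors' actual pipeline. It assumes a hypothetical `invoices` table with made-up columns (`client_id`, `invoice_month`, `consumption_kwh`) and toy values, and uses SQLite (3.25 or later, for window function support) purely as a convenient stand-in for whatever database the study used. The query derives per-client features with `LAG` and a partitioned `AVG`.

```python
import sqlite3

# Hypothetical toy invoices: (client_id, invoice_month, consumption_kwh).
# These values are illustrative only and do not come from the paper's dataset.
rows = [
    (1, "2019-01", 100), (1, "2019-02", 110), (1, "2019-03", 20),
    (2, "2019-01", 200), (2, "2019-02", 210), (2, "2019-03", 190),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "client_id INTEGER, invoice_month TEXT, consumption_kwh REAL)"
)
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", rows)

# Analytic (window) functions compute per-client features without
# collapsing rows, so each invoice keeps its own engineered columns.
query = """
SELECT client_id,
       invoice_month,
       consumption_kwh,
       LAG(consumption_kwh) OVER w                     AS prev_kwh,
       consumption_kwh - LAG(consumption_kwh) OVER w   AS delta_kwh,
       AVG(consumption_kwh) OVER (PARTITION BY client_id) AS client_avg_kwh
FROM invoices
WINDOW w AS (PARTITION BY client_id ORDER BY invoice_month)
ORDER BY client_id, invoice_month
"""
features = conn.execute(query).fetchall()

for row in features:
    print(row)
```

In this toy data, the sharp drop in client 1's third invoice shows up as a large negative `delta_kwh`, the kind of irregularity that such engineered columns could expose to a downstream classifier or anomaly detector.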

Full Text