It might be instinctively assumed that the first digit of a randomly selected number is uniformly distributed among 1 to 9. However, the Newcomb–Benford law (NBL), also known as the first-digit law or Benford’s law, reveals a completely different fact: the distribution follows a logarithmic law. This bias in the frequency of leading digits appears in data as diverse as the lengths of rivers, the areas of countries, stock prices, and population figures. The omnipresence of NBL extends to mathematics (the Fibonacci sequence, the distribution of prime numbers, etc.) and to physics (e.g., the half-lives of unstable nuclei). Since its earliest years, a large literature has developed around the law, with applications to fields such as financial fraud detection, international trade, and the analysis of election results. Usually, the diagnosis in such studies is reduced to a qualitative binary assessment (conformity or non-conformity). The present paper establishes a supervised methodology to address existing gaps and enhance the reliability of final decisions. These gaps include uncertainty about the minimum data volume required to validate Benford’s law, the low sensitivity of current metrics (Euclidean distance or mean absolute difference) to minor levels of adulteration, and the need to quantify both the level and the nature of adulteration. Two new procedures, drawn from signal analysis and processing theory, are established. The first is based on the deconvolution of the empirical leading-digit distribution with the Benford distribution. The second consists of obtaining the generating functions (via Hilbert and Fourier transforms) to increase the contrast between the two distributions: the empirical one under study and Benford’s as the reference. Using calibrated datasets, we quantify the magnitude of adulteration due to accidental error or malicious fraud and discriminate between the two. Finally, fifteen real-world datasets were studied as a benchmark. This study shows that most of these datasets are suspected of fraud. Notably, in the field of human health, several datasets exhibit significant fraudulent deviations that defy conventional patterns of fraud and error; this phenomenon has been termed Berserker Fraud.
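For readers unfamiliar with the law, the minimal Python sketch below (not taken from the paper) illustrates the baseline comparison the abstract refers to: computing an empirical leading-digit distribution, contrasting it with Benford's expected frequencies P(d) = log10(1 + 1/d) via the two conventional metrics named above, and a toy FFT-based deconvolution in the spirit of the first procedure. The function names and the deconvolution step are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def leading_digit(x):
    """Return the first significant digit (1-9) of a nonzero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def empirical_distribution(data):
    """Relative frequencies of leading digits 1..9 in a dataset."""
    digits = np.array([leading_digit(v) for v in data if v != 0])
    counts = np.bincount(digits, minlength=10)[1:10]
    return counts / counts.sum()

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d)
benford = np.log10(1 + 1 / np.arange(1, 10))

# Example: lognormal data with a wide spread conform closely to NBL.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)
empirical = empirical_distribution(sample)

# The two conventional (low-sensitivity) conformity metrics cited above.
euclidean = np.linalg.norm(empirical - benford)
mad = np.mean(np.abs(empirical - benford))
print(f"Euclidean distance: {euclidean:.4f}, MAD: {mad:.4f}")

# Hypothetical deconvolution sketch: if the empirical distribution were a
# circular convolution of Benford's with an "adulteration kernel", that
# kernel could be recovered by pointwise division in Fourier space.
kernel = np.real(np.fft.ifft(np.fft.fft(empirical) / np.fft.fft(benford)))
print("Recovered kernel (~ delta at index 0 for conforming data):",
      np.round(kernel, 3))
```

For conforming data the recovered kernel concentrates near a unit impulse; deviations spread energy across the kernel, which is the kind of contrast enhancement the abstract's signal-processing procedures aim to exploit.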