Protecting Genetic Genealogical Databases from Identical-by-State Probing Attacks: A Machine Learning-Based Approach

Enow T. A. A., Lum A. F.

doi:10.9734/bji/2023/v27i6707

Abstract

Identical-by-state (IBS) probing is a method of attacking a public genealogy database to discover the identities of people who share specific traits. The attacker creates an IBS-inert DNA sequence (IBS-inert-DNA-sequence) and combines it with a sequence containing the trait of interest. The resulting sequence matches people who have genomic regions similar to the trait of interest. To prevent such attacks, it was hypothesized that IBS-inert-DNA-sequences are susceptible to detection by trained machine learning systems, because the attacker purposely creates an IBS-inert-DNA-sequence that is structurally dissimilar to real DNA sequences. The dataset consisted of real DNA sequences (from the UCI Machine Learning Repository's splice junction gene sequences dataset) and computer-generated sequences. Eighteen distinct Random Forest classifier models (RF-models) were built to determine the best configuration for discriminating between real and computer-generated sequences. The findings revealed that an optimized RF-model with k-mer and n-gram sizes of 2 each was the best-performing model, with accuracy, sensitivity, specificity, false positive rate, Matthews correlation coefficient, and area under the receiver operating characteristic curve values of 88.3%, 84.8%, 91.8%, 8.2%, 0.768 and 0.958, respectively. Performance declined as k-mer size increased.
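
As an illustration of the terms used in the abstract, the following minimal Python sketch shows one plausible reading of the k-mer and n-gram features and how the reported metric types are derived from a confusion matrix. It is not the authors' implementation. The function names (`kmers`, `ngram_counts`, `binary_metrics`), the treatment of n-grams as runs of consecutive k-mers, the choice of which class counts as "positive", and the toy confusion-matrix counts are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmers(seq, k):
    """Split a sequence into overlapping k-mers (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ngram_counts(tokens, n):
    """Count runs of n consecutive k-mer tokens (assumed n-gram definition)."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def binary_metrics(tp, fn, tn, fp):
    """Derive the metric types named in the abstract from confusion counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fp / (fp + tn),              # equals 1 - specificity
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical toy inputs, not data from the study.
tokens = kmers("ACGTAC", 2)                 # overlapping 2-mers
features = ngram_counts(tokens, 2)          # bigrams of 2-mers
metrics = binary_metrics(tp=8, fn=2, tn=9, fp=1)
```

Feature counts of this kind could be passed to a Random Forest classifier. The identity false positive rate = 1 - specificity is consistent with the reported pair of 91.8% and 8.2%.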

Full Text