Abstract

Identical-by-state (IBS) probing is a method of attacking a public genealogy database to discover the identities of people who share a specific trait. The attacker creates an IBS-inert DNA sequence (IBS-inert-DNA-sequence) and combines it with a sequence containing the trait of interest. The combined sequence matches people whose genomic regions are similar to the trait sequence. As a defence against such attacks, it was hypothesized that IBS-inert-DNA-sequences can be detected by trained machine learning systems, because the attacker purposely designs each IBS-inert-DNA-sequence to be structurally dissimilar to real DNA sequences. The dataset consisted of real DNA (from the UCI Machine Learning Repository's splice junction gene sequences dataset) and computer-generated sequences. Eighteen distinct Random Forest classifier models (RF-models) were built to determine the best configuration for discriminating between real and computer-generated sequences. The best-performing configuration was an optimized RF-model with k-mer and n-gram sizes of 2 each, which achieved accuracy, sensitivity, specificity, false positive rate, Matthews correlation coefficient, and area under the receiver operating characteristic curve values of 88.3%, 84.8%, 91.8%, 8.2%, 0.768 and 0.958, respectively. Performance declined as k-mer size increased.
