Abstract

Identical-by-state (IBS) probing is an attack on a public genealogy database that aims to discover the identities of people who share a specific trait. The attacker creates an IBS-inert DNA sequence (IBS-inert-DNA-sequence) and combines it with a sequence containing the trait of interest. The resulting sequence matches people whose genomic regions are similar to the trait sequence. As a defence against such attacks, it was hypothesized that IBS-inert-DNA-sequences are susceptible to detection by trained machine learning systems, because the attacker purposely designs an IBS-inert-DNA-sequence to be structurally dissimilar to real DNA sequences. The dataset consisted of real DNA (from the UCI Machine Learning Repository's splice junction gene sequences dataset) and computer-generated sequences. Eighteen distinct Random Forest classifier models (RF-models) were built to determine the best configuration for discriminating between real and computer-generated sequences. The best-performing configuration was an optimized RF-model with k-mer and n-gram values of 2 each, which achieved accuracy, sensitivity, specificity, false positive rate, Matthews correlation coefficient, and area under the receiver operating characteristic curve values of 88.3%, 84.8%, 91.8%, 8.2%, 0.768 and 0.958, respectively. Performance declined as k-mer size increased.
