Abstract

In the era of big data, next-generation sequencing produces a large amount of genomic data. With these genetic sequence data, research in the biological sciences can be further advanced. However, the growth in data scale often leads to privacy issues. Even if the data is not publicly released, an attacker may still steal private information through a membership inference attack. In this paper, we propose a private profile hidden Markov model (PHMM) with differential identifiability for gene sequence clustering. By adding random noise to the model, the probability of identifying individuals in the database is limited. The gene sequences can be clustered in an unsupervised manner, without labels, according to the output scores of the private PHMM. The variation of the divergence distance in the experimental results shows that the addition of noise distorts the profile hidden Markov model to a certain extent, with the divergence distance reaching up to 15.47 when the amount of data is small. In addition, the cosine similarity comparison of the clustering model before and after adding noise shows that, as the privacy parameter changes, the clustering model is distorted to a lesser or greater degree, which enables it to defend against membership inference attacks.
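The core mechanism can be illustrated with a short sketch. The snippet below is a minimal illustration rather than the paper's implementation: it perturbs a toy emission-probability matrix of a profile HMM with Laplace noise, renormalizes the rows, and measures the KL divergence between the original and noisy distributions, loosely mirroring the divergence-distance comparison described above. The toy matrix, the noise scale `b`, and the use of KL divergence as the distance are assumptions made for this example only.

```python
import numpy as np

def perturb_emissions(emissions, b, rng):
    """Add Laplace noise with scale b to each emission probability,
    clip to keep probabilities non-negative, and renormalize each row."""
    noisy = emissions + rng.laplace(loc=0.0, scale=b, size=emissions.shape)
    noisy = np.clip(noisy, 1e-12, None)                # keep probabilities valid
    return noisy / noisy.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL divergence D(p || q), summed over all match states."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)

# Toy emission matrix: 3 match states over the nucleotide alphabet {A, C, G, T}.
emissions = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])

noisy = perturb_emissions(emissions, b=0.05, rng=rng)  # b is an assumed noise scale
print("Divergence between original and noisy PHMM:", kl_divergence(emissions, noisy))
```

As the sketch suggests, a larger noise scale yields a larger divergence between the original and perturbed model, which is the trade-off between privacy and usability explored in the paper.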

Highlights

  • In recent years, with the development of IoT-based gene sequencing technology, the amount of biological gene sequence data has increased rapidly [1]. The biological sequence data contains information about species evolution, genetic traits, and potential diseases in genes.

  • The concern about privacy has grown alongside this surge in data. The personally identifiable information contained in gene sequences can easily be exploited by attackers, and unique individual information extracted from gene fragments can lead to the disclosure of private data [3]. Therefore, it is necessary to develop privacy protection technology for genomic data to ensure that private information is not stolen.

  • We present an iterative allocation method for differential identifiability privacy parameters that can be used to add noise to the profile hidden Markov model (PHMM); a minimal sketch of this idea follows the list.
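The sketch below illustrates one possible form of such an iterative allocation; it is not the paper's actual scheme. It assumes, for illustration only, that a total identifiability parameter rho_total is split equally across the iterations in which noise is injected into the PHMM parameters, and that each per-iteration share is converted to a Laplace noise scale through the illustrative privacy-loss term ln((1 - rho)/rho). The paper's allocation rule and calibration may differ.

```python
import math

def allocate_identifiability(rho_total, n_iterations):
    """Split a total identifiability bound equally across iterations.
    Equal splitting is an assumption for this sketch; other schedules
    (e.g. geometrically decreasing shares) are equally possible."""
    return [rho_total / n_iterations] * n_iterations

def laplace_scale(rho, sensitivity=1.0):
    """Map a per-iteration identifiability parameter rho (0 < rho < 0.5)
    to a Laplace noise scale, using the illustrative correspondence
    epsilon = ln((1 - rho) / rho) between rho and a privacy-loss level."""
    epsilon = math.log((1.0 - rho) / rho)
    return sensitivity / epsilon

rho_shares = allocate_identifiability(rho_total=0.1, n_iterations=5)
scales = [laplace_scale(r) for r in rho_shares]
print(scales)  # one noise scale per noisy update of the PHMM parameters
```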


Summary

Introduction

With the development of IoT-based gene sequencing technology, the amount of biological gene sequence data has increased rapidly [1]. The biological sequence data contains information about species evolution, genetic traits, and potential diseases in genes. Differential privacy is a definition of privacy based on strict mathematical proof. It defines privacy as the difference between the outputs of two neighboring databases, a definition that is inconsistent with relevant privacy regulations such as the U.S. HIPAA Safe Harbor rule [6]. To solve this problem, Lee and Clifton [7] put forward the concept of differential identifiability (DI), which defines privacy as the probability of an individual being identified by an attacker in the database and is therefore more consistent with people's intuition about privacy. In this paper, a privacy-preserving clustering algorithm based on the profile hidden Markov model (PHMM) is proposed. The experimental results show that if the privacy parameters are set properly, the proposed model remains usable after adding noise.
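To make the clustering step concrete, the sketch below clusters sequences by their profile scores and compares the cluster centroids obtained with and without noise via cosine similarity, in the spirit of the evaluation described in the abstract. The scores here are synthetic stand-ins; in the paper they would come from scoring each sequence against the (noisy) DI-PHMM. The use of KMeans, the two-cluster setting, and the noise scale are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Stand-in for PHMM output scores: each sequence is scored against two profile
# models, giving a 2-dimensional feature vector. The values are synthetic; in
# practice they would be log-odds scores produced by the (noisy) PHMM.
clean_scores = np.vstack([
    rng.normal(loc=[5.0, -2.0], scale=1.0, size=(50, 2)),   # sequences matching profile 1
    rng.normal(loc=[-3.0, 4.0], scale=1.0, size=(50, 2)),   # sequences matching profile 2
])
noisy_scores = clean_scores + rng.laplace(scale=0.5, size=clean_scores.shape)

def centroids(scores, k=2, seed=0):
    """Unsupervised clustering of sequences by their profile scores."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scores)
    return km.cluster_centers_

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

c_clean = centroids(clean_scores)
c_noisy = centroids(noisy_scores)
# Compare matching centroids before and after adding noise. This assumes the
# same cluster ordering; a real comparison would first align the clusters.
for a, b in zip(c_clean, c_noisy):
    print("Cosine similarity of centroids:", cosine_similarity(a, b))
```

A cosine similarity close to 1 indicates that the clustering structure survives the added noise, while lower values indicate the distortion that hinders membership inference.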

Related Work
Preliminary
Profile Hidden Markov Model with Differential Identifiability
Gene Sequence Clustering Algorithm Based on DI-PHMM
Experimental Results
Conclusion
