Abstract

The public availability of genome datasets, such as The Human Genome Project (HGP), The 1000 Genomes Project, The Cancer Genome Atlas, and the International HapMap Project, has significantly advanced scientific research and medical understanding. Our goal is to share such genomic information for downstream analysis while protecting the privacy of individuals through Differential Privacy (DP). We introduce synthetic DNA data generation based on pangenomes in combination with Pretrained Language Models (PTLMs). We introduce two novel tokenization schemes based on pangenome graphs to enhance the modeling of DNA. We evaluate these tokenization methods and compare them with classical single-nucleotide and k-mer tokenizations. Compared with k-mer tokenization, we find that our schemes consistently boost the model's performance and provide a longer effective context length (covering longer sequences with the same number of tokens). Additionally, we propose a method for using the pangenome graph in a way that complies with DP guarantees. We assess the impact of DP training on the quality of the generated sequences and discuss the trade-offs between privacy and model accuracy. The source code for our work will soon be published under a free and open-source license.
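To illustrate the "effective context length" point above, here is a minimal sketch (not the paper's implementation, and it does not show the pangenome-graph tokenizers) comparing single-nucleotide and k-mer tokenization of a toy DNA string; the function names, the 6-mer choice, and the 512-token budget are illustrative assumptions.

```python
# Illustrative sketch: how many bases of sequence fit into a fixed
# token budget under single-nucleotide vs. k-mer tokenization.

def single_nucleotide_tokenize(seq: str) -> list[str]:
    """One token per base (A, C, G, T)."""
    return list(seq)

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Non-overlapping k-mers: each token covers k bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

if __name__ == "__main__":
    seq = "ACGTACGTTGCAACGTTAGC" * 50  # 1,000-bp toy sequence
    budget = 512                        # hypothetical model context window

    for name, tokens in [
        ("single-nucleotide", single_nucleotide_tokenize(seq)),
        ("6-mer", kmer_tokenize(seq, k=6)),
    ]:
        bases_per_token = len(seq) / len(tokens)
        print(f"{name:>18}: {len(tokens)} tokens, "
              f"~{budget * bases_per_token:.0f} bp fit in a {budget}-token window")
```

With the same 512-token window, coarser tokens cover proportionally more sequence, which is the trade-off the pangenome-graph schemes in the paper are designed to exploit.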
