CNV-Finder: Streamlining Copy Number Variation Discovery.

Nicole Kuznetsov, Kensuke Daida, Mary B Makarious, Bashayer Al-Mubarak, Kajsa Atterling Brolin, Laksh Malik, Cedric Kouam, Breeana Baker, Miriam Ostrozovicova, Katherine M Andersh, Pin-Jui Kung, Yasser Mecheri, Yi-Wen Tay, Behloul Soundous Malek, Nada Al Tassan, Maria Teresa Perinan, Samantha Hong, Mathew Koretsky, Lana Sargeant, Kristin Levine, Cornelis Blauwendraat, Kimberley J Billingsley, Sara Bandres-Ciga, Hampton L Leonard, Huw R Morris, Andrew B Singleton, Mike A Nalls, Dan Vitale, The Global Parkinson's Genetics Program

doi:10.1101/2024.11.22.624040

Abstract

Copy Number Variations (CNVs) play pivotal roles in the etiology of complex diseases and vary across diverse populations. Understanding the association between CNVs and disease susceptibility is important in disease genetics research and often requires analysis of large sample sizes. One of the most cost-effective and scalable methods for detecting CNVs is based on normalized signal intensity values, such as Log R Ratio (LRR) and B Allele Frequency (BAF), from Illumina genotyping arrays. In this study, we present CNV-Finder, a novel pipeline that applies deep learning to array data, specifically a Long Short-Term Memory (LSTM) network, to expedite the large-scale identification of CNVs within predefined genomic regions. This facilitates the efficient prioritization of samples for subsequent, costly analyses such as short-read and long-read whole genome sequencing. We focus on five genes: Parkin (PRKN), Leucine Rich Repeat And Ig Domain Containing 2 (LINGO2), Microtubule Associated Protein Tau (MAPT), alpha-Synuclein (SNCA), and Amyloid Beta Precursor Protein (APP). These genes may be relevant to neurological diseases such as Alzheimer's disease (AD), Parkinson's disease (PD), or related disorders such as essential tremor (ET). By training our models on expert-annotated samples and validating them across diverse cohorts, including those from the Global Parkinson's Genetics Program (GP2) and additional dementia-specific databases, we demonstrate the efficacy of CNV-Finder in accurately detecting deletions and duplications. Our pipeline outputs app-compatible files for visualization within CNV-Finder's interactive web application. This interface enables researchers to review predictions and filter displayed samples by model prediction values, LRR range, and variant count in order to explore or confirm results. Our pipeline integrates this human feedback to enhance model performance and reduce false positive rates.
Through a series of comprehensive analyses and validations using both short-read and long-read sequencing data, we demonstrate the robustness and adaptability of CNV-Finder in identifying CNVs in regions of varied sparsity, noise, and size. Our findings highlight the significance of contextual understanding and human expertise in enhancing the precision of CNV identification, particularly in complex genomic regions like 17q21.31. The CNV-Finder pipeline is a scalable, public resource for the scientific community, available on GitHub (https://github.com/GP2code/CNV-Finder; DOI 10.5281/zenodo.14182563). CNV-Finder not only expedites accurate candidate identification but also significantly reduces the manual workload for researchers, enabling future targeted validation and downstream analyses in regions or phenotypes of interest.
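
The review step described in the abstract, filtering samples by model prediction value, LRR range, and variant count, can be illustrated with a minimal Python sketch. This is not CNV-Finder's code or file schema: the `SamplePrediction` record, the `filter_candidates` helper, the field names, and every threshold below are hypothetical and chosen only for illustration. The sketch assumes each sample has already been summarized for one region of interest and that a negative mean LRR (reduced signal intensity) is the pattern a reviewer would look for when screening for deletions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SamplePrediction:
    """Hypothetical per-sample summary for one genomic region.

    This is an illustrative record, not CNV-Finder's actual output schema.
    """
    sample_id: str
    prediction: float   # model output, assumed here to lie in [0, 1]
    mean_lrr: float     # mean Log R Ratio over array variants in the region
    variant_count: int  # number of array variants observed in the region


def filter_candidates(
    samples: Iterable[SamplePrediction],
    min_prediction: float = 0.8,                     # illustrative threshold
    lrr_range: Tuple[float, float] = (-2.0, -0.2),   # illustrative deletion-like range
    min_variants: int = 10,                          # illustrative minimum coverage
) -> List[SamplePrediction]:
    """Keep samples passing all three filters, highest prediction first."""
    lrr_low, lrr_high = lrr_range
    kept = [
        s for s in samples
        if s.prediction >= min_prediction
        and lrr_low <= s.mean_lrr <= lrr_high
        and s.variant_count >= min_variants
    ]
    return sorted(kept, key=lambda s: s.prediction, reverse=True)


# Made-up example values, for demonstration only.
samples = [
    SamplePrediction("S1", 0.95, -0.45, 120),
    SamplePrediction("S2", 0.91, 0.02, 118),   # mean LRR outside the range
    SamplePrediction("S3", 0.40, -0.50, 95),   # prediction too low
    SamplePrediction("S4", 0.88, -0.60, 4),    # too few variants
    SamplePrediction("S5", 0.99, -0.35, 110),
]
candidates = filter_candidates(samples)
print([s.sample_id for s in candidates])  # ['S5', 'S1']
```

Sorting by prediction value puts the most confident candidates first, which matches the stated goal of prioritizing samples for costlier sequencing-based validation. A duplication screen would use a positive LRR range instead.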

Journal: bioRxiv : the preprint server for biology
Publication Date: Nov 23, 2024
License type: CC0 1.0
