Abstract
Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine‐learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNPs) for fine‐scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self‐assignment accuracy of at least 90% using each method to create panels of 50–700 markers. Panels of SNPs identified using random forest‐based methods performed up to 7.8 and 11.2 percentage points better than FST‐selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self‐assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each data set, respectively, a level of accuracy never reached for these species using FST‐selected panels. Our results demonstrate a role for machine‐learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.
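The FST-ranking baseline mentioned above can be illustrated with a minimal sketch. It assumes biallelic SNPs, equal population weights and the simplified estimator FST = (HT - HS) / HT; the abstract does not say which FST estimator was used, so this is not necessarily the authors' calculation. The function names, SNP names and allele frequencies are hypothetical.

```python
def snp_fst(pop_freqs):
    """Simplified FST for one biallelic SNP.

    pop_freqs holds the frequency of one allele in each population.
    Assumption: equal population weights, FST = (HT - HS) / HT.
    """
    n_pops = len(pop_freqs)
    p_bar = sum(pop_freqs) / n_pops
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # monomorphic SNP carries no information
    h_within = sum(2 * p * (1 - p) for p in pop_freqs) / n_pops
    return (h_total - h_within) / h_total


def top_fst_panel(freqs_by_snp, panel_size):
    """Rank SNPs by FST (highest first) and keep the top panel_size."""
    ranked = sorted(
        freqs_by_snp,
        key=lambda snp: snp_fst(freqs_by_snp[snp]),
        reverse=True,
    )
    return ranked[:panel_size]


# Hypothetical allele frequencies in three populations (illustration only).
freqs_by_snp = {
    "snp_a": [0.10, 0.90, 0.50],  # strongly differentiated
    "snp_b": [0.50, 0.50, 0.50],  # no differentiation
    "snp_c": [0.40, 0.60, 0.50],  # weakly differentiated
}
panel = top_fst_panel(freqs_by_snp, panel_size=2)
print(panel)
```

Ranking each SNP on its own like this ignores redundancy among markers, which is one reason a panel of the highest-FST SNPs can underperform a panel chosen jointly by a classifier.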
Highlights
Genetic assignment of individuals to their source populations is useful for uncovering the spatial distribution of populations and migration patterns (e.g., André et al., 2016) relevant to wildlife management and conservation (Manel, Gaggiotti, & Waples, 2005).
We further reduced our panel for downstream feature selection by removing redundant single nucleotide polymorphisms (SNPs) and SNPs in linkage disequilibrium using the genepop_toploci function in the R package Genepopedit (Stanley, Jeffery, Wringe, DiBacco, & Bradbury, 2016) at an r² threshold of 0.2 and a minimum global FST of 0.05.
Because the mean decrease in accuracy (MDA) value expresses a SNP's relative importance as the per cent decrease in model accuracy, a strict cut-off threshold will vary among data sets, depending on how well the population can be inferred from a SNP.
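The idea behind MDA can be sketched as a permutation test: shuffle one SNP's genotypes across individuals and record how much classification accuracy drops. The sketch below is a simplification under stated assumptions. A random forest computes this per tree on out-of-bag samples, whereas here a single fixed data set and a hypothetical toy classifier (toy_predict) stand in for the fitted model. All names and genotypes are illustrative.

```python
import random


def accuracy(predict, genotypes, labels):
    """Fraction of individuals assigned to their true population."""
    correct = sum(predict(row) == label for row, label in zip(genotypes, labels))
    return correct / len(labels)


def mean_decrease_in_accuracy(predict, genotypes, labels, snp_index,
                              n_repeats=20, seed=1):
    """Average drop in accuracy after shuffling one SNP across individuals."""
    rng = random.Random(seed)
    baseline = accuracy(predict, genotypes, labels)
    decreases = []
    for _ in range(n_repeats):
        column = [row[snp_index] for row in genotypes]
        rng.shuffle(column)  # break the link between this SNP and population
        permuted = [
            row[:snp_index] + [value] + row[snp_index + 1:]
            for row, value in zip(genotypes, column)
        ]
        decreases.append(baseline - accuracy(predict, permuted, labels))
    return sum(decreases) / n_repeats


def toy_predict(row):
    """Hypothetical classifier: uses only SNP 0 and ignores SNP 1."""
    return "river_B" if row[0] >= 1 else "river_A"


# Hypothetical genotypes (counts of one allele) at two SNPs for 8 fish.
genotypes = [[0, 1], [0, 2], [0, 0], [0, 1], [2, 1], [2, 0], [2, 2], [2, 1]]
labels = ["river_A"] * 4 + ["river_B"] * 4

mda_informative = mean_decrease_in_accuracy(toy_predict, genotypes, labels, snp_index=0)
mda_noise = mean_decrease_in_accuracy(toy_predict, genotypes, labels, snp_index=1)
print(mda_informative, mda_noise)
```

Because the values are relative to the baseline accuracy of the model and data at hand, a cut-off that separates informative from uninformative SNPs in one data set need not transfer to another.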
Summary
Genetic assignment of individuals to their source populations is useful for uncovering the spatial distribution of populations and migration patterns (e.g., André et al., 2016) relevant to wildlife management and conservation (Manel, Gaggiotti, & Waples, 2005). Atlantic and Chinook salmon are species that exemplify opportunities, challenges and applications associated with selecting panels of genetic markers for efficient self-assignment to source populations. Both species are widely distributed, extensively exploited, and of particular conservation concern in parts of their ranges (Bradbury, Hamilton, Dempson, et al., 2015; Bradbury et al., 2016; COSEWIC, 2011; Larson, Seeb, et al., 2014). We provide evidence of successful implementation of machine-learning approaches on a metapopulation scale for site-by-site (river) classification to establish a relevant, nonredundant, maximally reduced panel of genetic markers. By testing these novel approaches, we explore methods for capitalizing on large genomic data sets for genetic population assignment, with potential for application across a range of systems.