Closely related species of Salmonidae, including Pacific and Atlantic salmon, can be distinguished from one another based on nucleotide sequences from the mitochondrial cytochrome c oxidase subunit 1 gene (COI), using ensembles of fragments aligned to genetic barcodes that serve as digital proxies for the relevant species. This is accomplished by exploiting both the nucleotide sequences and their quality scores recorded in a FASTQ file obtained via next-generation (NextGen) sequencing of mitochondrial DNA extracted from Coho salmon caught with hook and line in the Gulf of Alaska. The alignment is done using MUSCLE (Muscle 5.2) (Edgar in Nat Commun 13:6968, 2022), applied to multiple versions of each fragment, each perturbed according to the nucleobase identification error probabilities underlying the quality scores. The Damerau-Levenshtein distance is then used to determine which candidate species' genetic barcode is closest to each aligned, perturbed fragment. The "votes" that the sampled fragments cast for the different candidate species are pooled and converted into identification probabilities, using weights determined by the entropy of the fragment-specific identification probability distributions. This novel approach to quantifying the uncertainty associated with measurements made using NextGen sequencing can be applied to discriminate between closely related species, and hence to value assignment for reference materials that support determinations of the authenticity of seafood, for example, NIST Reference Materials 8256 and 8257 (Coho salmon) (Ellisor et al., 2021).
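
The perturb, vote, and pool steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: quality scores are taken to be Phred+33 encoded; the MUSCLE alignment step is skipped, so fragments are assumed to already span the barcode region; the distance is the optimal string alignment (restricted) variant of the Damerau-Levenshtein distance; and each fragment is weighted by 1 - H/log K, where H is the entropy of its vote distribution and K is the number of candidate species. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def phred_error_probs(quality, offset=33):
    """Per-base error probabilities from a FASTQ quality string (Phred+33 assumed)."""
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]

def perturb(seq, quality, rng):
    """Swap each base for one of the other three with its error probability."""
    out = []
    for base, p in zip(seq, phred_error_probs(quality)):
        if rng.random() < p:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]

def fragment_vote_distribution(seq, quality, barcodes, n_replicates, rng):
    """Share of perturbed replicates whose closest barcode is each species.

    Ties go to the species listed first in `barcodes` (an arbitrary choice).
    """
    votes = Counter()
    for _ in range(n_replicates):
        replicate = perturb(seq, quality, rng)
        closest = min(barcodes, key=lambda s: osa_distance(replicate, barcodes[s]))
        votes[closest] += 1
    return {s: votes[s] / n_replicates for s in barcodes}

def entropy(dist):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def pool(distributions):
    """Entropy-weighted average of per-fragment distributions.

    Assumed weight: 1 - H / log(K), so decisive fragments count more.
    Requires at least two candidate species.
    """
    species = list(distributions[0])
    max_h = math.log(len(species))
    weights = [max(0.0, 1.0 - entropy(d) / max_h) for d in distributions]
    total = sum(weights)
    if total == 0:  # every fragment is maximally ambiguous
        return {s: 1.0 / len(species) for s in species}
    return {s: sum(w * d[s] for w, d in zip(weights, distributions)) / total
            for s in species}
```

With this weighting, a fragment whose replicates split evenly among the candidates contributes nothing to the pooled probabilities, while a fragment whose replicates all vote for one species contributes with full weight. Other monotone functions of the entropy would serve the same purpose.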