A setback into a success: What can batch effects tell us about best practices in genomics?

Xavier Dallaire, Claire Mérot

doi:10.1111/1755-0998.13615

Abstract

The increasing access to high-throughput sequencing is certainly one of the major changes that molecular ecology has gone through over the last decade. With the positive trend towards more open science, most sequencing data sets are now available in public databases, which holds great potential but also carries the risk of introducing batch effects into studies that combine data sets. In this issue of Molecular Ecology Resources, Lou and Therkildsen (2022) offer a timely discussion on the matter by analyzing an imperfect low-coverage whole-genome sequencing data set, in which they test the effects of differences in sequencing choices, DNA degradation, and read depth on routine population genomics analyses. Through a series of diagnostic tools, they uncover multiple factors producing technical artefacts that can bias estimates of genetic diversity, inference of population structure, and selection scans. For each confounding factor, they demonstrate the effectiveness of mitigation approaches and suggest other avenues to deal with the issue. In this perspective, we highlight considerations regarding (1) effects that arise from differences between batches of sequencing; (2) unavoidable heterogeneity within data sets; and (3) more general concerns around the use of next-generation sequencing in population genomics. Altogether, by exploring what may have appeared at first glance as a "failed" sequencing project, Lou and Therkildsen (2022) end up setting a standard of best practices to make the most of heterogeneous whole-genome sequences, opening a promising avenue towards efficient reuse of published data sets.

Full Text