The identification of positive selection has been framed as a classification task, with Convolutional Neural Networks (CNNs) already outperforming summary statistics and likelihood-based approaches in accuracy. Many CNN-based methods rearrange the pixels of images representing raw genomic data as a preprocessing step to improve classification accuracy, yet the efficacy of these pixel-rearrangement techniques remains inadequately examined, particularly in the presence of confounding factors such as population bottlenecks, migration, and recombination hotspots. We introduce a set of pixel-rearrangement algorithms aimed at enhancing CNN classification accuracy in detecting selective sweeps, and we use them to assess the performance of four CNN models for selective sweep detection. Our findings show that the judicious application of rearrangement algorithms notably improves the overall classification accuracy of a CNN across various datasets simulating confounding factors. We observed that sorting the columns of the genomic matrices has a greater impact on CNN performance than rearranging the sequences. To some extent, these rearrangement algorithms are also more robust to misspecified demographic models than the default preprocessing algorithm suggested by the respective authors of each CNN architecture. We provide the data rearrangement algorithms as a distinct package available for download at: https://github.com/Zhaohq96/Genetic-data-rearrangement.
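To make the distinction between rearranging sequences and sorting columns concrete, the following is a minimal sketch in plain Python. It assumes a binary genomic matrix in which each row is a sequence and each column is a polymorphic site, with 1 marking the derived allele. The ordering criterion (descending count of derived alleles) and the function names are illustrative assumptions only; they are not taken from the package above, whose algorithms may differ.

```python
from typing import List

Matrix = List[List[int]]  # rows = sequences, columns = sites (assumed layout)


def sort_rows_by_derived_count(matrix: Matrix) -> Matrix:
    """Reorder sequences (rows) by descending number of derived alleles.

    Hypothetical criterion chosen for illustration. sorted() is stable,
    so rows with equal counts keep their original relative order.
    """
    return sorted(matrix, key=sum, reverse=True)


def sort_columns_by_derived_count(matrix: Matrix) -> Matrix:
    """Reorder sites (columns) by descending number of derived alleles.

    Hypothetical criterion chosen for illustration. Note that this
    discards the original left-to-right order of the sites.
    """
    if not matrix:
        return []
    n_cols = len(matrix[0])
    # Count derived alleles at each site across all sequences.
    col_counts = [sum(row[j] for row in matrix) for j in range(n_cols)]
    # Column indices ordered from most to fewest derived alleles.
    order = sorted(range(n_cols), key=lambda j: col_counts[j], reverse=True)
    # Apply the same column order to every row.
    return [[row[j] for j in order] for row in matrix]


# Toy example: 3 sequences x 4 sites.
toy = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
]

rows_sorted = sort_rows_by_derived_count(toy)
cols_sorted = sort_columns_by_derived_count(toy)
```

Row sorting changes only which sequence appears on which image row, whereas column sorting changes which site appears at which horizontal position, so the two operations alter the image a CNN sees in different ways.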