Abstract
Scholars have noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies continue these disparities? Here, using non-human primates as a case study, we identified disparities in massively parallel genomic sequencing data and conducted interviews with scientists who produced these data to learn their motivations when selecting study species. We tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive (SRA). Of the 179.6 terabases (Tb) of sequence data in SRA for 519 non-human primate species, 135 Tb (approx. 75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees and crab-eating macaques. The strongest predictors of the amount of genomic data were the total number of non-medical publications (linear regression; r² = 0.37; p = 6.15 × 10⁻¹²) and number of medical publications (r² = 0.27; p = 9.27 × 10⁻⁹). In a generalized linear model, the number of non-medical publications (p = 0.00064) and closer phylogenetic distance to humans (p = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analysed their responses using grounded theory. Consistent with our quantitative results, authors mentioned their choice of species was motivated by sample accessibility, prior published work and relevance to human medicine.
Our mixed-methods approach helped identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies and research teams aiming to align their broader goals with future data generation efforts.
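The two headline quantities in the abstract, the share of sequence data held by the top five species and the r² of a single-predictor linear regression, can be sketched as below. This is a minimal illustration only: the function names `share` and `r_squared` are hypothetical, the toy values passed to `r_squared` are invented for demonstration and are not the study's data, and the abstract does not state whether the authors transformed their variables before fitting.

```python
# Figures quoted in the abstract (terabases of sequence data in SRA).
TOTAL_TB = 179.6      # all 519 non-human primate species
TOP_FIVE_TB = 135.0   # the five most-sequenced species


def share(part, whole):
    """Fraction of `whole` accounted for by `part`."""
    return part / whole


def r_squared(xs, ys):
    """r^2 of a simple least-squares regression of ys on xs.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


top_five_share = share(TOP_FIVE_TB, TOTAL_TB)
print(f"Top-five share: {top_five_share:.1%}")  # Top-five share: 75.2%

# Hypothetical toy values (NOT the study's data): publication counts
# versus sequence-data amounts for three imaginary species.
toy_r2 = r_squared([1, 2, 3], [1, 3, 2])
print(f"Toy r^2: {toy_r2:.2f}")  # Toy r^2: 0.25
```

The first figure reproduces the "approx. 75%" reported above; the second only shows how an r² value such as 0.37 would be obtained from paired per-species counts.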
Highlights
Scholars have long observed taxonomic unevenness in terms of focal species included in published research studies.
Future scientists will be primed to more readily and powerfully answer novel questions with species having extensive histories of prior study relative to more understudied taxa. This cascade is especially strong when the data produced in earlier studies have been made freely available to other researchers; in addition to reproducibility-related benefits, public data sharing allows for important, downstream research questions to be developed and answered using data originally generated for other research purposes.
Are individual predictors such as non-medical publication history, medical publication history, geographical range, frequency in captivity, International Union for the Conservation of Nature (IUCN) Red List conservation status, activity pattern and phylogenetic distance to humans significantly associated with patterns of genomic data availability? We incorporated a qualitative component in which we interviewed first and/or corresponding authors on papers that generated non-human primate genomic sequence data to record their motivations and the factors that they explicitly considered when selecting species to study.
Summary
Scholars have long observed taxonomic unevenness in terms of focal species included in published research studies. Species that are characterized as ‘models’ for various processes or fields, such as Arabidopsis thaliana in the botanical sciences or rhesus macaques (Macaca mulatta) in biomedicine, may continue to be disproportionately studied because they benefit from the continuous accumulation of knowledge and research tools specific to that organism [7]. These patterns of taxonomic unevenness in scientific research matter. Future scientists will be primed to more readily and powerfully answer novel questions with species having extensive histories of prior study relative to more understudied taxa. This cascade is especially strong when the data produced in earlier studies have been made freely available to other researchers; in addition to reproducibility-related benefits, public data sharing allows for important, downstream research questions to be developed and answered using data originally generated for other research purposes.