Abstract

Replication is a fundamental tenet of science, but there is increasing fear among scientists that too few scientific studies can be replicated. This has been termed the “replication crisis” (Ioannidis 2005, Schooler 2014). Scientific papers often include inadequate detail to enable replication (Haddaway and Verhoeven 2015, Archmiller et al. 2020), many attempted replications of well-known scientific studies have failed in a wide variety of disciplines (Moonesinghe et al. 2007, Hewitt 2012, Bohannon 2015, Open Science Collaboration 2015), and rates of paper retractions are increasing (Cokol et al. 2008, Steen et al. 2013). Because of this, researchers are working to develop new ways for researchers, research institutions, research funders, and journals to overcome this problem (Peng 2011, Fiedler et al. 2012, Sandve et al. 2013, Stodden et al. 2013). Because replicating studies with new independent data is expensive, rarely published in high-impact journals, and sometimes even methodologically impossible, computationally reproducible research (most often termed simply “reproducible research”) is often suggested as a pathway for increasing our ability to assess the validity and rigor of scientific results (Peng 2011). Research is reproducible when others can reproduce the results of a scientific study given only the original data, code, and documentation (Essawy et al. 2020). This approach focuses on the research process after data collection is complete, and it has many (though not all) of the advantages of replicating studies with independent data while minimizing the largest barrier (i.e., the financial and time costs of collecting new data). Replicating studies remains the gold standard for rigorous scientific research, but reproducibility is increasingly viewed as a minimum standard that all scientists should strive toward (Peng 2011, Sandve et al. 2013, Archmiller et al. 2020, Culina et al. 2020). This commentary describes basic requirements for such reproducible research in the fields of ecology and evolutionary biology. In it, we make the case for why all research should be reproducible, explain why research is often not reproducible, and present a simple three-part framework all researchers can use to make their research more reproducible. These principles are applicable to researchers working in all sub-disciplines within ecology and evolutionary biology with data sets of all sizes and levels of complexity. Reproducible research is a by-product of careful attention to detail throughout the research process and allows researchers to ensure that they can repeat the same analysis multiple times with the same results, at any point in that process. Because of this, researchers who conduct reproducible research are the primary beneficiaries of this practice. First, reproducible research helps researchers remember how and why they performed specific analyses during the course of a project. This enables easier explanation of work to collaborators, supervisors, and reviewers, and it allows collaborators to conduct supplementary analyses more quickly and more efficiently. Second, reproducible research enables researchers to quickly and simply modify analyses and figures. This is often requested by supervisors, collaborators, and reviewers across all stages of a research project, and expediting this process saves substantial amounts of time. 
When analyses are reproducible, creating a new figure may be as easy as changing one value in a line of code and re-running a script, rather than spending hours recreating a figure from scratch. Third, reproducible research enables quick reconfiguration of previously conducted research tasks so that new projects that require similar tasks become much simpler and easier. Science is an iterative process, and many of the same tasks are performed over and over. Conducting research reproducibly enables researchers to re-use earlier materials (e.g., analysis code, file organization systems) to execute these common research tasks more efficiently in subsequent iterations. Fourth, conducting reproducible research is a strong indicator to fellow researchers of rigor, trustworthiness, and transparency in scientific research. This can increase the quality and speed of peer review, because reviewers can directly access the analytical process described in a manuscript. Peer reviewers' work becomes easier, and they may be able to answer methodological questions without asking the authors. Reviewers can check whether code matches the methods described in the text of a manuscript to make sure that authors correctly performed the analyses as described, and it increases the probability that errors are caught during the peer-review process, decreasing the likelihood of corrections or retractions after publication. It also protects researchers from accusations of research misconduct due to analytical errors, because it is unlikely that researchers would openly share fraudulent code and data with the rest of the research community. Finally, reproducible research increases paper citation rates (Piwowar et al. 2007, McKiernan et al. 2016) and allows other researchers to cite code and data in addition to publications. This enables a given research project to have more impact than it would if the data or methods were hidden from the public. For example, researchers can re-use code from a paper with similar methods, organize their data in the same manner as the original paper, and then cite the code from the original paper in their manuscript. A third team of researchers may conduct a meta-analysis on the phenomenon described in these two research papers and thus use and cite both papers and their data. Papers are more likely to be cited in these re-use cases if full information about data and analyses is available (Whitlock 2011, Culina et al. 2018).

Reproducible research also benefits others in the scientific community. Sharing data, code, and detailed research methods and results leads to faster progress in methodological development and innovation because research is more accessible to more scientists (Parr and Cummings 2005, Roche et al. 2015, Mislan et al. 2016). First, reproducible research allows others to learn from your work. Scientific research has a steep learning curve, and allowing others to access data and code gives them a head start on performing similar analyses. For example, researchers who are new to an analytical technique can use code shared with the research community by researchers with more experience with that technique to learn how to rigorously perform and validate these analyses. This allows researchers to conduct research that is more rigorous from the outset, rather than having to spend months or years trying to figure out current “best practices” through trial and error.
Modifying existing resources can also save time and effort for experienced researchers—even experienced coders can modify existing code much faster than they can write code from scratch. Sharing code thus allows experienced researchers to perform similar analyses more quickly. Second, reproducible research allows others to understand and reproduce a researcher's work. Allowing others to access data and code makes it easier for other scientists to perform follow-up studies to increase the strength of evidence for the phenomenon of interest. It also increases the likelihood that similar studies are compatible with one another, and that a group of studies can together provide evidence in support of or in opposition to a concept. In addition, sharing data and code increases the utility of these studies for meta-analyses that are important for generalizing and contextualizing the findings of studies on a topic. Meta-analyses in ecology and evolutionary biology are often hindered by incompatibility of data between studies, or lack of documentation for how those data were obtained (Stewart 2010, Culina et al. 2018). Well-documented, reproducible findings enhance the likelihood that data can be used in future meta-analyses (Gerstner et al. 2017). Third, reproducible research allows others to protect themselves from your mistakes. Mistakes happen in science. Allowing others to access data and code gives them a better chance to critically analyze the work, which can lead to coauthors or reviewers discovering mistakes during the revision process, or other scientists discovering mistakes after publication. This prevents mistakes from compounding over time and provides protection for collaborators, research institutions, funding organizations, journals, and others who may be affected when such mistakes happen. There are a number of reasons that most research is not reproducible. Rapidly developing technologies and analytical tools, novel interdisciplinary approaches, unique ecological study systems, and increasingly complex data sets and research questions hinder reproducibility, as does pressure on scientists to publish novel research quickly. This multitude of barriers can be simplified into four primary themes: (1) complexity, (2) technological change, (3) human error, and (4) concerns over intellectual property rights. Each of these concerns can contribute to making research less reproducible and can be valid in some scenarios. However, each of these factors can also be addressed easily via well-developed tools, protocols, and institutional norms concerning reproducible research. Science is difficult, and scientific research requires specialized (and often proprietary) knowledge and tools that may not be available to everyone who would like to reproduce research. For example, studies in the fields of ecology and evolutionary biology often involve study systems, mathematical models, and statistical techniques that require a large amount of domain knowledge to understand, and these analyses can therefore be difficult to reproduce for those with limited understanding of any of the necessary underlying bases of knowledge. Some analyses may require high-performance computing clusters that use several different programming languages and software packages, or that are designed for specific hardware configurations. 
Other analyses may be performed using proprietary software programs such as SAS statistical software (SAS Institute Inc., Cary, North Carolina, USA) or ArcGIS (Esri, Redlands, California, USA) that require expensive software licenses. Lack of knowledge, lack of institutional infrastructure, and lack of funding all make research less reproducible. However, most of these issues can be mitigated fairly easily. Researchers can cite primers on complex subjects or analyses to reduce knowledge barriers. They can also thoroughly annotate analytical code with comments explaining each step in an analysis or provide extensive documentation on research software. Using open software (when possible) makes research more accessible for other researchers as well.

Hardware and software used to analyze data both change over time, and they often change quickly. When old tools become obsolete, research becomes less reproducible. For example, reproducing research performed in 1960 using that era's computational tools would require a completely new set of tools today. Even research performed just a few years ago may have been conducted using software that is no longer available or is incompatible with other software that has since been updated. One minor update in a piece of software used in one minor analysis in an analytical workflow can render an entire project less reproducible. However, this too can be mitigated by using established tools in reproducible research. Careful documentation of the versions of software used in analyses is a baseline requirement that anyone can meet. There are also more advanced tools that can help overcome such challenges in making research reproducible, including software containers, which are described in further detail below.

Though fraudulent research is often cited as a reason to make research more reproducible (Ioannidis 2005, Laine et al. 2007, Crocker and Cooper 2011), many more innocent reasons exist as to why research is often difficult to reproduce (Elliott 2014). People forget small details of how they performed analyses. They fail to describe data collection protocols or analyses completely despite their best efforts and multiple reviewers checking their work. They fail to collect or thoroughly document data that seem unimportant during collection but later turn out to be vital for unforeseen reasons. Science is performed by fallible humans, and a wide variety of common events can render research less reproducible. While not all of these challenges can be avoided by performing research reproducibly, a well-documented research process can guard against small errors and sloppy analyses. For example, carefully recording details such as when and where data were collected, what decisions were made during data collection, and what labeling conventions were used can make a huge difference in making sure that those data can later be used appropriately or re-purposed. Unintentional errors often occur during the data wrangling stage of a project, and these can be mitigated by keeping multiple copies of data to prevent data loss, carefully documenting the process for converting raw data into clean data, and double-checking a small test set of data before manipulating the data set as a whole.

Researchers often hesitate to share data and code because doing so may allow other researchers to use data and code incorrectly or unethically.
Other researchers may use publicly available data without notifying authors, leading to incorrect assumptions about the data that result in invalid analyses. Researchers may use publicly available data or code without citing the original data owners or code writers, who then do not receive proper credit for gathering expensive data or writing time-consuming code. Researchers may want to conceal data from others so that they can perform new analyses on those data in the future without worrying about others scooping them using the shared data. Rational self-interest can lead to hesitation to share data and code via many pathways, and we acknowledge that making data openly available is likely the most controversial aspect of reproducible research (Cassey and Blackburn 2006, Hampton et al. 2013, Mills et al. 2015, Mills et al. 2016, Whitlock et al. 2016). However, new tools for sharing data and code (outlined below and in Table 1) are making it easier for researchers to receive credit for doing so and to prevent others from using their data during an embargo period. Conducting reproducible research is not exceedingly difficult nor does it require encyclopedic knowledge of esoteric research tools and protocols. Whether they know it or not, most researchers already perform much of the work required to make research reproducible. To clarify this point, we outline below some basic steps toward making research more reproducible in three stages of a research project: (1) before data analysis, (2) during analysis, and (3) after analysis. We discuss practical tips that anyone can use, as well as more advanced tools for those who would like to move beyond basic requirements (Table 1). Most readers will recognize that reproducible research largely consists of widely accepted best practices for scientific research and that striving to meet a reasonable benchmark of reproducibility is both more valuable and more attainable than researchers may think. Reproducibility starts in the planning stage, with sound data management practices. It does not arise simply from sharing data and code online after a project is done. It is difficult to reproduce research when data are disorganized or missing, or when it is impossible to determine where or how data originated. First, data should be backed up at every stage of the research process and stored in multiple locations. This includes raw data (e.g., physical data sheets or initial spreadsheets), clean analysis-ready data (i.e., final data sets), and steps in between. Because it is entirely possible that researchers unintentionally alter or corrupt data while cleaning it up, raw data should always be kept as a backup. It is good practice to scan and save data sheets or laboratory notebook pages associated with a data set to ensure that these are kept paired with the digital data set. Ideally, different copies should be stored in different locations and using different storage media (e.g., paper copies and an external hard drive and cloud storage) to minimize risk of data loss from any single cause. Computers crash, hard drives are misplaced and stolen, and servers are hacked—researchers should not leave themselves vulnerable to those events. Digital data files should be stored in useful, flexible, portable, nonproprietary formats. Storing data digitally in a “flat” file format is almost always a good idea. 
Flat file formats are those that store data as plain text with one record per line (e.g., .csv or .txt files) and are the most portable formats across platforms, as they can be opened by anyone without proprietary software programs. For more complex data types, multi-dimensional relational formats such as json, hdf5, or other discipline-specific formats (e.g., biom and EML) may be appropriate. However, the complexity of these formats makes them difficult for many researchers to access and use appropriately, so it is best to stick with simpler file formats when possible. It is often useful to transform data into a “tidy” format (Wickham 2014) when cleaning up and standardizing raw data. Tidy data are in long format (i.e., variables in columns, observations in rows), have consistent data structure (e.g., character data are not mixed with numeric data for a single variable), and have informative and appropriately formatted headers (e.g., reasonably short variable names that do not include problematic characters like spaces, commas, and parentheses). Data in this format are easy to manipulate, model, and visualize during analysis. Metadata explaining what was done to clean up the data and what each of the variables means should be stored along with the data. Data are useless unless they can be interpreted (Roche et al. 2015), and metadata is how we maximize data interpretability across potential users. At a minimum, all data sets should include informative metadata that explains how and why data were collected, what variable names mean, whether a variable consists of raw or transformed data, and how observations are coded. Metadata should be placed in a sensible location that pairs it with the data set it describes. A few rows of metadata above a table of observations within the same file may work in some cases, or a paired text file can be included in the same directory as the data if the metadata must be more detailed. In the latter case, it is best to stick with a simple .txt file for metadata to maximize portability. Finally, researchers should organize files in a sensible, user-friendly structure and make sure that all files have informative names. It should be easy to tell what is in a file or directory from its name, and a consistent naming protocol (e.g., ending the filename with the date created or version number) provides even more information when searching through files in a directory. A consistent naming protocol for both directories and files also makes coding simpler by placing data, analyses, and products in logical locations with logical names. It is often more useful to organize files in small blocks of similar files, rather than having one large directory full of hundreds of files. For example, Noble (2009) suggests organizing computational projects within a main directory for each project, with sub-directories for the manuscript (doc/), data files (data/), analyses (scripts/ or src/), and analysis products (results/) within that directory. While this specific organization scheme may differ for other types of research, keeping all of the research products and documentation for a given project organized in this way makes it much easier to find everything at all stages of the research process and to archive it or share it with others once the project is finished. 
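As a concrete illustration of these practices, the sketch below sets up a Noble-style directory structure and converts a wide field spreadsheet into a tidy, flat .csv file. Python and the pandas library are assumed here only for illustration (the same ideas apply in R or any other scripting language), and the project, file, and column names are hypothetical.

```python
# Minimal sketch: project layout and tidy, flat data files.
# All directory, file, and column names are hypothetical examples.
from pathlib import Path
import pandas as pd

# Create a predictable directory structure for a new project (after Noble 2009).
project = Path("beetle_survey")
for subdir in ("data/raw", "data/clean", "doc", "scripts", "results"):
    (project / subdir).mkdir(parents=True, exist_ok=True)

# Reshape a wide field spreadsheet (one column per species) into tidy long
# format: one observation per row, one variable per column.
wide_df = pd.read_csv(project / "data" / "raw" / "counts_wide.csv")
tidy_df = wide_df.melt(id_vars=["site", "date"], var_name="species", value_name="count")

# Store the cleaned data as a plain-text, non-proprietary file; the raw file
# stays untouched in data/raw/ as a backup.
tidy_df.to_csv(project / "data" / "clean" / "counts_tidy.csv", index=False)
```

A short paired metadata file (e.g., a plain .txt file in data/clean/) describing how the tidy file was produced and what each column means would complete this small package.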
Throughout the research process, from data acquisition to publication, version control can be used to record a project's history and provide a log of changes that have occurred over the life of a project or research group. Version control systems record changes to a file or set of files over time so that you can recall specific versions later, compare differences between versions of files, and even revert files back to previous states in the event of mistakes. Many researchers use version control systems to track changes in code and documents over time. The most popular version control system is Git, which is often used via hosting services such as GitHub, GitLab, and BitBucket (Table 1). These systems are relatively easy to set up and use, and they systematically store snapshots of data, code, and accompanying files throughout the duration of a project. Version control also enables a specific snapshot of data or code to be easily shared, so that code used for analyses at a specific point in time (e.g., when a manuscript is submitted) can be documented, even if that code is later updated. When possible, all data wrangling and analysis should be performed using coding scripts—as opposed to using interactive or point-and-click tools—so that every step is documented and repeatable by yourself and others. Code both performs operations on data and serves as a log of analytical activities. Because of this second function, code (unlike point-and-click programs) is inherently reproducible. Most errors are unintentional mistakes made during data wrangling or analysis, so having a record of these steps ensures that analyses can be checked for errors and are repeatable on future data sets. If operations are not possible to script, then they should be well-documented in a log file that is kept in the appropriate directory. Analytical code should be thoroughly annotated with comments. Comments embedded within code serve as metadata for that code, substantially increasing its usefulness. Comments should contain enough information for an informed stranger to easily understand what the code does, but not so much that sorting through comments is a chore. Code comments can be tested for this balance by a friend who is knowledgeable about the general area of research but is not a project collaborator. In most scripting languages, the first few lines of a script should include a description of what the script does and who wrote it, followed by small blocks that import data, packages, and external functions. Data cleaning and analytical code then follows those sections, and sections are demarcated using a consistent protocol and sufficient comments to explain what function each section of code performs. Following a clean, consistent coding style makes code easier to read. Many well-known organizations (e.g., RStudio, Google) offer style guidelines for software code that were developed by many expert coders. Researchers should take advantage of these while keeping in mind that all style guides are subjective to some extent. Researchers should work to develop a style that works for them. This includes using a consistent naming convention (e.g., camelCase or snake_case) to name objects and embedding meaningful information in object names (e.g., using “_mat” as a suffix for objects to denote matrices or “_df” to denote data frames). Code should also be written in relatively short lines and grouped into blocks, as our brains process narrow columns of data more easily than longer ones (Martin 2009). 
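The sketch below illustrates this kind of script structure: a header describing the script and its author, a block of imports, input files and parameters defined once at the top, and annotated blocks for cleaning and analysis. Python with pandas and statsmodels is assumed purely for illustration, and the data set, model, and file names are hypothetical, carried over from the earlier example.

```python
#!/usr/bin/env python3
# fit_count_model.py
# Purpose: clean tidy beetle counts and fit a simple Poisson regression.
# Author: A. Researcher (hypothetical), 2024-05-01

# --- Imports ------------------------------------------------------------
import pandas as pd
import statsmodels.formula.api as smf

# --- Input files and parameters (defined once, used throughout) ---------
input_file = "data/clean/counts_tidy.csv"
output_file = "results/model_summary.txt"
min_count = 0          # records below this value are treated as data-entry errors

# --- Data cleaning ------------------------------------------------------
# Drop records with missing species names and impossible (negative) counts.
counts_df = pd.read_csv(input_file)
counts_df = counts_df.dropna(subset=["species"])
counts_df = counts_df[counts_df["count"] >= min_count]

# --- Analysis -----------------------------------------------------------
# Poisson regression of counts on species identity, written to results/.
model_fit = smf.poisson("count ~ species", data=counts_df).fit()
with open(output_file, "w") as f:
    f.write(model_fit.summary().as_text())
```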
Blocks of code also keep related tasks together and can function like paragraphs to make code more comprehensible. There are several ways to prevent coding mistakes and make code easier to use. First, researchers should automate repetitive tasks. For example, if a set of analysis steps are being used repeatedly, those steps can be saved as a function and loaded at the top of the script. This reduces the size of a script and eliminates the possibility of accidentally altering some part of a function so that it works differently in different locations within a script. Similarly, researchers can use loops to make code more efficient by performing the same task on multiple values or objects in series (though it is also important to note that nesting too many loops inside one another can quickly make code incomprehensible). A third way to reduce mistakes is to reduce the number of hard-coded values that must be changed to replicate analyses on an updated or new data set. It is often best to read in the data file(s) and assign parameter values at the beginning of a script, so that those variables can then be used throughout the rest of the script. When operating on new data, these variables can then be changed once at the beginning of a script rather than multiple times in locations littered throughout the script. Because incompatibility between operating systems or program versions can inhibit the reproducibility of research, the current gold standard for ensuring that analyses can be used in the future is to create a software container, such as a Docker (Merkel 2014) or Singularity (Kurtzer et al. 2017) image (Table 1). Containers are standalone, portable environments that contain the entire computing environment used in an analysis: software, all of its dependencies, libraries, binaries, and configuration files, all bundled into one package. Containers can then be archived or shared, allowing them to be used in the future, even as packages, functions, or libraries change over time. If creating a software container is infeasible or a larger step than researchers are willing to take, it is important to thoroughly report all software packages used, including version numbers. After the steps above have been followed, it is time for the step most people associate with reproducible research: sharing research with others. As should be clear by now, sharing the data and code is far from the only component of reproducible research; however, once Steps 1 and 2 above are followed, it becomes the easiest step. All input data, scripts, program versions, parameters, and important intermediate results should be made publicly and easily accessible. Various solutions are now available to make data sharing convenient, standardized, and accessible in a variety of research areas. There are many ways to do this, several of which are described below. Just as it is better to use scripts than interactive tools in analysis, it is better to produce tables and figures directly from code than to manipulate these using Adobe Illustrator, Microsoft PowerPoint, or other image editing programs. A large number of errors in finished manuscripts come from not remembering to change all relevant numbers or figures when a part of an analysis changes, and this task can be incredibly time-consuming when revising a manuscript. Truly reproducible figures and tables are created directly with code and integrated into documents in a way that allows automatic updating when analyses are re-run, creating a “dynamic” document. 
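For instance, a figure can be written straight from the analysis script into the results/ directory. The sketch below assumes matplotlib and reuses the hypothetical file names from the examples above; it is one possible approach, not a prescribed workflow.

```python
# Minimal sketch: produce a figure directly from code, so it is regenerated
# automatically whenever the analysis is re-run. File names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

counts_df = pd.read_csv("data/clean/counts_tidy.csv")
mean_counts = counts_df.groupby("species")["count"].mean()

fig, ax = plt.subplots(figsize=(4, 3))
mean_counts.plot.bar(ax=ax)
ax.set_ylabel("Mean count per site")
fig.tight_layout()
fig.savefig("results/fig1_mean_counts.png", dpi=300)
```

Because the script always writes the figure to the same location, document preparation tools can pull the updated version in automatically.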
For example, documents written in LaTeX and markdown incorporate figures directly from a directory, so a figure will be updated in the document when the figure is updated in the directory (see Xie 2015 for a much lengthier discussion of dynamic documents). Both LaTeX and markdown can also be used to create presentations that can incorporate live-updated figures when code or data change, so that presentations can be reproducible as well. If using one of these tools is too large a leap, then simply producing figures directly from code—instead of adding annotations and arranging panels post hoc—can make a substantial difference in increasing the reproducibility of these products. Beyond creating dynamic documents, it is possible to make data wrangling, analysis, and creation of figures, tables, and manuscripts a “one-button” process using GNU Make (https://www.gnu.org/software/make/). GNU Make is a simple, yet powerful tool that can be used to coordinate and automate command-line processes, such as a series of independent scripts. For example, a Makefile can be written that will take the input data, clean and manipulate it, analyze it, produce figures and tables with results, and update a LaTeX or markdown manuscript document with those figures, tables, and any numbers included in the results. Setting up research projects to run in this way takes some time, but it can substantially expedite re-analyses and reduce copy-paste errors in manuscripts. Currently, code and data that can be used to replicate research are often found in the supplementary material of journal articles. Some journals (e.g., eLife) are even experimenting with embedding data and code in articles themselves. However, this is not a fail-safe method of archiving data and analyses. Supplementary materials can be lost if a journal switches publishers or when a publisher changes its website. In addition, research is only reproducible if it can be accessed, and many papers are published in journals that are locked behind paywalls that make them inaccessible to many researchers (Desjardins-Proulx et al. 2013, McKiernan et al. 2016, Alston 2019). To increase access to publications, authors can post preprints of final (but preacceptance) versions of manuscripts on a preprint server, or postprints of manuscripts on postprint servers. There are several widely used preprint servers (see Table 1 for three examples), and libraries at many research institutions host postprint servers. Similarly, data and code shared on personal websites are only available as long as websites are maintained and can be difficult to transfer when researchers migrate to another domain or website provider. Materials archived on personal websites are also often difficult for other scientists to find, as they are not usually linked to the published research and lack a permanent digital object identifier (DOI). To make research accessible to everyone, it is therefore better to use tools like data and code repositories than personal websites. Data archiving in online repositories has become more popular in recent years, a trend resulting from a combination of improvements in technology for sharing data, an increase in omics-scale data sets, and an increasing number of publisher and funding organizations who encourage or mandate data archiving (Whitlock et al. 2010, Whitlock 2011, Nosek et al. 2015). Data repositories are large databases that collect, manage, and store data sets for analysis, sharing, and reporting. 
Repositories may be either subject- or data-type-specific, or cross-disciplinary general repositories that accept multiple data types. Some are free, and others require a fee for depositing data. Journals often recommend appropriate repositories on their websites, and these recommendations should be consulted when submitting a manuscript. Three commonly used general purpose repositories are Dryad, Zenodo, and Figshare; each of these creates a DOI that allows data and code to be citable by others. Before choosing a repository, researchers should explore commonly used options in their specific fields of research. When data, code, software, and products of a research project are archived together, these are termed a “research compendium” (Gentleman and Lang 2007). Research compendia are increasingly common, although standards for what is included in research compendia differ between scientific fields. They provide a standardized and easily recognizable way to organize the digital materials of a research project, which enables other researchers to inspect, reproduce, and extend research (Marwick et al. 2018). In particular, the Open Science Framework (OSF; http://osf.io/) is a project management repository that goes beyond the repository features of Dryad, Zenodo, and Figshare to integrate and share components of a research project using collaborative tools. The goal of the OSF is to enable research to be shared at every step of the scientific process—from developing a research idea and designing a study, to storing and analyzing collected data and writing and publishing reports or papers (Sullivan et al. 2019). Open Science Framework is integrated with many other reproducible research tools, including widely used preprint servers, version control software, and publishers. While many researchers associate reproducible research primarily with a set of advanced tools for sharing research, reproducibility is just as much about simple work habits as the tools used to share data and code. We ourselves are not perfect reproducible researchers—we do not use all the tools mentioned in this commentary all the time and often fail to follow our own advice (almost always to our regret). Nevertheless, we recognize that reproducible research is a process rather than a destination and work hard to consistently increase the reproducibility of our work. We encourage others to do the same. Researchers can make strides toward a more reproducible research process by simply thinking carefully about data management and organization, coding practices, and processes for making figures and tables (Fig. 1). Time and expertise must be invested in learning and adopting these tools and tips, and this investment can be substantial. Nevertheless, we encourage our fellow researchers to work toward more open and reproducible research practices so we can all enjoy the resulting improvements in work habits, collaboration, scientific rigor, and trust in science. Many thanks to J.G. Harrison, B.J. Rick, A.L. Lewanski, E.A. Johnson, and F.S. Dobson for providing helpful comments on prepublication versions of this manuscript and to C.A. Buerkle for inspiring this project during his Computational Biology course at the University of Wyoming.
