Transcriptomic data is often expensive and difficult to generate in large cohorts relative to genomic data; therefore, it is often important to integrate multiple transcriptomic datasets from both microarray- and next generation sequencing (NGS)-based transcriptomic data across similar experiments or clinical trials to improve analytical power and discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges including reannotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically reannotating the data and removing the batch effect. The removal of the batch effect is verified with principal component analysis and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. These transcriptomic datasets were from multiple high-throughput platforms, namely, array-based Affymetrix and Agilent platforms, and NGS-based Illumina paired-end RNA-seq platform. Furthermore, we compared the GEDI package to existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline including verification of both batch effect removal and data integration for downstream genomic and bioinformatics applications. © 2024 The Author(s). Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: ReadGE, a function to import gene expression datasets Basic Protocol 2: GEDI, a function to reannotate and merge gene expression datasets Basic Protocol 3: BatchCorrection, a function to remove batch effects from gene expression data Basic Protocol 4: VerifyGEDI, a function to confirm successful integration of gene expression data.
Read full abstract