The accuracy of genomic annotation is crucial for subsequent functional investigations; however, the computational protocols used in high-throughput annotation of open reading frames (ORFs) can introduce inconsistencies. These inconsistencies, which lead to non-uniform extension or truncation of sequence ends, pose challenges for downstream analyses. Existing strategies to rectify them are time-consuming and labor-intensive, and dedicated approaches are lacking. To address this gap, we developed toGC, a tool that integrates genomic annotation with RNA-seq datasets to rectify annotation inconsistencies. Using toGC, we achieved nearly 100% accuracy in correcting inconsistencies in published Phytophthora sojae ORFs. We applied this pipeline to the GPCR-bigram gene family, which was predicted to have 42 members in the P. sojae genome but lacked experimental validation. With toGC, we identified 32 GPCR-bigram ORFs with inconsistencies between previous annotations and toGC-corrected sequences. Notably, five of these genes (GPCR-TKL9, GPCR-TKL15, GPCR-PDE3, GPCR-AC3, and GPCR-AC4) showed substantial inconsistencies. Experimental gene annotation confirmed the effectiveness of toGC, as sequences obtained through cloning matched those annotated by toGC. Importantly, we discovered two novel GPCRs (GPCR-AC3 and GPCR-AC4) that were previously mispredicted as a single gene. CRISPR/Cas9-mediated knockout experiments revealed the involvement of GPCR-AC4, but not GPCR-AC3, in oospore production, further confirming their status as two separate genes. Beyond P. sojae, toGC also performed reliably in Phytophthora capsici and Pythium ultimum, further supporting the robustness of the pipeline. Our findings highlight the utility of toGC for reliable gene model correction, facilitating investigations into biological functions and offering potential applications in diverse species analyses.