Abstract

Introduction: Numerous transcripts annotated as long noncoding RNAs (lncRNAs) play central roles in cancer biology. For many, their noncoding status is merely a presumption. Genome-wide sequencing of ribosomal footprints has nominated thousands of unstudied open reading frames (ORFs) within lncRNAs, representing a potential expansion of the proteome. Here, we investigate previously unstudied proteins in cancer cell biology.

Methods: Ribosome profiling data were analyzed with RibORF. L1000 expression profiling was performed on 4 cell lines 96 hours after infection with lentivirus encoding selected ORFs. A CRISPR library was screened across 8 cell lines, with sgRNA sequencing on days 0, 7, and 21 post-infection.

Results: We analyzed ribosome profiling data for 14 cell lines (~320 million sequencing reads). We nominated 28530 non-canonical ORFs within annotated protein-coding genes, 6697 ORFs in annotated lncRNAs, and 1252 ORFs in pseudogenes. For further study, we selected 553 candidate ORFs that exhibited compelling features, including DNA conservation, translational efficiency, and protein domains, among others. We validated protein expression for 260 of 553 ORFs (47%): 89 (16%) had supporting peptides in deep-coverage proteomics datasets; 233 (42%) expressed protein after ectopic expression of individual V5-tagged cDNAs; and 10 of 30 tested untagged ORFs expressed protein by biochemical in vitro translation. Ectopic overexpression followed by RNA profiling revealed 259 cDNAs that caused cellular transcriptional changes in at least one of four cancer cell lines (A549, HA1E, A375, MCF7). 137 of the 259 (49%) were validated proteins. As controls, we generated methionine-mutant constructs: 65 of 71 mutant cDNA experiments were unable to cause similar expression changes. We used a CRISPR library to identify novel ORF dependencies in 8 Cas9-derivatized cancer cell lines (MCF7, A549, A375, PC3, HEPG2, HELA, HA1E, HT29). For 42 ORFs, ≥ 2 targeting sgRNAs produced a log fold change of ≤ -1 (depletion) in ≥ 1 cell line. These ORFs were re-tested with a second sgRNA library. Next, we investigated compelling candidates more deeply using immunoprecipitation coupled with mass spectrometry. For the cancer outlier transcript LINC01314, which encodes a highly conserved 59 amino acid protein harboring a cortexin domain (pfam domain cl12620), we found interactions with IMMT, SAMM50, and CHCHD3, members of a mitochondrial complex. Another example is LINC00116, which encodes a highly conserved 56 amino acid protein that binds the importin-nuclear pore complex.

Conclusion: We establish a framework to discover, validate, and characterize unstudied proteins. About half of tested ORFs generated a detectable protein, and of these, half impacted cellular transcription. We discover novel gene dependencies and are elucidating mechanisms for several ORFs. Together, our work is the first large-scale attempt to study the role of unannotated proteins in cancer cell biology.

Citation Format: John R. Prensner, Oana Enache, Zhe Ji, Karsten Krug, Karl R. Clauser, Xiaoping Yang, Federica Piccioni, David E. Root, Todd R. Golub. Integrative functional proteogenomics for unannotated or uncharacterized proteins in cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 4344.
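
Illustrative note: the CRISPR hit criterion stated in the Results (at least 2 targeting sgRNAs with a log fold change of -1 or lower in at least 1 cell line) can be expressed as a simple filter. The following is a minimal sketch, not the authors' pipeline. It assumes that the two sgRNAs must pass the cutoff within the same cell line, which the abstract does not state explicitly, and that log fold changes have already been computed against the day 0 sample. The function name, record layout, and ORF and sgRNA labels are hypothetical.

```python
from collections import defaultdict


def call_dependencies(records, lfc_cutoff=-1.0, min_guides=2, min_cell_lines=1):
    """Return {orf: [cell lines]} for ORFs meeting the depletion criterion.

    records: iterable of (orf, cell_line, sgrna, log_fold_change) tuples.
    Assumption: the >= min_guides requirement is applied per cell line.
    """
    # Collect, per (ORF, cell line), the sgRNAs at or below the cutoff.
    depleted = defaultdict(set)
    for orf, cell_line, sgrna, lfc in records:
        if lfc <= lfc_cutoff:
            depleted[(orf, cell_line)].add(sgrna)

    # Keep cell lines in which enough distinct sgRNAs were depleted.
    hit_lines = defaultdict(set)
    for (orf, cell_line), guides in depleted.items():
        if len(guides) >= min_guides:
            hit_lines[orf].add(cell_line)

    # Keep ORFs that score in enough cell lines.
    return {
        orf: sorted(lines)
        for orf, lines in hit_lines.items()
        if len(lines) >= min_cell_lines
    }


# Toy data (hypothetical values, for illustration only).
example_records = [
    ("ORF_A", "A549", "g1", -1.4),
    ("ORF_A", "A549", "g2", -1.0),
    ("ORF_A", "MCF7", "g1", -0.2),
    ("ORF_B", "A549", "g1", -1.5),
    ("ORF_B", "A549", "g2", -0.3),
    ("ORF_B", "MCF7", "g2", -1.2),
]

hits = call_dependencies(example_records)
```

In this toy data, ORF_A qualifies because two sgRNAs are depleted in A549, while ORF_B does not because its two depleted sgRNAs fall in different cell lines. Under the alternative reading, where sgRNAs may be pooled across cell lines, the grouping key would be the ORF alone.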