Next-generation sequencing (NGS) has propelled the field of genomics forward and produced whole genome sequences for numerous animal species and model organisms. However, despite this wealth of sequence information, comprehensive gene annotation efforts have proven challenging, especially for small proteins. Notably, conventional protein annotation methods were designed to intentionally exclude putative proteins encoded by short open reading frames (sORFs) less than 300 nucleotides in length to filter out the exponentially higher number of spurious noncoding sORFs throughout the genome. As a result, hundreds of functional small proteins called microproteins (<100 amino acids in length) have been incorrectly classified as noncoding RNAs or overlooked entirely. Here we provide a detailed protocol to leverage free, publicly available bioinformatic tools to query genomic regions for microprotein-coding potential based on evolutionary conservation. Specifically, we provide step-by-step instructions on how to examine sequence conservation and coding potential using Phylogenetic Codon Substitution Frequencies (PhyloCSF) on the user-friendly University of California Santa Cruz (UCSC) Genome Browser. Additionally, we detail steps to efficiently generate multiple species alignments of identified microprotein sequences to visualize amino acid sequence conservation and recommend resources to analyze microprotein characteristics, including predicted domain structures. These powerful tools can be used to help identify putative microprotein-coding sequences in noncanonical genomic regions or to rule out the presence of a conserved coding sequence with translational potential in a noncoding transcript of interest.
Read full abstract