The hallmark of genome sequencing projects is that they provide genetic information on a species together with functional annotations of its genes and proteins. This process relies heavily on genome annotation based on homology detection against previously known genomic data. Rapid advances in genome sequencing technologies have made genome sequencing affordable and have shortened the time needed to generate genomic data; hence, genome sequencing has become common practice. The annotation and characterization of newly sequenced genomes are crucial to the success of any biological experiment based on genomic data. Proteogenomics requires an annotated genome for the further characterization of proteomics-based studies, which are coupled with genomic and RNA-seq data. This chapter describes the genome annotation process from scratch, from genome sequencing through general genome annotation to specialized genome annotation using BLAST, Blast2GO (now OmicsBox), PANNZER, Gene Ontology (GO), and KEGG. It also covers repeat identification and masking, gene prediction, the genome-wide annotation process, and an RNA-seq protocol. It further focuses on genes of interest, such as genes associated with biosynthetic gene clusters (BGCs), carbohydrate-active enzymes (CAZymes), serpins (serine protease inhibitors), membrane transporters, and toxins. Manual annotation is also a critical step for at least some groups of genes, which are often important for the species under consideration. Finally, the chapter briefly describes the phylogenetic and phylogenomic processes required during genome annotation.