Functional annotation and comparative modeling of ligninolytic enzymes from Trametes villosa (SW.) Kreisel for biotechnological applications
Functional annotation of the Trametes villosa genome was performed to search for Class II peroxidase proteins in this white-rot fungus, which can be valuable for several biotechnological processes. After sequence identification and manual curation, five proteins were selected to build 3D models by comparative modeling. Analysis of the sequence and structural features of the selected targets revealed the presence of two putative Lignin Peroxidases and three putative Manganese Peroxidases in this fungal genome. All 3D models had a folding pattern similar to that of the selected 3D structure templates. After minimization and validation steps, the best 3D models were subjected to docking studies and molecular dynamics to identify structural requirements and the interactions required for molecular recognition. Two reliable 3D models of Class II peroxidases, with typical catalytic site and architecture, and their protein sequences are indicated for recombinant production in biotechnological applications, such as bioenergy. Communicated by Ramaswamy H. Sarma
- Research Article
31
- 10.1074/mcp.m112.019471
- Oct 1, 2012
- Molecular & Cellular Proteomics
Peptide sequencing by computational assignment of tandem mass spectra to a database of putative protein sequences provides an independent approach to confirming or refuting protein predictions based on large-scale DNA and RNA sequencing efforts. This use of mass spectrometrically-derived sequence data for testing and refining predicted gene models has been termed proteogenomics. We report herein the application of proteogenomic methodology to a database of 10.9 million tandem mass spectra collected over a period of two years from proteolytically generated peptides isolated from the model legume Medicago truncatula. These spectra were searched against a database of predicted M. truncatula protein sequences generated from public databases, in silico gene model predictions, and a whole-genome six-frame translation. This search identified 78,647 distinct peptide sequences, and a comparison with the publicly available proteome from the recently published M. truncatula genome supported translation of 9,843 existing gene models and identified 1,568 novel peptides suggesting corrections or additions to the current annotations. Each supporting and novel peptide was independently validated using mRNA-derived deep sequencing coverage and an overall correlation of 93% between the two data types was observed. We have additionally highlighted examples of several aspects of structural annotation for which tandem MS provides unique evidence not easily obtainable through typical DNA or RNA sequencing. Proteogenomic analysis is a valuable and unique source of information for the structural annotation of genomes and should be included in such efforts to ensure that the genome models used by biologists mirror as accurately as possible what is present in the cell.
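The whole-genome six-frame translation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard genetic code and plain uppercase DNA strings; the function names are hypothetical and are not taken from the paper's pipeline, which is not described at this level of detail.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Map frame labels (+1..+3 as 1..3, reverse strand as -1..-3) to proteins."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(frame, protein)
```

In a proteogenomic search, each translated frame would typically be split at stop codons (`*`) and the resulting segments added to the sequence database; that step is omitted here.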
- Research Article
18
- 10.1186/s43008-021-00083-x
- Nov 1, 2021
- IMA Fungus
Background: The genome sequence data of more than 65985 species are publicly available as of October 2021 within the National Center for Biotechnology Information (NCBI) database alone, and additional genome sequences are available in other databases and continue to accumulate at a rapid pace. However, an error-free functional annotation of these genomes is essential for the research communities to fully utilize these data in an optimum and efficient manner. Results: An analysis of proteome sequence data of 689 fungal species (7.15 million protein sequences) was conducted to identify the presence of functional annotation errors. Proteins associated with calcium signaling events, including calcium dependent protein kinases (CDPKs), calmodulins (CaM), calmodulin-like (CML) proteins, WRKY transcription factors, selenoproteins, and proteins associated with the terpene biosynthesis pathway, were targeted in the analysis. Genes associated with CDPKs and selenoproteins are known to be absent in fungal genomes. Our analysis, however, revealed the presence of proteins that were functionally annotated as CDPK proteins. InterproScan analysis nevertheless indicated that none of the protein sequences annotated as “calcium dependent protein kinase” encoded calcium binding EF-hands at the regulatory domain. Similarly, none of the protein sequences annotated as a “selenocysteine” were found to contain a Sec (U) amino acid. Proteins annotated as CaM and CMLs also had significant discrepancies. CaM proteins should contain four calcium binding EF-hands; however, a range of 2–4 calcium binding EF-hands were present in the fungal proteins that were annotated as CaM proteins. Similarly, CMLs should possess four calcium binding EF-hands, but some of the CML annotated fungal proteins possessed either three or four calcium binding EF-hands. WRKY transcription factors are characterized by the presence of a WRKY domain and are confined to the plant kingdom. Several fungal proteins, however, were annotated as WRKY transcription factors, even though they did not contain a WRKY domain. Conclusion: The presence of functional annotation errors in fungal genome and proteome databases is of considerable concern and needs to be addressed in a timely manner.
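One of the checks described in this abstract, that a protein annotated as a selenoprotein should contain a Sec (U) residue, can be sketched at the string level. This is only an illustration: the record layout `(identifier, description, sequence)`, the keyword, and the function names are assumptions, and the domain-level checks in the study relied on InterproScan, which is not reproduced here.

```python
def lacks_selenocysteine(sequence: str) -> bool:
    """True if the one-letter protein sequence has no Sec (U) residue."""
    return "U" not in sequence.upper()


def flag_suspect_annotations(records, keyword="seleno"):
    """Return identifiers whose description mentions the keyword
    but whose sequence contains no U residue.

    `records` is an iterable of (identifier, description, sequence) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    return [
        identifier
        for identifier, description, sequence in records
        if keyword in description.lower() and lacks_selenocysteine(sequence)
    ]


if __name__ == "__main__":
    example = [
        ("p1", "selenoprotein W", "MKUA"),
        ("p2", "selenocysteine-containing protein", "MKCA"),
        ("p3", "protein kinase", "MKAA"),
    ]
    print(flag_suspect_annotations(example))
```

A flagged entry is only a candidate for review; a missing U can also result from how the sequence was translated or stored, so manual inspection would still be needed.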
- Conference Article
- 10.1109/icsct53883.2021.9642535
- Aug 5, 2021
Protein sequences are strings of symbols, generally different characters representing the 20 amino acids used in human proteins; these sequences can range from very short to very long. There are many protein databases in which the sequences are known but the function and functional annotation are not. Protein function prediction (PFP), as well as functional annotation (FA), from structure or sequence is a major field of bioinformatics, as is the question of how to judge how well these algorithms perform. We propose a novel method that converts the protein function problem into a language translation problem, in which a newly proposed protein sequence language is encoded and a protein function language is decoded, and we build a recurrent neural machine encoding decoding translator (RNNEDT) based on the recurrent neural network model. Its strong performance on training and testing datasets suggests that the proposed system is a promising direction for PFP. The proposed system recasts PFP as a language translation problem, applies a recurrent neural network machine translation model to it, and visualizes the annotation of biological process (BP), molecular function (MF), and cellular component (CP).
- Research Article
6
- 10.1186/s12859-016-1295-z
- Nov 11, 2016
- BMC Bioinformatics
Background: Development of automatable processes for clustering proteins into functionally relevant groups is a critical hurdle as an increasing number of sequences are deposited into databases. Experimental function determination is exceptionally time-consuming and can't keep pace with the identification of protein sequences. A tool, DASP (Deacon Active Site Profiler), was previously developed to identify protein sequences with active site similarity to a query set. Development of two iterative, automatable methods for clustering proteins into functionally relevant groups exposed algorithmic limitations of DASP. Results: The accuracy and efficiency of DASP were significantly improved through six algorithmic enhancements implemented in two stages: DASP2 and DASP3. Validation demonstrated that DASP3 provides greater score separation between true positives and false positives than earlier versions. In addition, DASP3 shows similar performance to previous versions in clustering protein structures into isofunctional groups (validated against manual curation), but DASP3 gathers and clusters protein sequences into isofunctional groups more efficiently than DASP and DASP2. Conclusions: DASP algorithmic enhancements resulted in improved efficiency and accuracy of identifying proteins that contain active site features similar to those of the query set. These enhancements provide incremental improvement in structure database searches and initial sequence database searches; however, the enhancements show significant improvement in iterative sequence searches, suggesting DASP3 is an appropriate tool for the iterative processes required for clustering proteins into isofunctional groups.
- Research Article
67
- 10.1016/j.crvi.2011.06.004
- Aug 25, 2011
- Comptes Rendus. Biologies
Pleurotus ostreatus heme peroxidases: An in silico analysis from the genome sequence to the enzyme molecular structure
- Research Article
21
- 10.1186/s12864-023-09924-y
- Jan 2, 2024
- BMC Genomics
Background: Microsporidia are a large taxon of intracellular pathogens characterized by extraordinarily streamlined genomes with unusually high sequence divergence and many species-specific adaptations. These unique factors pose challenges for traditional genome annotation methods based on sequence similarity. As a result, many of the microsporidian genomes sequenced to date contain numerous genes of unknown function. Recent innovations in rapid and accurate structure prediction and comparison, together with the growing amount of data in structural databases, provide new opportunities to assist in the functional annotation of newly sequenced genomes. Results: In this study, we established a workflow that combines sequence and structure-based functional gene annotation approaches employing a ChimeraX plugin named ANNOTEX (Annotation Extension for ChimeraX), allowing for visual inspection and manual curation. We employed this workflow on a high-quality telomere-to-telomere sequenced tetraploid genome of Vairimorpha necatrix. First, the 3080 predicted protein-coding DNA sequences, of which 89% were confirmed with RNA sequencing data, were used as input. Next, ColabFold was used to create protein structure predictions, followed by a Foldseek search for structural matching to the PDB and AlphaFold databases. The subsequent manual curation, using sequence and structure-based hits, increased the accuracy and quality of the functional genome annotation compared to results using only traditional annotation tools. Our workflow resulted in a comprehensive description of the V. necatrix genome, along with a structural summary of the most prevalent protein groups, such as the ricin B lectin family. In addition, and to test our tool, we identified the functions of several previously uncharacterized Encephalitozoon cuniculi genes. Conclusion: We provide a new functional annotation tool for divergent organisms and employ it on a newly sequenced, high-quality microsporidian genome to shed light on this uncharacterized intracellular pathogen of Lepidoptera. The addition of a structure-based annotation approach can serve as a valuable template for studying other microsporidian or similarly divergent species.
- Research Article
5
- 10.1186/s12859-019-3038-4
- Sep 5, 2019
- BMC Bioinformatics
Background: As genome sequencing projects grow rapidly, the diversity of organisms with recently assembled genome sequences peaks at an unprecedented scale, thereby highlighting the need to make gene functional annotations fast and efficient. However, the (high) quality of such annotations must be guaranteed, as this is the first indicator of the genomic potential of every organism. Automatic procedures help accelerate the annotation process, though they decrease the confidence and reliability of the outcomes. Manually curating a genome-wide annotation of the functions of genes, enzymes and transporter proteins is a highly time-consuming, tedious and impractical task, even for the most proficient curator. Hence, a semi-automated procedure, which balances the two approaches, will increase the reliability of the annotation while speeding up the process. In fact, a prior analysis of the annotation algorithm may improve its performance, by tuning its parameters, hastening the downstream processing and the manual curation of assigning functions to genes encoding proteins. Results: Here SamPler, a novel strategy to select parameters for gene functional annotation routines, is presented. This semi-automated method is based on the manual curation of a randomly selected set of genes/proteins. Then, in a multi-dimensional array, this sample is used to assess the automatic annotations for all possible combinations of the algorithm’s parameters. These assessments allow creating an array of confusion matrices, for which several metrics are calculated (accuracy, precision and negative predictive value) and used to reach optimal values for the parameters. Conclusions: The potential of this methodology is demonstrated with four genome functional annotations performed in merlin, an in-house user-friendly computational framework for genome-scale metabolic annotation and model reconstruction. For that, SamPler was implemented as a new plugin for the merlin tool.
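The three metrics named in this abstract (accuracy, precision and negative predictive value) can be computed from a confusion matrix as in the sketch below. It is not SamPler's implementation: the single score threshold, the input layout, and all function names are assumptions made for illustration, standing in for whatever parameters the annotation routine actually exposes.

```python
def confusion_counts(scores, curated, threshold):
    """Compare automatic acceptance (score >= threshold) with manual curation.

    `scores` are hypothetical annotation scores; `curated` holds True where
    the curator confirmed the automatic annotation. Returns (tp, fp, tn, fn).
    """
    tp = fp = tn = fn = 0
    for score, confirmed in zip(scores, curated):
        accepted = score >= threshold
        if accepted and confirmed:
            tp += 1
        elif accepted and not confirmed:
            fp += 1
        elif not accepted and confirmed:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision and negative predictive value (None if undefined)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "npv": tn / (tn + fn) if (tn + fn) else None,
    }


def scan_thresholds(scores, curated, thresholds):
    """Evaluate every candidate threshold on the curated sample."""
    return {t: metrics(*confusion_counts(scores, curated, t)) for t in thresholds}


if __name__ == "__main__":
    sample_scores = [0.9, 0.8, 0.4, 0.2]
    sample_curated = [True, False, True, False]
    for t, m in scan_thresholds(sample_scores, sample_curated, [0.3, 0.5, 0.85]).items():
        print(t, m)
```

With more than one parameter, the same scan would run over every combination of values, giving the multi-dimensional array of confusion matrices that the abstract describes.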
- Supplementary Content
196
- 10.1371/journal.pgen.0020062
- Apr 1, 2006
- PLoS Genetics
The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of cDNAs still remained to be collected and identified. To pursue the complete gene catalog that covers all predicted mouse genes, cloning and sequencing of full-length enriched cDNAs has been continued since FANTOM2. In FANTOM3, 42,031 newly isolated cDNAs were subjected to functional annotation, and the annotation of 4,347 FANTOM2 cDNAs was updated. To accomplish accurate functional annotation, we improved our automated annotation pipeline by introducing new coding sequence prediction programs and developed a Web-based annotation interface for simplifying the annotation procedures to reduce manual annotation errors. Automated coding sequence and function prediction was followed with manual curation and review by expert curators. A total of 102,801 full-length enriched mouse cDNAs were annotated. Out of 102,801 transcripts, 56,722 were functionally annotated as protein coding (including partial or truncated transcripts), providing to our knowledge the greatest current coverage of the mouse proteome by full-length cDNAs. The total number of distinct non-protein-coding transcripts increased to 34,030. The FANTOM3 annotation system, consisting of automated computational prediction, manual curation, and final expert curation, facilitated the comprehensive characterization of the mouse transcriptome, and could be applied to the transcriptomes of other species.
- Single Report
- 10.2172/1056641
- Mar 14, 2009
Lignocellulosic material accounts for a large percentage of the material that can be utilized for biofuels. The most costly part of lignocellulosic material processing is the initial hydrolysis of the wood, which is needed to circumvent the lignin barrier and the crystallinity of cellulose. Enzymes will play an increased role in this conversion in that they potentially provide an alternative to costly and caustic high temperature and acid treatment. The increasing use of enzymes in biotechnology is facilitated both by continued improvements in enzyme technology and by the discovery of new and novel enzymes. The present proposal is aimed at identifying the enzymes which are known to depolymerize woody biomass. Fundamental understanding of how nature gains access to cellulose and hemicellulose will impact all applications. Because fungi are the only known microbes capable of circumventing the lignin barrier, knowledge of the enzymes they use is of great potential for biofuel processing. Nature has evolved different fungal mechanisms for enzymatic hydrolysis of wood. Most notable are the white-rot fungi (wrf) and the brown-rot fungi (brf). This proposed research aims at determining the complete transcriptome of three wrf and two brf to determine the enzymes involved in lignocellulose degradation. The transcriptome work will be supported by enzyme characterization (and zymograms) and finally analysis of the lignin component to determine the mode of lignin modification. In this proposed research, we hypothesize that: 1) Determination of the complete transcriptome of closely related white and brown rot fungi will lead to knowledge of the relevant enzymes involved in wood degradation. 2) Knowledge of the extracellular transcriptome and the mechanism of wood decay can only be obtained if the products of the decay are known. As such, characterization of the lignin oxidation products will correlate the enzymes involved (obtained from the transcriptome) to the lignin oxidation products. The Department of Energy has sequenced the P. chrysosporium genome and has approved the sequencing of the genome of the closely-related brown rot fungus P. placenta. This comparative genomics approach will yield important information on differences between these two fungi. Analysis of genes unique to each fungus (which have been lost or gained) can potentially lead to determining the enzymes which are responsible for each type of decay. This comparison, however, would not be complete without comparing the transcriptome and the proteome/enzymes. Comparative genomics may tell us which genes may be important, but it will not tell us when these genes are expressed or at what levels, and will not necessarily tell us what these genes do.
- Research Article
- 10.1093/database/baaf002
- Feb 12, 2025
- Database: The Journal of Biological Databases and Curation
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been circulating and adapting within the human population for >4 years. A large number of mutations have occurred in the viral genome, resulting in significant variants known as variants of concern (VOCs) and variants of interest (VOIs). The spike (S) protein harbors many of the characteristic mutations of VOCs and VOIs, and significant efforts have been made to explore functional effects of the mutations in the S protein, which can cause or contribute to viral infection, transmission, immune evasion, pathogenicity, and illness severity. However, the knowledge and understanding are dispersed throughout various publications, and there is a lack of a well-structured database for functional annotation that is based on manual curation. AnnCovDB is a database that provides manually curated functional annotations for mutations in the S protein of SARS-CoV-2. Mutations in the S protein carried by at least 8000 variants in the GISAID were chosen, and the mutations were then utilized as query keywords to search in the PubMed database. From the retrieved publications, 2093 annotation entities for 205 single mutations and 93 multiple mutations were manually curated. These entities were organized into multilevel hierarchical categories for user convenience. For example, one annotation entity of N501Y mutation was ‘Infectious cycle➔Attachment➔ACE2 binding affinity➔Increase’. AnnCovDB can be used to query specific mutations and browse through function annotation entities. Database URL: https://AnnCovDB.app.bio-it.tech/
- Research Article
17
- 10.1504/ijewm.2013.050637
- Jan 1, 2013
- International Journal of Environment and Waste Management
White-rot fungi are extensively used in biotechnological processes, but little is known about the disposal of fungal biomass after its use. Final product stability parameters (self-heating test and respiration index) indicate that co-composting of the white-rot fungus Trametes versicolor with Organic Fraction of Municipal Solid Wastes (OFMSW) ensures a more stable final product than that obtained in OFMSW composting. Results suggested that the absence of fungus in the final product is probably owing to the thermophilic temperatures achieved during the composting process. These results indicate that composting may be extended to other residual biomass produced in biotechnological processes with white-rot fungi, considering spent biomass as a useful resource and minimising its risks for soil application.
- Book Chapter
4
- 10.5772/23724
- Nov 2, 2011
After a genome is assembled, the next step is genomic annotation, which can generate data that will allow various types of research on the model organism. Complete DNA sequences of the organism are then mapped in areas pertinent to the research objectives. In this chapter, we explore relevant ongoing research on genes and consider the gene as a basic mapping unit. Gene prediction is the first hurdle we come across to begin the extensive and intensive work demonstrated in the first item, which deals with assembly of the genome. Gene prediction can be made with computational techniques for recognizing gene sequences, including stop codons and the initial portions of nucleotide sequences; it involves empirical rules concerning minimum coding sequences (CDSs) and is limited by overlapping sequences coding forward and reverse. Finishing the computational gene prediction step initiates the functional annotation stage. Functional annotation, item 3, can be done initially by computer, using similarity in sequence alignment. However, no software is capable of generating a functional annotation without many false positive results, since conserved protein domains with varied functions make gene sequence alignment difficult. In this case, after automatic annotation, the predicted genes need to be revised manually. In manual curation, item 4, an expert can more accurately locate frameshifts in the DNA strand. Depending on the number of errors found, genomic annotation may be postponed, requiring a return to the previous stage of genome assembly. In manual curation, the principal contributions are usually correction of the start codon position, gene name, gene product and, finally, identification of frameshifts. When functional annotation is completed, the genome should subsequently be submitted. Submission occurs after the assembly and annotation steps, making the data generated available in public-access databanks. Submission is a pre-requisite for publication in scientific journals. Another advantage of genome publication in public-access sites is that it permits use of various genome analysis tools. Examples include searches for genomic plasticity, pangenomic study, exported antigens and evaluation of innate and adaptive immune responses. In the pangenome approach, item 5, concepts of species can be used as a filter for targeting candidates for vaccines, diagnostic kits and drug development. For drug development, the
- Research Article
349
- 10.1093/nar/gkh008
- Jan 1, 2004
- Nucleic Acids Research
The SWISS-MODEL Repository is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The Repository currently contains about 300,000 three-dimensional models for sequences from the Swiss-Prot and TrEMBL databases. The content of the Repository is updated on a regular basis incorporating new sequences, taking advantage of new template structures becoming available and reflecting improvements in the underlying modelling algorithms. Each entry consists of one or more three-dimensional protein models, the superposed template structures, the alignments on which the models are based, a summary of the modelling process and a force field based quality assessment. The SWISS-MODEL Repository can be queried via an interactive website at http://swissmodel.expasy.org/repository/. Annotation and cross-linking of the models with other databases, e.g. Swiss-Prot on the ExPASy server, allow for seamless navigation between protein sequence and structure information. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated three-dimensional protein models generated by automated homology modelling, bridging the gap between sequence and structure databases.
- Research Article
90
- 10.1371/journal.pone.0175528
- Apr 10, 2017
- PLoS ONE
Innovative green technologies are of importance for converting plant wastes into renewable sources for materials, chemicals and energy. However, recycling agricultural and forestry wastes is a challenge. A solution may be found in the forest. Saprotrophic white-rot fungi are able to convert dead plants into consumable carbon sources. Specialized fungal enzymes can be utilized for breaking down hard plant biopolymers. Thus, understanding the enzymatic machineries of such fungi gives us hints for the efficient decomposition of plant materials. Using the saprotrophic white-rot fungus Pycnoporus coccineus as a fungal model, we examined the dynamics of transcriptomic and secretomic responses to different types of lignocellulosic substrates at two time points. Our integrative omics pipeline (SHIN+GO) enabled us to compress layers of biological information into simple heatmaps, allowing for visual inspection of the data. We identified co-regulated genes with corresponding co-secreted enzymes, and the biological roles were extrapolated with the enriched Carbohydrate-Active Enzyme (CAZymes) and functional annotations. We observed the fungal early responses for the degradation of lignocellulosic substrates, including: 1) simultaneous expression of CAZy genes and secretion of the enzymes acting on diverse glycosidic bonds in cellulose, hemicelluloses and their side chains or lignin (i.e. hydrolases, esterases and oxido-reductases); 2) the key role of lytic polysaccharide monooxygenases (LPMO); 3) the early transcriptional regulation of lignin active peroxidases; 4) the induction of detoxification processes dealing with biomass-derived compounds; and 5) the frequent attachments of the carbohydrate binding module 1 (CBM1) to enzymes from the lignocellulose-responsive genes. Our methods for combining omics data and the related biological findings may contribute to the knowledge of fungal systems biology and facilitate the optimization of fungal enzyme cocktails for various industrial applications.
- Research Article
20
- 10.1007/s00335-010-9296-0
- Oct 30, 2010
- Mammalian Genome
The innate immune responses mediated by Toll-like receptors (TLR) provide an evolutionarily well-conserved first line of defense against microbial pathogens. In the Reactome Knowledgebase we previously integrated annotations of human TLR molecular functions with those of over 4000 other human proteins involved in processes such as adaptive immunity, DNA replication, signaling, and intermediary metabolism, and have linked these annotations to external resources, including PubMed, UniProt, EntrezGene, Ensembl, and the Gene Ontology, to generate a resource suitable for data mining, pathway analysis, and other systems biology approaches. We have now used a combination of manual expert curation and computer-based orthology analysis to generate a set of annotations for TLR molecular function in the chicken (Gallus gallus). Mammalian and avian lineages diverged approximately 300 million years ago, and the avian TLR repertoire consists of both orthologs and distinct new genes. The work described here centers on the molecular biology of TLR3, the host receptor that mediates responses to viral and other double-stranded polynucleotides, as a paradigm for our approach to integrated manual and computationally based annotation and data analysis. It tests the quality of computationally generated annotations projected from human onto other species and supports a systems biology approach to analysis of virus-activated signaling pathways and identification of clinically useful antiviral measures.