Brownotate, a Comprehensive Solution to Generate Protein Sequence Databases for Any Species.

Abstract

Proteomics is strengthening research in biology and the diversification of the model organisms studied is very promising for fully understanding the complexity of biological principles. However, the lack of protein sequence databases for many species is a major bottleneck. Existing computational solutions are usually incomplete and/or only usable by bioinformaticians. We have built an open-source, user-friendly pipeline, called Brownotate, which allows anyone to generate protein sequence databases for any species as long as sequencing information is available. The pipeline can extract already existing protein sequences, but also automatically annotate any genome assembly or assemble and annotate any DNA sequence dataset. By testing the pipeline with numerous sequencing and assembly datasets covering a large part of the phylogenetic tree, we show that Brownotate generates fragmented but good quality assemblies and good quality annotations when compared to reference data. By comparing the use of protein databases generated by Brownotate or downloaded from NCBI to interpret proteomic data, we show very comparable results. The Brownotate pipeline is, therefore, an important new addition to the proteomics toolbox. The pipeline and its web interface are freely available at https://github.com/LSMBO/Brownotate and https://github.com/LSMBO/brownotate-app, respectively.

SUMMARY: This study evaluated the performance of a newly developed pipeline, Brownotate, for the assembly and annotation of sequencing data for multiple species, from prokaryotes to eukaryotes. We compared the resulting assemblies and annotations with reference data in terms of fragmentation level (assemblies) and completeness based on evolutionary expectations of gene content, and we evaluated the overlap between the predicted and reference proteins. Brownotate generated fragmented, slightly less complete assemblies. However, the overlap of predicted proteins was very good, despite an excess of small predicted sequences with Brownotate. In addition, the interpretation of proteomics data downloaded from the PRIDE repository for 27 species led to very similar results regardless of the origin of the protein sequence database used, whether it was generated by Brownotate or downloaded from NCBI. Brownotate, made available to the community, will, therefore, be a tool of choice to mitigate the lack of an appropriate protein sequence database for many species, and will allow proteomists to analyse samples from species for which only sequencing data are available without delay.

Similar Papers
  • Research Article
  • Cite Count Icon 991
  • 10.1074/mcp.r500012-mcp200
Interpretation of Shotgun Proteomic Data
  • Jul 11, 2005
  • Molecular & Cellular Proteomics
  • Alexey I Nesvizhskii + 1 more

The shotgun proteomic strategy based on digesting proteins into peptides and sequencing them using tandem mass spectrometry and automated database searching has become the method of choice for identifying proteins in most large scale studies. However, the peptide-centric nature of shotgun proteomics complicates the analysis and biological interpretation of the data especially in the case of higher eukaryote organisms. The same peptide sequence can be present in multiple different proteins or protein isoforms. Such shared peptides therefore can lead to ambiguities in determining the identities of sample proteins. In this article we illustrate the difficulties of interpreting shotgun proteomic data and discuss the need for common nomenclature and transparent informatic approaches. We also discuss related issues such as the state of protein sequence databases and their role in shotgun proteomic analysis, interpretation of relative peptide quantification data in the presence of multiple protein isoforms, the integration of proteomic and transcriptional data, and the development of a computational infrastructure for the integration of multiple diverse datasets.

  • Research Article
  • Cite Count Icon 29
  • 10.1074/mcp.m112.019471
A Proteogenomic Survey of the Medicago truncatula Genome
  • Oct 1, 2012
  • Molecular & Cellular Proteomics
  • Jeremy D Volkening + 9 more

Peptide sequencing by computational assignment of tandem mass spectra to a database of putative protein sequences provides an independent approach to confirming or refuting protein predictions based on large-scale DNA and RNA sequencing efforts. This use of mass spectrometrically-derived sequence data for testing and refining predicted gene models has been termed proteogenomics. We report herein the application of proteogenomic methodology to a database of 10.9 million tandem mass spectra collected over a period of two years from proteolytically generated peptides isolated from the model legume Medicago truncatula. These spectra were searched against a database of predicted M. truncatula protein sequences generated from public databases, in silico gene model predictions, and a whole-genome six-frame translation. This search identified 78,647 distinct peptide sequences, and a comparison with the publicly available proteome from the recently published M. truncatula genome supported translation of 9,843 existing gene models and identified 1,568 novel peptides suggesting corrections or additions to the current annotations. Each supporting and novel peptide was independently validated using mRNA-derived deep sequencing coverage and an overall correlation of 93% between the two data types was observed. We have additionally highlighted examples of several aspects of structural annotation for which tandem MS provides unique evidence not easily obtainable through typical DNA or RNA sequencing. Proteogenomic analysis is a valuable and unique source of information for the structural annotation of genomes and should be included in such efforts to ensure that the genome models used by biologists mirror as accurately as possible what is present in the cell.

  • Research Article
  • Cite Count Icon 38
  • 10.1074/mcp.o111.015149
A Mass Spectrometry Proteomics Data Management Platform
  • Sep 1, 2012
  • Molecular & Cellular Proteomics
  • Vagisha Sharma + 3 more

Mass spectrometry-based proteomics is increasingly being used in biomedical research. These experiments typically generate a large volume of highly complex data, and the volume and complexity are only increasing with time. There exist many software pipelines for analyzing these data (each typically with its own file formats), and as technology improves, these file formats change and new formats are developed. Files produced from these myriad software programs may accumulate on hard disks or tape drives over time, with older files being rendered progressively more obsolete and unusable with each successive technical advancement and data format change. Although initiatives exist to standardize the file formats used in proteomics, they do not address the core failings of a file-based data management system: (1) files are typically poorly annotated experimentally, (2) files are "organically" distributed across laboratory file systems in an ad hoc manner, (3) files formats become obsolete, and (4) searching the data and comparing and contrasting results across separate experiments is very inefficient (if possible at all). Here we present a relational database architecture and accompanying web application dubbed Mass Spectrometry Data Platform that is designed to address the failings of the file-based mass spectrometry data management approach. The database is designed such that the output of disparate software pipelines may be imported into a core set of unified tables, with these core tables being extended to support data generated by specific pipelines. Because the data are unified, they may be queried, viewed, and compared across multiple experiments using a common web interface. Mass Spectrometry Data Platform is open source and freely available at http://code.google.com/p/msdapl/.

  • Research Article
  • Cite Count Icon 1
  • 10.1186/s12866-021-02225-y
WeFaceNano: a user-friendly pipeline for complete ONT sequence assembly and detection of antibiotic resistance in multi-plasmid bacterial isolates
  • Jun 7, 2021
  • BMC Microbiology
  • Astrid P Heikema + 4 more

Background: Bacterial plasmids often carry antibiotic resistance genes and are a significant factor in the spread of antibiotic resistance. The ability to completely assemble plasmid sequences would facilitate the localization of antibiotic resistance genes, the identification of genes that promote plasmid transmission and the accurate tracking of plasmid mobility. However, the complete assembly of plasmid sequences using the currently most widely used sequencing platform (Illumina-based sequencing) is restricted due to the generation of short sequence lengths. The long-read Oxford Nanopore Technologies (ONT) sequencing platform overcomes this limitation. Still, the assembly of plasmid sequence data remains challenging due to software incompatibility with long-reads and the error rate generated using ONT sequencing. Bioinformatics pipelines have been developed for ONT-generated sequencing but require computational skills that frequently are beyond the abilities of scientific researchers. To overcome this challenge, the authors developed ‘WeFaceNano’, a user-friendly Web interFace for rapid assembly and analysis of plasmid DNA sequences generated using the ONT platform. WeFaceNano includes: a read statistics report; two assemblers (Miniasm and Flye); BLAST searching; the detection of antibiotic resistance- and replicon genes and several plasmid visualizations. A user-friendly interface displays the main features of WeFaceNano and gives access to the analysis tools. Results: Publicly available ONT sequence data of 21 plasmids were used to validate WeFaceNano, with plasmid assemblages and anti-microbial resistance gene detection being concordant with the published results. Interestingly, the “Flye” assembler with “meta” settings generated the most complete plasmids. Conclusions: WeFaceNano is a user-friendly open-source software pipeline suitable for accurate plasmid assembly and the detection of anti-microbial resistance genes in (clinical) samples where multiple plasmids can be present.

  • Single Book
  • Cite Count Icon 4
  • 10.1007/978-3-0348-5678-2
Methods in Protein Sequence Analysis
  • Jan 1, 1991

Methods in protein sequence analysis constitute important fields in rapid progress. We have experienced a continuous increase in analytical sensitivity coupled with decreases in time necessary for pur

  • Research Article
  • Cite Count Icon 223
  • 10.1385/0-89603-246-9:307
Using the FASTA program to search protein and DNA sequence databases.
  • Jan 1, 1994
  • Methods in molecular biology (Clifton, N.J.)
  • William R Pearson

As this volume illustrates, computers have become an integral tool in the analysis of DNA and protein sequence data. One of the most popular applications of computers in modern molecular biology is to characterize newly determined sequences by searching DNA and protein sequence databases. The FASTA program is widely used for such searches, because it is fast, sensitive, and readily available. FASTA is available as part of a package of programs that construct local and global sequence alignments. This chapter will describe a number of simple applications of FASTA and other programs in the FASTA package. This chapter focuses on the steps required to run the programs, rather than on the interpretation of the results of a FASTA search. For a more complete description of FASTA and related programs for identifying distantly related DNA and protein sequences, for evaluating the statistical significance of sequence similarities, and for identifying similar structures in DNA and protein sequences see ref. .

  • Research Article
  • Cite Count Icon 32
  • 10.1002/pmic.200600032
A database of unique protein sequence identifiers for proteome studies
  • Aug 1, 2006
  • PROTEOMICS
  • György Babnigg + 1 more

In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
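The idea of a sequence-derived identifier can be illustrated with a minimal Python sketch. It assumes the commonly described SEGUID construction (a SHA-1 digest of the upper-cased sequence, base64-encoded, with the trailing padding removed); the abstract above does not spell out these details, so treat them as an assumption. The whitespace stripping, the helper `group_by_seguid`, and the accessions used are hypothetical and serve only as illustration.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a SEGUID-style identifier for a protein sequence.

    Assumed construction: SHA-1 of the upper-cased sequence,
    base64-encoded, with trailing '=' padding stripped.
    """
    # Assumption: remove whitespace/newlines so FASTA line wraps do not matter.
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def group_by_seguid(entries):
    """Group (accession, sequence) pairs that share the same sequence."""
    groups = {}
    for accession, sequence in entries:
        groups.setdefault(seguid(sequence), []).append(accession)
    return groups


if __name__ == "__main__":
    # Hypothetical accessions from two databases; the first two sequences are identical.
    entries = [
        ("dbA|P1", "MKTAYIAKQR"),
        ("dbB|X9", "mktayiakqr"),
        ("dbA|P2", "MKTAYIAKQL"),
    ]
    for key, accessions in group_by_seguid(entries).items():
        print(key, accessions)
```

Because the identifier depends only on the sequence, entries from different databases that carry the same protein fall into the same group, regardless of their database-specific accessions.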

  • Research Article
  • Cite Count Icon 111
  • 10.1021/ac971157l
Identifying proteins using matrix-assisted laser desorption/ionization in-source fragmentation data combined with database searching.
  • Feb 1, 1998
  • Analytical Chemistry
  • Duane C Reiber + 2 more

Metastable ion decay in matrix-assisted laser desorption/ionization (MALDI) has become a routine method for obtaining primary structures of peptides. Significant fragmentation occurs in the MALDI ion source and can be observed via delayed ion extraction TOF-MS. In-source decay (ISD) can provide C- and N-terminal primary sequence data for even moderate-sized peptides (< 5000 Da). The unique cn series fragmentation that occurs in ISD has been exploited to obtain partial C-terminal sequences for proteins as large as human apotransferrin (75 kDa). Two approaches for combining this ISD MALDI-generated partial sequence information with protein database searching techniques are presented. In one approach, cyanogen bromide is used to cleave relatively large peptide fragments from a sample of human apotransferrin. One of the larger cleavage products (6034.84 Da) was isolated by HPLC and subjected to ISD MALDI analysis. An easily identified cn fragment ion series allowed two noncontiguous segments of the peptide's sequence to be determined (about 55% of the total sequence). This partial sequence information was used to search protein and oligonucleotide sequence databases. In addition to uniquely identifying human apotransferrin in a protein sequence database, an example of the use of this ISD MALDI-determined partial sequence information to search expressed sequence tag databases is presented. Such searches have the potential for rapidly identifying new genes that code for target proteins. An alternate approach for obtaining partial sequence information on proteins is also demonstrated that utilizes ISD MALDI fragmentation of the intact protein to generate partial sequence information. This approach is shown to generate about 5-7% of a protein's sequence, usually near the C-terminus of the protein. Examples of the ISD MALDI fragmentation data obtained from intact (reduced) human apotransferrin and intact (nonreduced) bovine serum albumin (66 kDa) proteins are presented.

  • Research Article
  • Cite Count Icon 30
  • 10.1186/1471-2164-7-300
Pfarao: a web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase)
  • Nov 29, 2006
  • BMC Genomics
  • Florian Odronitz + 1 more

Background: Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. To date, there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Description: Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database-driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows users to conveniently browse the database and to compile tabular and graphical summaries of its content. Conclusion: We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein.

  • Research Article
  • Cite Count Icon 51
  • 10.1093/nar/25.1.24
The Protein Information Resource (PIR) and the PIR-International Protein Sequence Database.
  • Jan 1, 1997
  • Nucleic acids research
  • D G George + 13 more

From its origin, the PIR has aspired to support research in computational biology and genomics through the compilation of a comprehensive, quality controlled and well-organized protein sequence information resource. The resource originated with the pioneering work of the late Margaret O. Dayhoff in the early 1960s. Since 1988, the Protein Sequence Database has been maintained collaboratively by PIR-International, an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. The work of the resource is widely distributed and is available on the World Wide Web, via FTP, E-mail server, CD-ROM and magnetic media. It is widely redistributed and incorporated into many other protein sequence data compilations including SWISS-PROT and the Entrez system of the NCBI.

  • Book Chapter
  • Cite Count Icon 10
  • 10.1007/978-3-319-11056-1_6
A Novel Hybridized Rough Set and Improved Harmony Search Based Feature Selection for Protein Sequence Classification
  • Jan 1, 2015
  • M Bagyamathi + 1 more

The progress in bio-informatics and biotechnology has generated a large amount of sequence data that requires detailed analysis. Recent advances in future generation sequencing technologies have resulted in a tremendous rise in the rate at which protein sequence data are being obtained. Big Data analysis is a clear bottleneck in many applications, especially in the field of bio-informatics, because of the complexity of the data that needs to be analyzed. Protein sequence analysis is a significant problem in functional genomics. Proteins play an essential role in organisms as they perform many important tasks in their cells. In general, protein sequences are represented by feature vectors. A major problem of protein datasets is the complexity of their analysis due to their enormous number of features. Feature selection techniques are capable of dealing with this high dimensional space of features. In this chapter, a new feature selection algorithm that combines the Improved Harmony Search algorithm with Rough Set theory for protein sequences is proposed to successfully tackle the big data problems. The Improved Harmony Search (IHS) algorithm is a comparatively new population based meta-heuristic optimization algorithm. This approach imitates the music improvisation process, where each musician improvises their instrument’s pitch by seeking a perfect state of harmony, and it overcomes the limitations of the traditional harmony search (HS) algorithm. The Improved Harmony Search is hybridized with Rough Set Quick Reduct for faster and better search capabilities. The feature vectors are extracted from a protein sequence database, based on amino acid composition and K-mer patterns or K-tuples, and then feature selection is carried out on the extracted feature vectors. The proposed algorithm is compared with two prominent algorithms, Rough Set Quick Reduct and Rough Set based PSO Quick Reduct. The experiments are carried out on protein primary single sequence data sets that are derived from PDB on SCOP classification, based on the structural class predictions such as all α, all β, all α + β and all α/β. The feature subsets of the protein sequences predicted by both existing and proposed algorithms are analyzed with the decision tree classification algorithms.

Keywords: Data Mining; Big Data Analysis; Bioinformatics; Feature Selection; Protein Sequence; Rough Set; Particle Swarm Optimization; Harmony Search; Protein sequence classification

  • Research Article
  • Cite Count Icon 41
  • 10.1093/bioinformatics/btp366
Automated protein (re)sequencing with MS/MS and a homologous database yields almost full coverage and accuracy
  • Jun 17, 2009
  • Bioinformatics
  • Xiaowen Liu + 3 more

Bottom-up tandem mass spectrometry (MS/MS) is regularly used in proteomics nowadays for identifying proteins from a sequence database. De novo sequencing software is also available for sequencing novel peptides with relatively short sequence lengths. However, automated sequencing of novel proteins from MS/MS remains a challenging problem. Very often, although the target protein is novel, it has a homologous protein included in a known database. When this happens, we propose a novel algorithm and automated software tool, named Champs, for sequencing the complete protein from MS/MS data of a few enzymatic digestions of the purified protein. Validation with two standard proteins showed that our automated method yields >99% sequence coverage and 100% sequence accuracy on these two proteins. Our method is useful to sequence novel proteins or 're-sequence' a protein that has mutations compared with the database protein sequence.

  • Research Article
  • Cite Count Icon 103
  • 10.1074/mcp.m110.006536
A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics
  • Mar 9, 2011
  • Molecular & Cellular Proteomics
  • Jing Li + 8 more

Shotgun proteomics data analysis usually relies on database search. However, commonly used protein sequence databases do not contain information on protein variants and thus prevent variant peptides and proteins from being identified. Including known coding variations into protein sequence databases could help alleviate this problem. Based on our recently published human Cancer Proteome Variation Database, we have created a protein sequence database that comprehensively annotates thousands of cancer-related coding variants collected in the Cancer Proteome Variation Database as well as noncancer-specific ones from the Single Nucleotide Polymorphism Database (dbSNP). Using this database, we then developed a data analysis workflow for variant peptide identification in shotgun proteomics. The high risk of false positive variant identifications was addressed by a modified false discovery rate estimation method. Analysis of colorectal cancer cell lines SW480, RKO, and HCT-116 revealed a total of 81 peptides that contain either noncancer-specific or cancer-related variations. Twenty-three out of 26 variants randomly selected from the 81 were confirmed by genomic sequencing. We further applied the workflow on data sets from three individual colorectal tumor specimens. A total of 204 distinct variant peptides were detected, and five carried known cancer-related mutations. Each individual showed a specific pattern of cancer-related mutations, suggesting potential use of this type of information for personalized medicine. Compatibility of the workflow has been tested with four popular database search engines including Sequest, Mascot, X!Tandem, and MyriMatch. In summary, we have developed a workflow that effectively uses existing genomic data to enable variant peptide detection in proteomics.

  • Research Article
  • Cite Count Icon 5
  • 10.3835/plantgenome2009.02.0004let
A Genome May Reduce Your Carbon Footprint
  • Mar 1, 2009
  • The Plant Genome
  • Christian M Tobias


  • Book Chapter
  • Cite Count Icon 1
  • 10.1201/9781003099079-4
Recent Advances in Protein Bioinformatics
  • May 19, 2021
  • Mahak Tufchi + 2 more

The bioinformatics branch deals with the application of high-order computational and analytical tools to capture and analyze biological information. Applying this to the branch of protein study or proteomics offers the management, data elaboration and integration of new software packages and algorithms. A database is the collection of sequence and structure information that is featured, annotated and retrievable. The data can be searched using a search engine, updated periodically and even cross-referenced. This chapter thus deals with the current scenario of protein bioinformatics in terms of recent advances in protein sequence, structure and interaction databases along with specialised dedicated databases for model organisms. Among protein sequence databases, UniProtKB has been complemented by neXtProt (https://www.nextprot.org), a human protein knowledgebase generating data on proteomics (85%) and genetic variations in humans. The MEROPS database, a recent development in the field of protein-degrading enzymes publicly available at http://www.ebi.ac.uk/merops/, is an integrated source of all sets of information about peptidases. The PDBFlex database (http://pdbflex.org) provides information on the flexibility of protein structures by analysing their structural differences and clustering them according to their similarities. Proteins with therapeutic and signaling properties have been designed along with fluorescent proteins with novel or enhanced utility using novel structure prediction databases. Developments in the field of protein bioinformatics have led to the establishment of new databases dedicated to protein-protein interactions (PPIs). P2Rank is a fast, accurate and stable predictor of ligand binding sites in proteins and in the near future can be used for new allosteric site predictions. APID is a comprehensive repository of curated ‘protein interactomes’ accessible at http://apid.dep.usal.es. It includes 500 experimentally detected PPIs of more than 1100 organisms from nearly 30 species. The latest version of STRING (11.0) features a genome dataset as input with easy visualisation of interaction network (interactome) and thus performs gene-set enrichments. The Integrated Interactions Database (IID) is another comprehensive context-specific human PPI network available at http://ophid.utoronto.ca/iid. A recent update shows the involvement of 18 species with 4.8 million PPIs in 133 tissues. IID provides unique functionality with reduced false negatives and even supports non-human species.
