Genomes OnLine Database Research Articles

Metagenomics has been a great boon to the field of environmental microbiology. Fuelled by major advances in sequencing technology, the number of metagenome projects has exploded in recent years, with hundreds of environmental samples having been interrogated by shotgun sequencing (Markowitz et al., 2008; Meyer et al., 2008; Liolios et al., 2009). As a result, while just a few years ago it was possible for an individual investigator to be familiar with the major shotgun metagenomic data sets, today there are far too many to easily recite. Therefore we argue that the time is ripe for developing and implementing a metagenome classification system.
Why classify metagenomes? The ability to extract, study and understand information from genomic data depends heavily on comparative analysis, and metagenomic data are no exception. Yet the appropriate comparisons to make are much less clear for metagenomes than for genomes, where the choice of comparison can be guided by phylogenetic classification. Moreover, even when the type of environmental study one wants to compare is known, it remains difficult to know how many such projects are available, and which ones, given the lack of systematic nomenclature (i.e. standardized naming) or categorization for these projects. For example, metagenomes from organisms in the digestive tracts of various animals might be named ‘gut’ but could also be ‘rumen’, ‘forestomach’, ‘caecum’ or ‘faecal’ communities.
Currently metagenomic projects are not systematically classified. NCBI’s metagenomic project catalogue has implemented a simple and general project type distinction between ‘environmental’ and ‘host-associated’ projects (named correspondingly as Ecological and Organismal). This shallow classification is a starting point but does not address the many other environmental features potentially of interest for comparison.
In order to circumvent the present difficulty in identifying appropriate metagenomic projects for comparative analysis, we present here a five-tiered metagenome naming and classification scheme. The top level includes the broad NCBI categories, but we also add a third ‘engineered’ category that separates out manipulated communities such as bioreactors or treatment plants from natural environmental communities (Fig. 1). Each of these is then subcategorized according to a variety of criteria, taking into account knowledge of key variables that influence community composition [e.g. salinity (Lozupone and Knight, 2007) or soil pH (Lauber et al., 2009)]. Where possible, we have taken advantage of existing classification systems such as the Environment Ontology (EnvO; http://www.environmentontology.org/). Environmental communities are separated by the ecosystem category (aquatic, terrestrial, air) and ecosystem type (e.g. freshwater, marine) with more detailed categorizations based on specific features (e.g. salinity, pH). Host-associated communities are defined by host phylogeny, then sampling site; and finally engineered communities are classified by their function (e.g. bioremediation or food production) with further levels based on specific substrates or features. In some cases an individual ‘project’ may span multiple categories because it includes samples from different habitat types. A sampling of the higher-level categories is shown in Table 1, and the complete proposed schema is available from GOLD (Genomes OnLine Database, http://www.genomesonline.org/cgi-bin/GOLD/bin/metagenomic_classification.cgi) and IMG/M (http://img.jgi.doe.gov/m/). Although we developed this schema to address an immediate need within these databases, we hope that it will provide the basis for a broadly
*For correspondence. E-mail nckyrpides@lbl.gov; Tel. (+1) 925 296 5718; Fax (+1) 925 296 5720. Environmental Microbiology (2010) 12(7), 1803–1805 doi:10.1111/j.1462-2920.2010.02270.x
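As a rough illustration of how a tiered scheme of this kind supports comparison, the sketch below models one classification record as a five-level path and measures how many leading levels two projects share. It is a minimal sketch, not the GOLD or IMG/M data model: the field names, the helper `shared_depth`, and the example values such as "lake", "mammals" and "rumen" are assumptions chosen for illustration. Only the three top-level categories and the category and type examples (aquatic, freshwater, marine) come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Top-level categories named in the text; everything below them is illustrative.
TOP_LEVELS = {"environmental", "host-associated", "engineered"}


@dataclass(frozen=True)
class Classification:
    """One five-level classification path (field names are hypothetical)."""
    ecosystem: str                  # tier 1: environmental / host-associated / engineered
    category: str                   # tier 2: e.g. aquatic, terrestrial, air
    ecosystem_type: str             # tier 3: e.g. freshwater, marine
    subtype: Optional[str] = None   # tier 4: placeholder for a more detailed level
    specific: Optional[str] = None  # tier 5: placeholder for a specific feature

    def __post_init__(self) -> None:
        if self.ecosystem not in TOP_LEVELS:
            raise ValueError(f"unknown top-level category: {self.ecosystem!r}")

    def path(self) -> Tuple[str, ...]:
        """Return the assigned levels, from most general to most specific."""
        levels = (self.ecosystem, self.category, self.ecosystem_type,
                  self.subtype, self.specific)
        return tuple(level for level in levels if level is not None)


def shared_depth(a: Classification, b: Classification) -> int:
    """Count how many leading levels two classifications have in common."""
    depth = 0
    for x, y in zip(a.path(), b.path()):
        if x != y:
            break
        depth += 1
    return depth


# Hypothetical example records.
lake = Classification("environmental", "aquatic", "freshwater", "lake")
ocean = Classification("environmental", "aquatic", "marine")
rumen = Classification("host-associated", "mammals", "digestive system", "rumen")

print(shared_depth(lake, ocean))  # lake and ocean diverge at the ecosystem type
print(shared_depth(lake, rumen))  # different top-level categories
```

Representing each project as an ordered path means that projects sharing a prefix can be grouped for comparison regardless of how the individual samples were originally named.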

Since the first complete genome sequence, that of the bacterium Haemophilus influenzae, was published in 1995 (1), a flurry of activity has seen the completion of the genomic sequences for more than 149 organisms (16 archaeal, 114 bacterial, and 19 eukaryotic). An up-to-date list of completed genomes is maintained on the GOLD website (Genomes OnLine Database, http://igweb.integratedgenomics.com/GOLD). Early in 2001, a major milestone was reached with the completion of the human genome sequence (2,3). A major challenge in the postgenome era will be to elucidate the biological function of the large number of novel gene products that have been revealed by the genome sequencing initiatives, to understand their role in health and disease, and to exploit this information to develop new diagnostic and therapeutic agents. The assignment of protein function will require detailed and direct analysis of the patterns of expression, interaction, localization, and structure of the proteins encoded by genomes, the area now known as “proteomics” (4). The ability of polyacrylamide gel electrophoresis (PAGE) techniques to resolve individual proteins of complex mixtures has made this group of methods indispensable to the protein biochemist. In particular, two-dimensional gel electrophoresis (2-DE) can routinely separate up to 2000 proteins from whole-cell and tissue homogenates, and with the use of large-format gels, separations of up to 10,000 proteins have been described (5,6). For this reason, 2-DE remains the core technology of choice for protein separation in the majority of proteomics projects. Combined with the currently available panel of sensitive detection methods (7) and computer analysis tools (8), this methodology provides a powerful approach to the investigation of differential protein expression.
Although gel electrophoresis procedures can characterize proteins in terms of their charge (pI), size (Mr), relative hydrophobicity, and abundance, they give no direct clues as to their identities or functions. Fortunately, over the last few years, a variety of sensitive methods have become available for the identification and characterization of proteins separated by gel electrophoresis. Conventional methods include reactivity with specific monoclonal and polyclonal antibodies, microsequencing by automated Edman degradation (9), and amino acid compositional analysis (10). More recently, techniques

Articles published on Genomes OnLine Database

Correction to: Standardized naming of microbiome samples in Genomes OnLine Database

Standardized naming of microbiome samples in Genomes OnLine Database.

Twenty-five years of Genomes OnLine Database (GOLD): data updates and new features in v.9.

Genomes OnLine Database (GOLD) v.8: overview and updates.

U.S. Federal Genomic Data Release and Access Policies

Genomes OnLine database (GOLD) v.7: updates and new features.

Genomes OnLine Database (GOLD) v.6: data updates and feature enhancements.

Mobile health applications for asthma

The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification.

The Metadata Coverage Index (MCI): A standardized metric for quantifying database metadata richness.

Analyzing large biological datasets with association networks

Tracing Lifestyle Adaptation in Prokaryotic Genomes

The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata.

A call for standardized classification of metagenome projects

Discovery and Characterization of an Amidinotransferase Involved in the Modification of Archaeal tRNA

Standards and standard-compliance

The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata.

The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata

The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide

Electroblotting of proteins from polyacrylamide gels.

