GenoSets: visual analytic methods for comparative genomics.

Aurora A Cain, Robert Kosara, Cynthia J Gibas, Gajendra P S Raghava

doi:10.1371/journal.pone.0046401

Abstract

Many important questions in biology are, fundamentally, comparative, and this extends to our analysis of a growing number of sequenced genomes. Existing genomic analysis tools are often organized around literal views of genomes as linear strings. Even when information is highly condensed, these views grow cumbersome as larger numbers of genomes are added. Data aggregation and summarization methods from the field of visual analytics can provide abstracted comparative views, suitable for sifting large multi-genome datasets to identify critical similarities and differences. We introduce a software system for visual analysis of comparative genomics data. The system automates the process of data integration, and provides the analysis platform to identify and explore features of interest within these large datasets. GenoSets borrows techniques from business intelligence and visual analytics to provide a rich interface of interactive visualizations supported by a multi-dimensional data warehouse. In GenoSets, visual analytic approaches are used to enable querying based on orthology, functional assignment, and taxonomic or user-defined groupings of genomes. GenoSets links this information together with coordinated, interactive visualizations for both detailed and high-level categorical analysis of summarized data. GenoSets has been designed to simplify the exploration of multiple genome datasets and to facilitate reasoning about genomic comparisons. Case examples are included showing the use of this system in the analysis of 12 Brucella genomes. GenoSets software and the case study dataset are freely available at http://genosets.uncc.edu. We demonstrate that the integration of genomic data using a coordinated multiple view approach can simplify the exploration of large comparative genomic data sets, and facilitate reasoning about comparisons and features of interest.

Highlights

To make sense of genomic sequence data, genomes are annotated with information that can include results from the application of computational tools and from laboratory experiments.
We demonstrate that the integration of genomic data using a coordinated multiple view approach can simplify the exploration of large comparative genomic data sets, and facilitate reasoning about comparisons and features of interest.
GenoSets currently supports annotation parsing to establish the content of the genome, ortholog clustering to establish consistent gene definitions across an entire set of genomes, and Gene Ontology (GO) term assignment to further categorize gene content based on apparent function.

Summary

Introduction

Background

To make sense of genomic sequence data, genomes are annotated with information that can include results from the application of computational tools and from laboratory experiments. This layered set of information describes the location of features, their similarities with other known features, and their functional and contextual properties. A comparative analysis system must support two major types of operations: defining regions on a single genome based on some property or content information (annotative operations), and defining relationships between regions in one or more genomes based on a comparative analysis (comparative operations).
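The distinction between annotative and comparative operations can be made concrete with a small sketch. The Python below is illustrative only and is not the GenoSets implementation: the `Feature` record, the two function names, the genome and cluster names, and the sample values are all assumptions introduced here, and the GO identifiers serve purely as placeholder labels. It assumes each annotated feature carries a genome name, coordinates, an ortholog cluster label, and a set of GO terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical annotated region on one genome (not a GenoSets type)."""
    genome: str
    start: int
    end: int
    ortholog_cluster: str
    go_terms: frozenset


def annotative_select(features, genome, go_term):
    """Annotative operation: pick regions on a single genome by a property."""
    return [f for f in features if f.genome == genome and go_term in f.go_terms]


def comparative_shared_clusters(features, genomes):
    """Comparative operation: ortholog clusters found in every listed genome."""
    per_genome = {g: set() for g in genomes}
    for f in features:
        if f.genome in per_genome:
            per_genome[f.genome].add(f.ortholog_cluster)
    if not per_genome:
        return set()
    return set.intersection(*per_genome.values())


# Toy data; genome names, coordinates, and labels are made up for illustration.
features = [
    Feature("genome_A", 100, 400, "cluster_1", frozenset({"GO:0006810"})),
    Feature("genome_A", 900, 1500, "cluster_2", frozenset({"GO:0003677"})),
    Feature("genome_B", 250, 550, "cluster_1", frozenset({"GO:0006810"})),
    Feature("genome_B", 2000, 2600, "cluster_3", frozenset()),
]

labelled = annotative_select(features, "genome_A", "GO:0006810")
shared = comparative_shared_clusters(features, ["genome_A", "genome_B"])
```

Here `labelled` holds the single `genome_A` feature carrying the chosen GO label, and `shared` holds the one cluster present in both genomes. A full system would typically back such queries with a database rather than in-memory lists; the sketch only separates the two kinds of operation.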

Methods

Results

Conclusion