Abstract

Rapid progress in sequencing during the last few years has led to the publication of the sequences of around 70 complete genomes. At least twice that number is expected to be completed by the end of 2002. The availability of complete genomic data allows a whole range of new large-scale experiments, such as the generation of whole-genome gene expression data, high-throughput protein identification by mass spectrometry, and analysis of protein–protein interactions by two-hybrid systems, phage display, tandem affinity purification or other methods. These and other types of high-throughput experiments all produce large quantities of data, which must be stored in robust databases to enable their analysis and exploitation. Much of this data is currently spread over many databases with differing structures and locations, making it difficult for users to have an integrated view of the information. To cope with data of such magnitude and complexity, higher interoperability of databases is essential. Traditionally, data distribution in the life science domain takes the form of exchanges of 'flat files', i.e., ASCII text files in a database-specific format. Commercial and academic data providers and end users retrieve these data sets, write tools to parse them, and reformat them, usually into their own format, to access them with their internal analysis tools. Due to the dramatic increase in the quantity and complexity of biological data, it is clear that distribution and storage of data in flat files will have to be replaced in the future by more appropriate systems. Various initiatives in the domain, e.g. SRS [4] or Entrez [3], are focused on access to internal resources, concentrating all the data at one central site and thus offering integrated views. The biggest drawback of such an approach is that these resources can only offer up-to-date data for data collections maintained on site.
Data from external providers integrated in such a system will never be up-to-date, and updating and maintaining local copies of external data collections in such centralised databases or data warehouses is a major task. Another approach is the federation of different databases, each located at a different centre. With all major molecular biology databases available on the web, this has happened, at a very low level, through the use of database cross-references providing links from one database to one or many other related resources. This linking is initially easy to achieve,

Comparative and Functional Genomics. Comp Funct Genom 2002; 3: 47–50. DOI: 10.1002/cfg.133
