Abstract

Motivation: Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete reimplementation of the software framework, resulting in a flexible and stable system that can use multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBL-EBI FTP site and the open source code is hosted at Google Code.

Availability and implementation: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/.

Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk or mitchell@ebi.ac.uk

Highlights

  • The InterProScan software (Quevillon et al., 2005) is extensively used both by genome sequencing projects (Suen et al., 2011; Shulaev et al., 2011; Sato et al., 2011) and the UniProt Knowledgebase (UniProtKB) (The UniProt Consortium, 2012) to obtain a ‘first-pass’ profile of protein sequences’ potential functions.

  • Before describing the architecture used by the new version of InterProScan, it is necessary to explain in general terms how these analysis applications work, because this has influenced the overall design of the system.

  • Once the search results are obtained, the InterProScan in-memory database is queried to find corresponding InterPro (Hunter et al., 2012) entries, and additional database annotations, such as Gene Ontology (The Gene Ontology Consortium, 2000) terms, are associated with the results.
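
The lookup step in the last highlight can be pictured with a minimal Java sketch. It assumes that the in-memory database behaves like a map from a member database signature accession to an InterPro entry carrying its Gene Ontology terms. Every name here (`EntryLookupSketch`, `Entry`, `register`, `lookup`, `annotate`) and every accession string is a hypothetical placeholder chosen for illustration; none of it is taken from the InterProScan code base, and the real system may organise this data quite differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: no class or method name here comes from InterProScan.
class EntryLookupSketch {

    // An InterPro-style entry together with the GO terms attached to it.
    static final class Entry {
        final String accession;
        final List<String> goTerms;

        Entry(String accession, List<String> goTerms) {
            this.accession = accession;
            this.goTerms = List.copyOf(goTerms);
        }
    }

    // Assumed in-memory index: signature accession -> entry.
    private final Map<String, Entry> bySignature = new HashMap<>();

    void register(String signatureAccession, Entry entry) {
        bySignature.put(signatureAccession, entry);
    }

    // Returns the entry for a signature, or empty if none is registered.
    Optional<Entry> lookup(String signatureAccession) {
        return Optional.ofNullable(bySignature.get(signatureAccession));
    }

    // Turns raw search hits (signature accessions) into tab-separated lines:
    // signature, entry accession, GO terms joined by '|'. Unknown hits get "-".
    List<String> annotate(List<String> hits) {
        List<String> lines = new ArrayList<>();
        for (String hit : hits) {
            String line = lookup(hit)
                .map(e -> hit + "\t" + e.accession + "\t" + String.join("|", e.goTerms))
                .orElse(hit + "\t-\t-");
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        EntryLookupSketch db = new EntryLookupSketch();
        // Placeholder accessions and terms, for illustration only.
        db.register("SIG0001", new Entry("ENTRY0001", List.of("GO_TERM_A", "GO_TERM_B")));
        for (String line : db.annotate(List.of("SIG0001", "SIG9999"))) {
            System.out.println(line);
        }
    }
}
```

Keeping the index in memory means each search hit costs one hash lookup, and a hit with no registered entry is still reported rather than dropped. Whether InterProScan handles such hits in this way is not stated in the text above.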


Summary

INTRODUCTION

The InterProScan software (Quevillon et al., 2005) is extensively used both by genome sequencing projects (Suen et al., 2011; Shulaev et al., 2011; Sato et al., 2011) and the UniProt Knowledgebase (UniProtKB) (The UniProt Consortium, 2012) to obtain a ‘first-pass’ profile of protein sequences’ potential functions. It does this by combining search applications that predict protein family membership and the presence of functional domains and sites, and summarizing their outputs. This reimplementation of InterProScan addresses the weaknesses of previous versions and adds new features to the software.

SOFTWARE ARCHITECTURE
Job management
New analysis algorithm and features
Match lookup service
Installation and configuration
INTERFACES AND ACCESS
