Abstract
The protein sequence space is vast and diverse, spanning many different families. Biologically meaningful relationships exist between proteins at the superfamily level, but it is highly challenging to establish them convincingly by means of simple sequence searches. A rigorous sequence search strategy is needed to establish remote homology relationships and achieve high coverage. We have used iterative profile-based methods, together with sequence motif constraints, to specify search directions. We address the importance of multiple start points (queries) for achieving high coverage at the protein superfamily level. We have devised strategies that employ a structural regime to search sequence space with good specificity and sensitivity. We employ two well-known sequence search methods, PSI-BLAST and PHI-BLAST, with multiple queries and multiple patterns to enhance homologue identification at the structural superfamily level. The study suggests that multiple queries improve sensitivity, while a pattern-constrained iterative sequence search is stringent at the initial stages, which drives the search in a specific direction while still achieving high coverage. This data mining approach has been applied to the entire structural superfamily database.
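To make the idea of a pattern constraint concrete, the sketch below converts a motif written in PROSITE-style syntax into a regular expression and checks whether a sequence contains it. This is only an illustration of how a motif can act as a filter on candidate sequences; it is not how PHI-BLAST itself is implemented. The function names, the example motif and the example sequences are all hypothetical and do not come from the study.

```python
import re

# One PROSITE-style element: optional "<" anchor, a core (x, a residue,
# an allowed set [..] or a forbidden set {..}), an optional repeat (n) or
# (n,m), and an optional ">" anchor.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style motif into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


def contains_motif(sequence: str, pattern: str) -> bool:
    """Return True if the sequence contains the motif anywhere."""
    return re.search(prosite_to_regex(pattern), sequence) is not None


# Hypothetical motif and sequences, for illustration only.
MOTIF = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R"
with_motif = "AAMGEAGLAAAAARKK"
without_motif = "AAAAAAAAAAAAAAAA"
```

Under this sketch, `contains_motif(with_motif, MOTIF)` is true and `contains_motif(without_motif, MOTIF)` is false, which mirrors how a motif constraint narrows the set of sequences considered at the start of a search.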
Highlights
Protein sequence databases have grown enormously in recent times
The sequence search strategy devised for remote homology detection was tested and implemented
In the multiple queries (MQ) approach, all the members from each of the 12 selected superfamilies were used as inputs for PSI-BLAST to search against the non-redundant protein database (NR-Db)
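The multiple queries idea can be sketched as follows: run one PSI-BLAST search per superfamily member, pool the hits, and measure how much of a reference set the pooled hits recover. The code only builds the BLAST+ `psiblast` command line and does not run it. The iteration count, E-value threshold, file names and helper function names are assumptions chosen for illustration, not the settings used in the study.

```python
from typing import Dict, Iterable, List, Set


def psiblast_command(query_fasta: str, db: str = "nr",
                     iterations: int = 5, evalue: float = 1e-3) -> List[str]:
    """Build (but do not run) a psiblast command line for a single query.

    The iteration count and E-value are illustrative assumptions.
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-outfmt", "6",                       # tabular output
        "-out", query_fasta + ".hits.tsv",
    ]


def pool_hits(hits_per_query: Dict[str, Iterable[str]]) -> Set[str]:
    """Union of the hit identifiers returned by every query."""
    pooled: Set[str] = set()
    for hits in hits_per_query.values():
        pooled.update(hits)
    return pooled


def coverage(found: Set[str], reference: Set[str]) -> float:
    """Fraction of the reference set that was recovered."""
    if not reference:
        return 0.0
    return len(found & reference) / len(reference)


# Hypothetical hit lists for two queries from the same superfamily.
example_hits = {"member1.fasta": ["P1", "P2"], "member2.fasta": ["P2", "P3"]}
reference_set = {"P1", "P2", "P3", "P4"}
single_query_coverage = coverage(set(example_hits["member1.fasta"]), reference_set)
pooled_coverage = coverage(pool_hits(example_hits), reference_set)
```

In this toy example the first query alone recovers half of the reference set, while pooling both queries recovers three quarters of it, which is the effect the multiple queries approach relies on.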
Summary
Protein sequence databases have grown enormously in recent times. Understanding protein homology within such huge sets of sequences requires tracing divergence through mutation, substitution, insertion and deletion of residues[1,2]. Homologous proteins are similar at the sequence and structural levels, which implies functional similarity[3]. This similarity extends to the superfamily level, where the ways of deducing such relationships differ between protein sequence and structure information[4,5]. Several databases organize sets of homologous proteins, or protein superfamilies, based on protein sequence and structure. These databases primarily rely on the protein domain information present in a sequence or structure. SCOP is a database that organizes protein structural domain data into hierarchical levels based on structural and functional information[6]. Structure-based classification is useful for exploring sequence space and supports functional assignment by association of protein sequences[9].