Recent proteomic studies of protein domains require high-throughput, systematic approaches. Protein domains are the modules of protein-protein interactions, and most experiments on them require gene cloning, so the first experimental step is to retrieve the DNA sequences of domain-encoding regions from databases. For large-scale proteomic research, however, manually extracting a large number of domain sequences from several interlinked databases is laborious. We present a new methodology that retrieves the DNA sequences of domain-encoding regions through automatic database cross-referencing. To extract protein domain-encoding regions, the method traverses several interconnected databases and validates the results. We applied this method to retrieve all the EGF domain-encoding DNA sequences of Homo sapiens. The algorithm was implemented using the Python library PAMIE, which makes it possible to cross-reference distinct databases automatically.

Corresponding Author: Sanguk Kim (Email: sukim@postech.ac.kr)

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2005-070-C00095) and the POSTECH BSRI research fund-2005.

Introduction

Genome projects are generating vast amounts of data that reveal the existence of thousands of new gene products, especially proteins responsible for cellular regulation. However, these data do not immediately reveal what these proteins do, nor how they are assembled into the molecular machines and functional networks that control cellular behavior (Pawson et al., 2003). Cellular processes and the overall molecular architectures of all organisms are largely mediated through elaborate scaffolds of protein-protein interactions.
Thus, high-throughput strategies for studying protein-protein interactions, such as yeast two-hybrid screening, have been developed to describe protein interaction networks and to construct protein interaction maps in model organisms (Uetz et al., 2000; Li et al., 2004; Ghavidel et al., 2005). However, because proteins interact with more than one partner at a time, large-scale protein-protein interaction data are difficult to interpret (Santonico et al., 2005). Protein domains represent the modular nature of proteins: they fold independently and often perform specific tasks. Although a protein domain can interact with several binding partners, it is a single binding module and interacts with only one partner at a time (Santonico et al., 2005). Domain knowledge can therefore help to obtain a clearer representation of protein networks.

Experiments using protein domains require the sequences of domain-encoding regions to be extracted from distinct databases for gene cloning and protein expression, and this process is often performed manually (Yu et al., 2004). For high-throughput proteomic experiments, however, manual retrieval is daunting for three reasons. First, large-scale experiments require information on hundreds or thousands of protein domains. Second, domain knowledge is not located in a single source, so one must cross-reference interconnected databases that are updated separately. Third, the iterative extraction process is error-prone because databases sometimes contain dubious entries and links that point to missing records. Proper decision-making policies are therefore essential to eliminate database entry errors and to validate the results. Consequently, a bioinformatics methodology for retrieving the genetic information of domain-encoding regions is needed to conduct large-scale proteomic research.

Bioinformatics and Biosystems 2006, Vol. 2, No. 1, pp. 94-97

Here we developed a methodology to extract protein domain-encoding DNA sequences automatically from three distinct databases, Pfam, UniProt, and GenBank (Finn et al., 2006; Wu et al., 2006; Benson et al., 2006), using the Python library PAMIE. The algorithm also includes a validation process to verify the retrieved data. We applied this method to extract all the EGF domain-encoding regions of Homo sapiens for further large-scale proteomic experiments. The EGF (Epidermal Growth Factor) domain is a widely distributed, independently folding protein module that is thought to play a general role in extracellular events such as adhesion, coagulation, and receptor-ligand interactions (Downing et al., 1996).

Figure 1. The algorithm for retrieving domain-encoding sequences through database cross-referencing
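
The text does not spell out how domain coordinates are mapped onto the coding sequence or how the retrieved data are validated, so the following is only a minimal sketch of one plausible approach, not the authors' implementation. It assumes that the domain boundaries are given as 1-based, inclusive residue positions on the protein, that the coding sequence (CDS) starts at the first codon of that protein, and that validation means translating the extracted DNA and comparing it with the expected domain peptide. The function names (`translate`, `domain_coding_region`, `validated_domain_dna`) and the toy sequences are hypothetical; the sketch does not use PAMIE and does not query any database.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 0; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def domain_coding_region(cds, start, end):
    """Return the CDS slice encoding residues start..end (1-based, inclusive)."""
    if not 1 <= start <= end:
        raise ValueError("domain boundaries must satisfy 1 <= start <= end")
    if 3 * end > len(cds):
        raise ValueError("domain extends beyond the coding sequence")
    return cds[3 * (start - 1):3 * end]


def validated_domain_dna(cds, protein, start, end):
    """Return the domain-encoding DNA, or None if it fails validation.

    Validation (an assumption of this sketch): the extracted DNA must
    translate to exactly the domain residues of the protein record.
    """
    try:
        region = domain_coding_region(cds, start, end)
    except ValueError:
        return None
    expected = protein[start - 1:end]
    return region if translate(region) == expected else None


# Toy example (made-up sequences): protein MCPEC, domain at residues 2-4.
toy_cds = "ATGTGTCCAGAATGCTAA"
print(validated_domain_dna(toy_cds, "MCPEC", 2, 4))  # prints TGTCCAGAA
```

Returning `None` on a mismatch rather than raising lets a batch run over hundreds of domains skip dubious entries and report them afterwards, which is the kind of decision policy the Introduction calls for.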