Protein Construction-Based Data Partitioning Scheme for Alignment of Protein Macromolecular Structures Through Distributed Querying in Federated Databases.

Dariusz Mrozek, Jacek Kwiendacz, Bozena Malysiak-Mrozek

doi:10.1109/tnb.2019.2930494

Open Access

Abstract

Exploration of various characteristics of 3D protein structures through querying relational databases storing the structures can be challenging due to the necessity to conform to a particular database schema. However, this also brings several advantages, like the ability to perform extensive database searches with the declarative SQL language, protect data against hardware damage through regular backup mechanisms, and secure data against unauthorized access. Since relational databases do not provide exploration methods specific to protein data and its biological semantics, like searches on the basis of protein structural patterns, the use of relational databases in this domain is still rare and requires the development of dedicated methods to increase the speed of data exploration techniques. In this paper, we show a novel data partitioning scheme for distributing data across database clusters that can be used for performing sophisticated explorations of 3D protein structures. The data partitioning scheme relies on protein construction, which requires data preprocessing but results in shorter exploration times through querying federated databases. We solve the problem of finding proteins in an Oracle relational database on the basis of the similarity of 3D protein structures with the use of distributed PAR-P3D-SQL queries. Since 3D protein structure similarity searching is one of the most time-consuming exploration processes that can be performed for protein data, we make use of a distributed environment of Oracle federated databases, distributed query processing, and dedicated load balancing methods to accelerate the exploration. Results of the performed tests confirm that we are able to significantly increase the speed of the exploration process, proportionally to the number of database nodes in the federated environment.

Highlights

As rightly pointed out by Gesing et al. [11], in many cases, peptide chains or protein conformations held by an institution, either predicted or obtained experimentally [12], may constitute intellectual property that cannot be shared through the Internet.
By implementing the PAR-P3D-SQL extension for Oracle PL/SQL we proved that sophisticated exploration of protein data is possible through various computational frameworks and in relational databases. 3D protein structure similarity searching can be performed in a declarative SQL query language in Oracle databases to support various analyses, including protein identification, finding common structural motifs or protein regions responsible for important cellular processes, and verification of predicted protein models.
It depends on the data set that is explored, its homogeneity, the algorithm used for 3D protein structure alignment, and the applied data partitioning scheme.

Summary

Introduction

As rightly pointed out by Gesing et al. [11], in many cases, peptide chains or protein conformations held by an institution, either predicted or obtained experimentally [12], may constitute intellectual property that cannot be shared through the Internet. They may be an important element in the drug production process and a decisive factor in the profits of a pharmaceutical company. Standard SQL does not support searches on the basis of biological patterns, like a protein amino acid sequence, a 3D protein structure, or a part of it. These kinds of searches must take into account the specificity of protein data. They require completely different algorithmic approaches for flexible searching that are based on sequence alignment [20], [21] or structure alignment [22] (Fig. 1).
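To illustrate why alignment-based searching differs from the exact matching that standard SQL offers, the following minimal sketch scores a query sequence against stored sequences with a textbook Needleman-Wunsch global alignment. It is not the PAR-P3D-SQL method and does not perform 3D structure alignment. The function names, the scoring values (`match`, `mismatch`, `gap`), and the example sequences are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def rank_by_similarity(query: str,
                       sequences: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (identifier, score) pairs ordered from most to least similar."""
    scored = [(pid, global_alignment_score(query, seq))
              for pid, seq in sequences.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical stored sequences; an equality predicate would match only "P1".
stored = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQK",
    "P3": "GGGGGGGGGG",
}
ranking = rank_by_similarity("MKTAYIAKQR", stored)
```

An exact-match predicate would return only the identical sequence, whereas the alignment score also ranks the near-identical entry above the unrelated one. This flexible behavior is the kind that pattern-based protein searches need.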
