Interconnected Databases Research Articles

Purpose: The hypothesis of this long‐term project is that a multicentric information system built on four modules (multiparametric interconnected healthcare databases, data mining tools, updated machine learning based predictive algorithms, and user interfaces) will facilitate and accelerate research in oncology. We call this approach “Machine Learning Based Clinical Research (MLBCR)”. We performed a pilot project in patients with non‐small cell lung cancer (NSCLC), a disease for which clinical TNM stage is highly inaccurate for predicting the survival of non‐surgical patients and for which alternatives are currently lacking. The objectives of this study were to develop and validate a prediction model for survival of NSCLC patients treated with (chemo)radiotherapy, using clinical factors.

Patients and Methods: Three interconnected databases were mirrored into a data warehouse using a disease‐based, cohort‐specific data model. The three data sources were a) electronic medical records, b) imaging and DICOM‐RT objects in an RT‐PACS and c) treatment information in a record and verify database. Data from 403 consecutive inoperable NSCLC patients, stage I‐IIIB, treated radically with (chemo)radiation were selected. In 82 patients, data from blood samples were available. 2‐norm support vector machines were used to build the prognostic models. Performance of the models was expressed as the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and assessed using leave‐one‐out (LOO) cross‐validation. The prognostic model using clinical factors only was validated on two external, independent datasets with 36 and 65 patients, respectively. In addition, a risk score was calculated, and a nomogram, which is a graphical representation of the risk score, was made for practical use.

Results: The model based on 403 patients and using clinical factors consisted of gender, WHO performance status, forced expiratory volume (FEV1), number of positive lymph node stations on PET and gross tumor volume (on PET‐CT). The AUC, assessed by LOO cross‐validation, was 0.75 (95% CI 0.70–0.82), while application of the model to the external datasets yielded AUCs of 0.75 and 0.76, respectively. Splitting the MAASTRO cohort into three subgroups based on the risk score identified a high, a medium and a low risk group. The 2‐year survival was 66% (95% CI 54%–78%) for the low risk group, 29% (95% CI 21%–37%) for the medium risk group and 14% (95% CI 5%–23%) for the high risk group. When blood biomarkers were available (82 patients), the prognostic model included three additional biomarker factors: OPN, IL8 and CEA. The LOO AUC was 0.83 (95% CI 0.76–0.94), which is significantly better than that of the prognostic model using only clinical factors in the same 82 patients (AUC 0.71, 95% CI 0.60–0.87).

Conclusion: The model using clinical factors successfully estimates 2‐year survival of NSCLC patients, and its performance, assessed internally as well as in two independent datasets, is good. Combining blood biomarkers with clinical factors yielded a significantly better performance than using clinical factors only (AUC: 0.83 vs 0.71). We conclude that MLBCR is feasible. The bottleneck is the availability of external datasets. Therefore, we need to invest in international standards as well as in multicentric approaches that allow us to recruit more patients, preferably patients who have had different types of treatment, and to gain quick access to external validation datasets.

Conflict of Interest: This project has been partially funded by Siemens IKM.
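As a rough illustration of the evaluation scheme described in the abstract above (leave‐one‐out cross‐validation summarized by the AUC of the ROC curve), the following minimal Python sketch may help. It is not the authors' implementation: the 2‐norm support vector machine is replaced here by a toy mean‐difference scorer so that the example needs no external libraries, and the function names (`auc`, `fit_mean_difference`, `loo_scores`) and the six‐row dataset are hypothetical and synthetic.

```python
from typing import Callable, List

Row = List[float]
Scorer = Callable[[Row], float]


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference(X: List[Row], y: List[int]) -> Scorer:
    """Toy stand-in for the SVM (an assumption, not the paper's model).

    Scores a case by projecting it onto (mean of positives - mean of negatives).
    """
    dims = len(X[0])

    def mean(rows: List[Row]) -> Row:
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    mu_pos = mean([x for x, t in zip(X, y) if t == 1])
    mu_neg = mean([x for x, t in zip(X, y) if t == 0])
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def loo_scores(X: List[Row], y: List[int],
               fit: Callable[[List[Row], List[int]], Scorer]) -> List[float]:
    """Leave-one-out: refit on all cases but one, then score the held-out case."""
    scores = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        scores.append(model(X[i]))
    return scores


# Synthetic two-feature data, for illustration only (not patient data).
X_demo = [[1.0, 0.2], [0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
y_demo = [1, 1, 1, 0, 0, 0]
loo_auc = auc(loo_scores(X_demo, y_demo, fit_mean_difference), y_demo)
```

Because every held‐out case is scored by a model that never saw it, the pooled scores give an internal performance estimate; the external datasets mentioned in the abstract serve as a separate check that this sketch does not cover.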


Recent proteomic studies of protein domains require high-throughput and systematic approaches. Since most experiments using protein domains, the modules of protein-protein interactions, require gene cloning, the first experimental step is to retrieve the DNA sequences of domain encoding regions from databases. For large-scale proteomic research, however, manually extracting a large number of domain sequences from several interlinked databases is laborious. We present a new methodology to retrieve DNA sequences of domain encoding regions through automatic database cross-referencing. To extract protein domain encoding regions, it traverses several interconnected databases and applies a validation process. We applied this method to retrieve all the EGF domain encoding DNA sequences of Homo sapiens. The algorithm was implemented using the Python library PAMIE, which enables automatic cross-referencing across distinct databases.

Corresponding Author: Sanguk Kim (Email: sukim@postech.ac.kr)

This work was supported by the Korea Research Foundation Grant by the Korean Government (MOEHRD) (KRF-2005-070-C00095) and POSTECH BSRI research fund-2005.

Introduction

Genome projects are generating vast amounts of data that reveal the existence of thousands of new gene products, especially proteins responsible for cellular regulation. However, these data do not immediately reveal what these proteins do, nor how they are assembled into the molecular machines and functional networks that control cellular behavior (Pawson et al., 2003). Cellular processes and the overall molecular architectures of all organisms are largely mediated through elaborate scaffolds of protein-protein interactions.
Thus, high-throughput strategies to study protein-protein interactions, such as yeast two-hybrid screening, have been developed to describe protein interaction networks and to construct protein interaction maps in model organisms (Uetz et al., 2000, Li et al., 2004, Ghavidel et al., 2005). However, because proteins interact with more than one partner at a time, large-scale protein-protein interaction data are difficult to interpret (Santonico et al., 2005). Protein domains represent the modular nature of proteins; they fold independently and often perform specific tasks. While protein domains can interact with several binding partners, they are single binding modules and interact with only one partner at a time (Santonico et al., 2005). Thus, domain knowledge can help to obtain a clearer representation of protein networks.

Experiments using protein domains require the sequences of domain encoding regions to be extracted from distinct databases for gene cloning and protein expression, and this process is often performed manually (Yu et al., 2004). For high-throughput proteomic experiments, however, manual retrieval is daunting for three reasons. First, large-scale experiments require information on hundreds or thousands of protein domains. Second, domain knowledge is not located in a single source, so one must cross-reference interconnected databases that are updated separately. Third, an iterative extraction process can be error-prone, since databases sometimes contain dubious entries and point to missing links. Thus, proper decision-making policies are essential to eliminate database entry errors and to validate the results. Therefore, there is a need for a bioinformatics methodology that retrieves the genetic information of domain encoding regions for large-scale proteomic research.

Bioinformatics and Biosystems 2006, Vol. 2, No. 1, pp. 94-97

Here we developed a methodology to extract protein domain encoding DNA sequences automatically from three distinct databases, Pfam, UniProt and GenBank (Finn et al., 2006, Wu et al., 2006, Benson et al., 2006), using the Python library PAMIE. The algorithm also includes a validation process to verify the retrieved data. We applied this method to extract all the EGF domain encoding regions of Homo sapiens for further large-scale proteomic experiments. The EGF (Epidermal Growth Factor) domain is a widely distributed, independently folding protein module that is thought to play a general role in extracellular events such as adhesion, coagulation, and receptor-ligand interactions (Downing et al., 1996).

Figure 1. The algorithm for retrieving domain encoding sequences through database cross-referencing
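The cross-referencing idea described above (domain annotation, then protein record, then nucleotide record, with validation along the way) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it does not use PAMIE or query Pfam, UniProt or GenBank, the three dictionaries are hypothetical in-memory stand-ins with invented toy accessions, and the two validation checks (the coding sequence must translate to the protein, and the extracted region must translate to the annotated domain) are assumptions, since the excerpt does not spell out the validation rules.

```python
from itertools import product
from typing import Dict, List, Tuple

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate an uppercase coding DNA string ('*' marks a stop codon)."""
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Hypothetical stand-ins for the three databases (toy records, not real entries).
# Domain hits: protein accession -> (domain name, first residue, last residue),
# with 1-based inclusive residue positions.
DOMAIN_HITS: Dict[str, List[Tuple[str, int, int]]] = {"P_TOY": [("EGF", 2, 3)]}
# Protein records: sequence plus a cross-reference to a nucleotide accession.
PROTEINS: Dict[str, Dict[str, str]] = {
    "P_TOY": {"sequence": "MCPG", "nucleotide_acc": "N_TOY"},
}
# Nucleotide records: accession -> coding sequence.
NUCLEOTIDES: Dict[str, str] = {"N_TOY": "ATGTGTCCTGGTTAA"}


def domain_dna(protein_acc: str, domain: str) -> List[str]:
    """Follow the cross-references and return validated domain encoding DNA."""
    protein = PROTEINS[protein_acc]
    cds = NUCLEOTIDES.get(protein["nucleotide_acc"])
    if cds is None:
        raise LookupError("cross-reference points to a missing nucleotide record")
    # Validation 1 (assumed): the coding sequence must encode this protein.
    if translate(cds).rstrip("*") != protein["sequence"]:
        raise ValueError("coding sequence does not match the protein record")
    regions = []
    for name, start, end in DOMAIN_HITS.get(protein_acc, []):
        if name != domain:
            continue
        # Residue i (1-based) is encoded by nucleotides 3*(i-1) .. 3*i - 1.
        dna = cds[(start - 1) * 3:end * 3]
        # Validation 2 (assumed): the region must translate to the domain.
        if translate(dna) != protein["sequence"][start - 1:end]:
            raise ValueError("extracted region does not encode the domain")
        regions.append(dna)
    return regions


egf_regions = domain_dna("P_TOY", "EGF")
```

Raising an error on a missing link or a mismatched translation, rather than silently skipping the entry, is one way to surface the dubious entries and missing links that the text identifies as a source of errors in manual retrieval.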



Articles published on Interconnected Databases

INTEGRATED DATABASE SYSTEM FOR MOBILE DIETARY ASSESSMENT AND ANALYSIS.

TU‐D‐AUD A‐02: Machine Learning Based Clinical Research: The Example of Lung Cancer

Retrieving Protein Domain Encoding DNA Sequences Automatically Through Database Cross-referencing

Mobile Elements as a Combination of Functional Modules

HuGeMap: a distributed and integrated Human Genome Map database.

Protocols for Integrity Constraint Checking in Federated Databases

Coping with the imprecision of the real world
