Ranking-aware integration and explorative search of distributed bio-data

Marco Masseroli, Giorgio Ghisalberti, Matteo Picozzi

doi:10.14806/ej.18.b.555

Abstract

Motivation and Objectives

High-throughput production of both biomolecular data and their annotations is providing a rapidly increasing amount of valuable information that can potentially help find long-sought answers to fundamental biomedical questions. Yet this data deluge makes it difficult to extract the information that is most reliable and most relevant to the increasingly complex biomedical questions to be answered, which can simultaneously concern many heterogeneous aspects of single or multiple organisms, biological tissues, cells or biomolecular entities. To address such complex questions, bio-data on several heterogeneous topics, which are available but dispersed across different data sources, must be searched, extracted, integrated and comprehensively queried. Different approaches have been proposed to combine individual search services available on the Web in order to support such heterogeneous searches (Hull et al., 2006; Nekrutenko, 2010). Yet they rarely rely on a general model of the services to be integrated, and none considers, in the integration process, the partial rankings that are often available for the data to be integrated. Recently, Search Computing (Ceri et al., 2010) has been proposed as a new software framework to build answers to complex search queries by interacting with a collection of cooperating search services and using ranking and joining of results as the dominant factors for service composition. By leveraging the peculiar features of search services, it offers query approaches, execution plans, plan optimization techniques, query configuration tools, and exploratory user interfaces. Here, we report and discuss our work aimed at supporting the explorative search of heterogeneous distributed bio-data and the automatic integration and global ranking of their individual search results, also taking into account the partial rankings of individual searches.
In so doing, we take a step towards the computational support required for complex biomedical question answering and biomedical knowledge discovery.

Methods

According to the Service Mart modeling approach of Search Computing (Ceri et al., 2010), we selected an initial set of typical biomolecular topics (i.e. Protein, Gene, Gene Expression and Biological Function) and modeled the Service Marts (i.e. the generalized and normalized conceptual descriptions) of the bioinformatics services that provide data regarding such topics. We did so by identifying their main and common attributes and normalizing their names. We also defined the semantic Connection Patterns, i.e. the pairwise couplings, between Service Marts of services that provide data about different topics. This was done by identifying pairs of normalized attributes of the connected Service Marts and defining their comparison predicates, as conjunctive Boolean expressions, that allow their values to be joined semantically. In so doing, we defined the Semantic Resource Framework (SRF) depicted in Figure 1, which constitutes the reference used by Search Computing to enable the exploration of the services registered in the framework and to integrate the data that they provide (Ceri et al., 2010). Then, using available Search Computing tools, we registered in the Search Computing framework five bioinformatics search services that provide data about the topics and semantic associations described in the biomolecular
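
As an illustration of the ideas described under Methods, the following minimal Python sketch shows how a Connection Pattern, expressed as a conjunction of comparison predicates over pairs of normalized attributes, could drive a join of two partially ranked result lists and yield a global ranking. It is not the Search Computing implementation: the `ConnectionPattern` class, the `ranked_join` function, the attribute names, the toy records and the weighted-sum aggregation of partial scores are all assumptions made for this example.

```python
from dataclasses import dataclass
from operator import eq


@dataclass(frozen=True)
class ConnectionPattern:
    """Pairwise coupling of two Service Marts (hypothetical structure)."""

    # Each clause: (attribute of left mart, attribute of right mart, comparison).
    clauses: tuple

    def matches(self, left: dict, right: dict) -> bool:
        # Conjunctive Boolean expression: every clause must hold.
        return all(cmp(left[a], right[b]) for a, b, cmp in self.clauses)


def ranked_join(left_results, right_results, pattern, w_left=0.5, w_right=0.5):
    """Join two partially ranked result lists and rank the combinations.

    Assumption: each result is a dict carrying a 'score' in [0, 1], and the
    global score is a weighted sum of the two partial scores.
    """
    combos = []
    for left in left_results:
        for right in right_results:
            if pattern.matches(left, right):
                score = w_left * left["score"] + w_right * right["score"]
                combos.append((score, left, right))
    # Highest global score first.
    return sorted(combos, key=lambda combo: combo[0], reverse=True)


# Toy records, invented for illustration only.
proteins = [
    {"protein_id": "P1", "gene_symbol": "GENE_A", "score": 0.9},
    {"protein_id": "P2", "gene_symbol": "GENE_B", "score": 0.6},
]
expressions = [
    {"gene_symbol": "GENE_B", "tissue": "liver", "score": 0.8},
    {"gene_symbol": "GENE_A", "tissue": "brain", "score": 0.4},
    {"gene_symbol": "GENE_C", "tissue": "lung", "score": 1.0},
]

# Hypothetical pattern coupling the two marts on one normalized attribute.
pattern = ConnectionPattern(clauses=(("gene_symbol", "gene_symbol", eq),))
ranking = ranked_join(proteins, expressions, pattern)

for score, protein, expression in ranking:
    print(f"{score:.2f}", protein["protein_id"], expression["tissue"])
```

In this sketch the nested-loop join and the fixed weights are chosen only for brevity; a real ranking-aware join would typically avoid enumerating all pairs and would take its ranking function from the query configuration.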
