Semi-structured Data Sources Research Articles

The increasing need to represent information through more complex structures, in which semantics and relationships among data objects can be expressed more easily, has resulted in many semi-structured data sources. Structure comparison among semi-structured data objects can often reveal valuable information, and hence tree mining has gained considerable interest in areas such as XML mining, bioinformatics, and Web mining. We are primarily concerned with the task of mining frequent ordered induced and embedded subtrees from a database of rooted ordered labeled trees. Our previous contributions consist of the efficient Tree Model Guided (TMG) candidate enumeration approach, for which we developed a mathematical model that provides an estimate of the worst-case complexity for embedded subtree mining. This model can reveal computationally impractical situations in which one is forced to constrain the mining process in some way so that at least some patterns can be discovered. This motivated our strategy of tackling the complexity of mining embedded subtrees by introducing the Level of Embedding constraint. Thus, when it is too costly to mine all frequent embedded subtrees, one can gradually decrease the level of embedding constraint down to 1, at which point all frequent subtrees obtained are induced subtrees. In this paper we develop alternative implementations and propose two algorithms, MB3-R and iMB3-R, which achieve better efficiency in terms of time and space. Furthermore, we develop a mathematical model for estimating the worst-case complexity for induced subtree mining. It is accompanied by a theoretical analysis of induced-embedded subtree relationships in terms of complexity for frequent subtree mining. Using synthetic and real-world data, we demonstrate in practice the space and time efficiency of our new approach and provide comparisons with two well-known algorithms for mining induced and embedded subtrees.
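The role of the Level of Embedding constraint described in the abstract above can be illustrated with a minimal sketch. It is not the MB3-R or iMB3-R algorithm and does no frequency counting. It assumes that the level of embedding of an ancestor-descendant pair is the length of the path between the two nodes in the data tree, so a subtree occurrence whose edges all have level 1 is induced. The toy tree and the names `PARENT`, `LABEL`, `level_of_embedding`, and `admitted` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Toy rooted ordered labeled tree (hypothetical data), nodes numbered in pre-order.
# PARENT[i] is the parent of node i; the root has parent None.
#        0:a
#       /    \
#     1:b    4:b
#    /   \     \
#  2:c   3:d   5:e
PARENT: List[Optional[int]] = [None, 0, 1, 1, 0, 4]
LABEL: List[str] = ["a", "b", "c", "d", "b", "e"]


def depth(node: int) -> int:
    """Number of edges between the root and the given node."""
    d = 0
    current = PARENT[node]
    while current is not None:
        d += 1
        current = PARENT[current]
    return d


def is_ancestor(anc: int, desc: int) -> bool:
    """True if anc lies on the path from the root to desc (excluding desc)."""
    current = PARENT[desc]
    while current is not None:
        if current == anc:
            return True
        current = PARENT[current]
    return False


def level_of_embedding(edges: List[Tuple[int, int]]) -> int:
    """Largest ancestor-descendant path length over the edges of one occurrence.

    Each edge is (ancestor_id, descendant_id) in the data tree.
    Assumption: level of embedding = path length between the two nodes.
    """
    level = 0
    for anc, desc in edges:
        if not is_ancestor(anc, desc):
            raise ValueError(f"{anc} is not an ancestor of {desc}")
        level = max(level, depth(desc) - depth(anc))
    return level


def admitted(edges: List[Tuple[int, int]], max_level: int) -> bool:
    """True if the occurrence satisfies the maximum level of embedding."""
    return level_of_embedding(edges) <= max_level


if __name__ == "__main__":
    induced = [(0, 1), (1, 2)]    # a-b-c: every edge is a parent-child edge
    embedded = [(0, 2), (0, 5)]   # a with descendants c and e, skipping the b nodes
    print(level_of_embedding(induced), level_of_embedding(embedded))
    print(admitted(embedded, 1), admitted(embedded, 2))
```

Under these assumptions, lowering `max_level` to 1 rejects the second occurrence and keeps only occurrences built from parent-child edges, which mirrors the abstract's remark that at level 1 all obtained subtrees are induced.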

Providing integrated access to multiple heterogeneous sources is a challenging issue in global information systems for cooperation and interoperability. In this context, two fundamental problems arise. The first is how to determine whether the sources contain semantically related information, that is, information related to the same or similar real-world concept(s). The second is how to handle semantic heterogeneity to support integration and uniform query interfaces. Compared with conventional view integration techniques, complicating factors are that the sources to be integrated already exist and that semantic heterogeneity occurs on a large scale, involving the terminology, structure, and context of the sources with respect to geographical, organizational, and functional aspects of information use. Moreover, to meet the requirements of global, Internet-based information systems, it is important that the tools developed to support these activities be as semi-automatic and scalable as possible. The goal of this paper is to describe the MOMIS [4, 5] (Mediator envirOnment for Multiple Information Sources) approach to the integration and querying of multiple, heterogeneous information sources containing structured and semistructured data. MOMIS was conceived as a joint collaboration between the Universities of Milano and Modena in the framework of the INTERDATA national research project, which aims to provide methods and tools for data management in Internet-based information systems. Like other integration projects [1, 10, 14], MOMIS follows a “semantic approach” to information integration based on the conceptual schema, or metadata, of the information sources, and on the following architectural elements: i) a common object-oriented data model, defined according to the ODL_I3 language, to describe source schemas for integration purposes. The data model and ODL_I3 have been defined in MOMIS as a subset of the ODMG-93 ones, following the proposal for a standard mediator language developed by the I3/POB working group [7]. In addition, ODL_I3 introduces new constructors to support the semantic integration process [4, 5]; ii) one or more wrappers, to translate schema descriptions into the common ODL_I3 representation; iii) a mediator and a query-processing component, based on two pre-existing tools, namely ARTEMIS [8] and ODB-Tools [3] (available on the Internet at http://sparc20.dsi.unimo.it/), to provide an I3 architecture for integration and query optimization. In this paper, we focus on capturing and reasoning about semantic aspects of schema descriptions of heterogeneous information sources to support integration and query optimization. Both semistructured and structured data sources are taken into account [5]. A Common Thesaurus is constructed, which plays the role of a shared ontology for the information sources. The Common Thesaurus is built by analyzing the ODL_I3 descriptions of the sources and by exploiting the Description Logic OLCD (Object Language with Complements allowing Descriptive cycles) [2, 6], derived from the KL-ONE family [17]. The knowledge in the Common Thesaurus is then exploited to identify semantically related information in the ODL_I3 descriptions of different sources and to integrate them at the global level. Mapping rules and integrity constraints are defined at the global level to express the relationships holding between the integrated description and the source descriptions. ODB-Tools, which supports OLCD and description logic inference techniques, allows the source descriptions to be analyzed to generate a consistent Common Thesaurus and provides support for semantic optimization of queries at the global level, based on the defined mapping rules and integrity constraints.

Articles published on Semi-structured Data Sources

Examining Knowledge Extraction Processes from Heterogeneous Data Sources

Exploring the use of topological data analysis to automatically detect data quality faults.

How can Transnational Municipal Networks foster local collaborative governance regimes for environmental management?

Building Semantic Knowledge Graphs from (Semi-)Structured Data: A Review

A method of semi-automated ontology population from multiple semi-structured data sources

Automating Data Mart Construction from Semi-structured Data Sources

Secondary use of electronic health records for building cohort studies through top-down information extraction

A low redundancy strategy for keyword search in structured and semi-structured data

Resource specification and intelligent user interaction for federated testbeds using Semantic Web technologies

Semantic Consistency Checking in Building Ontology from Heterogeneous Sources

Improving Situational Awareness for Precursory Data Classification using Attribute Rough Set Reduction Approach

Changing Times for Charities: Performance management in a Third Sector Housing Association

Study on Textual Data Sources Based Real-Time ETL

Mining Induced/Embedded Subtrees using the Level of Embedding Constraint

A Framework for Extracting, Classifying, Analyzing, and Presenting Information from Semi-Structured Web Data Sources

Dealing with Uncertainty in Lexical Annotation

Cyclical Structure Converter (CSC): a System for Handling the Interaction of Structured and Semi-structured Data Sources

Semantic integration of heterogeneous information sources

Semantic integration of semistructured and structured data sources

Grammars have exceptions
