Analytical Queries for Unstructured Data
- Research Article
36
- 10.1016/j.jbi.2015.12.005
- Dec 17, 2015
- Journal of Biomedical Informatics
Unstructured medical image query using big data – An epilepsy case study
- Book Chapter
- 10.5772/36863
- Feb 1, 2012
Traditional data warehousing has been very successful in helping business enterprises make intelligent decisions through declarative analysis of large amounts of structured data stored in relational databases. However, not all enterprise data fit naturally into a relational model. Within an enterprise there are huge amounts of unstructured data, such as document content, emails, and spreadsheets, that have no fixed schema, or whose schema is too sparse or loose to be modeled effectively with the relational model. Yet, like relational data, unstructured data record many useful facts that are equally essential for businesses to analyze when making intelligent decisions. In this chapter, we propose an XML-enabled RDBMS that uses XML as the underlying logical data model to uniformly represent well-structured relational data together with semi-structured and unstructured data, building an enterprise data warehouse able to store and analyze any data regardless of whether a schema exists. We show how XQuery, used within SQL/XML as a declarative language, performs data query, analysis, and transformation over both structured data and unstructured content in the warehouse. We present the rationale for using XML as the logical data model for unified warehouse queries, and an XML-extended inverted text index that integrates structured data query with context-aware full-text search over unstructured content, so as to support efficient analysis over large volumes of structured and unstructured data. We argue that this approach of using XML to unify structured and unstructured data in a warehouse has the potential to push business intelligence over all enterprise data into a new era.
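The unified-query idea above can be sketched in miniature: assuming a hypothetical warehouse fragment in which relational-style rows and free-text documents share one XML tree, a single function can combine a structured lookup with a full-text match. Python's `ElementTree` stands in for the SQL/XML engine, and all element names and data are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical warehouse fragment: a relational-style <orders> table and
# free-text <memo> documents stored side by side under one XML root.
warehouse = ET.fromstring("""
<warehouse>
  <orders>
    <order id="1"><customer>Acme</customer><total>120</total></order>
    <order id="2"><customer>Globex</customer><total>75</total></order>
  </orders>
  <docs>
    <memo>Acme requested expedited shipping for order 1.</memo>
    <memo>Quarterly totals look stable.</memo>
  </docs>
</warehouse>
""")

def query(root, customer):
    """Join structured rows with full-text matches, XQuery-style."""
    totals = [int(o.findtext("total"))
              for o in root.iterfind("orders/order")
              if o.findtext("customer") == customer]
    memos = [m.text for m in root.iterfind("docs/memo")
             if customer.lower() in m.text.lower()]
    return sum(totals), memos

total, memos = query(warehouse, "Acme")
```

A real XML-enabled RDBMS would evaluate the equivalent XQuery inside SQL/XML against an inverted text index rather than scanning the tree, but the shape of the answer — structured aggregate plus matching unstructured content — is the same.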
- Conference Article
2
- 10.1117/12.2011072
- Mar 20, 2013
A data warehouse is a technology designed to support decision making. It is built by extracting large amounts of data from different operational systems, transforming it into a consistent form, and loading it into a central repository. The queries issued in a data warehouse environment differ from those in operational systems: analytical queries summarize large volumes of data and would therefore normally take a long time to answer. At the same time, these queries must be answered quickly so that managers can make decisions in as short a time as possible. Improving query performance is therefore an essential need in this environment. One of the most popular methods for this task is to use pre-computed query results: whenever a user submits a new query, the pre-computed results, or views, answer it instead of computing the query on the fly over the large underlying database. Although the ideal option would be to pre-compute and store all possible views, in practice disk-space constraints and view-update overhead make this infeasible. We therefore need to select a subset of the possible views to store on disk. Selecting the right subset of views is an important challenge in data warehousing. In this paper we suggest a Weighted Based Genetic Algorithm (WBGA) for solving the view selection problem with two objectives.
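The view-selection step can be illustrated with a minimal genetic-algorithm sketch. The per-view benefits, sizes, and disk budget below are invented for illustration (they are not from the paper), and the single-objective fitness is a simplification of the WBGA's weighted objectives.

```python
import random

random.seed(0)

# Hypothetical inputs: per-view query-cost savings, sizes, and a disk budget.
benefit = [30, 22, 18, 12, 9, 5]   # cost saved per query if view i is materialized
size    = [40, 25, 20, 15, 10, 5]  # disk pages needed by view i
budget  = 60

def fitness(bits):
    """Total benefit of the selected views; infeasible selections score 0."""
    used = sum(s for s, b in zip(size, bits) if b)
    return sum(v for v, b in zip(benefit, bits) if b) if used <= budget else 0

def evolve(pop_size=20, generations=50):
    pop = [[random.randint(0, 1) for _ in benefit] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]            # keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, len(benefit))   # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(len(child))          # point mutation
            child[i] ^= 1
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

best = evolve()
```

Each chromosome is a bit vector over candidate views; the budget constraint models the disk-space limit that makes materializing every view infeasible.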
- Conference Article
97
- 10.1109/icde.2000.839454
- Feb 29, 2000
XML is here as the Internet standard for information exchange among e-businesses and applications. With its dramatic adoption and its ability to model structured, semi-structured, and unstructured data, XML has the potential of becoming the data model for Internet data. In recent years, Oracle has evolved its DBMS to support complex, structured, and unstructured data. Oracle has now extended that technology to enable the storage and querying of XML data by evolving its DBMS into an XML-enabled DBMS, Oracle8i. We present Oracle's XML-enabling database technology; in particular, we discuss how XML data can be stored, managed, and queried in the Oracle8i database.
- Conference Article
11
- 10.1109/socialinformatics.2012.87
- Dec 1, 2012
The recent development of social media (e.g., Twitter, Facebook, blogs, etc.) provides an unprecedented opportunity to study human social cultural behaviors. These data sources provide rich structured data (e.g., XML, relational tables, and categorical data) as well as unstructured data (e.g., texts). A significant challenge is to summarize and navigate structured data together with unstructured text data for efficient query and analysis. In this paper we introduce a text cube architecture designed to organize social media data in multiple dimensions and hierarchies for efficient information query and visualization from multiple perspectives. For example, an affective process cube allows the analyst to examine public reaction (e.g., sadness, anger) to a range of social phenomena. The text cube architecture also supports the development of prediction models using the summarized statistics stored in a data cube. For example, models that detect events, such as violent protests in the Egyptian Revolution, can be built using the linguistic features stored in an event data cube. These kinds of models represent a higher level of knowledge representation and may help to develop more effective strategies for decision-making based on social media data.
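A text cube of the kind described can be sketched with plain dictionaries: dimension values index the cells, and each cell aggregates linguistic statistics from the unstructured text. The posts, dimensions, and affect lexicon below are hypothetical stand-ins for real social media data.

```python
from collections import defaultdict

# Hypothetical posts: each carries dimension values plus unstructured text.
posts = [
    {"region": "Cairo", "day": "2011-01-25", "text": "angry crowds protest downtown"},
    {"region": "Cairo", "day": "2011-01-26", "text": "sad news from the square"},
    {"region": "Alexandria", "day": "2011-01-25", "text": "angry marches by the sea"},
]

# Toy affect lexicon mapping surface terms to affective processes.
AFFECT = {"angry": "anger", "sad": "sadness"}

def build_cube(posts, dims):
    """Aggregate affect-term counts in every cell of the chosen dimensions."""
    cube = defaultdict(lambda: defaultdict(int))
    for p in posts:
        cell = tuple(p[d] for d in dims)      # one cell per dimension combination
        for token in p["text"].split():
            if token in AFFECT:
                cube[cell][AFFECT[token]] += 1
    return cube

cube = build_cube(posts, ["region"])
```

Rolling up by a coarser dimension set (e.g., `["day"]` or `[]`) reuses the same aggregation, which is what makes cube-style summaries convenient for multi-perspective visualization.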
- Research Article
5
- 10.1109/tkde.2012.200
- Nov 1, 2013
- IEEE Transactions on Knowledge and Data Engineering
We present the Caicos system that supports continuous infidelity-bounded queries over a data stream, where each data item (of the stream) belongs to multiple categories. Caicos is made up of four primitives: Keywords, Queries, Data items, and Categories. A Category is a virtual entity consisting of all those data items that belong to it. The membership of a data item to a category is decided by evaluating a Boolean predicate (associated with each category) over the data item. Each data item and query in turn are associated with multiple keywords. Given a keyword query, unlike conventional unstructured data querying techniques that return the top-K documents, Caicos returns the top-K categories with infidelity less than the user-specified infidelity bound. Caicos is designed to continuously track the evolving information present in a highly dynamic data stream. It, hence, computes the relevance of a category to the continuous keyword query using the data items occurring in the stream in the recent past (i.e., within the current window). To efficiently provide up-to-date answers to the continuous queries, Caicos needs to maintain the required metadata accurately. This requires addressing two subproblems: 1) identifying the metadata that needs to be updated for providing accurate results and 2) updating the metadata in an efficient manner. We show that the problem of identifying the right metadata can be further broken down into two subparts. We model the first subpart as an inequality constrained minimization problem and propose an innovative iterative algorithm for the same. The second subpart requires us to build an efficient dynamic programming-based algorithm, which helps us to find the right metadata that needs to be updated. Updating the metadata on multiple processors is a scheduling problem whose complexity is exponential in the length of the input. An approximate multiprocessor scheduling algorithm is, hence, proposed.
Experimental evaluation of Caicos using real-world dynamic data shows that Caicos is able to provide fidelity close to 100 percent using 45 percent fewer resources than the techniques proposed in the literature. This ability of Caicos to work accurately and efficiently even in scenarios with high data arrival rates makes it suitable for data-intensive application domains.
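The window-based category scoring can be loosely sketched as follows. This simplified model (keyword-hit counting over a fixed-size sliding window) is an assumption for illustration, not Caicos's actual relevance computation or metadata-maintenance machinery; all categories and keywords are invented.

```python
from collections import deque, Counter

class WindowedTopK:
    """Toy sliding-window scorer: top-K categories for one continuous query."""

    def __init__(self, window):
        self.window = window
        self.items = deque()          # (categories, hits) in arrival order
        self.scores = Counter()       # accumulated keyword hits per category

    def push(self, categories, keywords, query_terms):
        hits = len(set(keywords) & set(query_terms))
        self.items.append((categories, hits))
        for c in categories:
            self.scores[c] += hits
        if len(self.items) > self.window:      # expire the oldest item
            old_cats, old_hits = self.items.popleft()
            for c in old_cats:
                self.scores[c] -= old_hits

    def top_k(self, k):
        return [c for c, _ in self.scores.most_common(k)]

w = WindowedTopK(window=3)
q = {"protest", "election"}
w.push({"politics"}, {"election", "vote"}, q)          # 1 hit for politics
w.push({"sports"}, {"match"}, q)                       # 0 hits
w.push({"politics", "world"}, {"protest"}, q)          # 1 hit each
```

Expiring old items on arrival is what keeps the answer tied to the recent past (the current window) rather than the whole stream history.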
- Research Article
2
- 10.1007/s12008-021-00768-y
- Sep 2, 2021
- International Journal on Interactive Design and Manufacturing (IJIDeM)
Manufacturing industry data are distributed, heterogeneous and numerous, resulting in different challenges including fast, exhaustive and relevant querying of data. In order to provide an innovative answer to this challenge, the authors consider an information retrieval system based on a graph database. In this paper, the authors focus on determining the key issues to consider in this context. The authors define a three-step methodology using root cause analysis. This methodology is then applied to a data set and queries representative of an industrial use case. As a result, the authors list four main issues: (i) the semantic extension of keyword search, (ii) the treatment of syntactic heterogeneity contained in unstructured data, (iii) the ordering of results by relevance and (iv) the detection of relationships between a priori unrelated data. The authors conclude by discussing potential resolutions of these four issues, suggest adapting the methodology used in the paper to evaluate a future proposal, and finally open the possibility of using the results beyond the manufacturing domain.
- Research Article
- 10.4028/www.scientific.net/amm.48-49.1271
- Feb 1, 2011
- Applied Mechanics and Materials
Much of the information in an MES is unstructured, such as drawings and documents, and it is very important for an MES to reasonably manage and reuse this unstructured information. This paper puts forward a new management strategy for unstructured information. First, the basic management method and its realization procedure are briefly presented; second, the technologies behind the strategy, such as layout analysis, content structuring, data management based on XML Schema, and data query and analysis, are analyzed in detail; last, the practical use of the strategy and directions for further research are described. The strategy put forward in this paper can greatly enhance the value of unstructured information.
- Research Article
4
- 10.1007/s10586-017-1320-7
- Nov 16, 2017
- Cluster Computing
Big data is characterized by expanding volumes of high-velocity, complex, and varied data. Organizations that hold vast amounts of data rely on a new generation of analytic tools built for it, while conventional data-intensive business applications fall behind the times because they lack the ability to manage large data volumes and unstructured information, suffer low rates of information retrieval, and struggle with complex algorithms; big data depends on data complexity rather than data size alone. To address this problem, this paper establishes a mutual refinement technique for big data retrieval that improves performance. The proposed system comprises a training phase and a retrieval phase, performed consecutively. In the training phase, the input data are first preprocessed by splitting; frequency and entropy features are then extracted from the preprocessed data, after which the data are submitted to the mutual refinement step, where a hash-tag graph is generated to train the data and remove its uncertainty. In the retrieval phase, the input query is used for similarity assessment: frequency and entropy features are extracted from the query and compared against the hash-tag graph, and when the feature values match, the data are retrieved from the hash-tag graph and visualized. The technique's performance is assessed by comparing the proposed work with other conventional works; the experimental output shows that the mutual refinement process improves system performance by removing the uncertainty in the system. This work offers a unique mutual refinement approach that yields better outcomes for retrieving big data in a proficient manner.
The proposed retrieval process performs well, but future experiments on larger datasets and real-time applications are needed to measure the method's effectiveness.
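The frequency and entropy features mentioned above can be sketched directly: term frequencies per document, plus the Shannon entropy of the document's term distribution. The example text is invented; only the feature definitions follow the abstract.

```python
import math
from collections import Counter

def features(text):
    """Term-frequency counts plus Shannon entropy of the term distribution."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)   # H = -sum p*log2(p)
    return counts, entropy

counts, h = features("big data big query")
```

For this toy document the distribution is {0.5, 0.25, 0.25}, giving an entropy of 1.5 bits; a query would be featurized the same way before being matched against the hash-tag graph.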
- Conference Article
2
- 10.1109/dasc.2013.135
- Dec 1, 2013
With the rapid development of information technology, the need for unstructured data storage and processing is growing rapidly, creating new requirements for database storage. Traditional row-oriented relational databases appear inadequate for querying and analyzing such data. In this paper, we propose a novel approach to storing unstructured data in a relational database: by splitting the VALUE property of unstructured KEY/VALUE data and recreating two-dimensional data, the original data can be stored in relational databases. The system introduced in this paper is designed to handle this task. In addition, it rebuilds SQL as its query language, which makes it compatible with relational databases. In query experiments over unstructured data, the outcomes show that the system is good at decomposing the SQL statements submitted by users and generating correct sub-query statements, and that its performance is good.
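The VALUE-splitting idea can be sketched with `sqlite3` standing in for the relational engine. The three-column (key, attribute, value) layout and the sample records are assumptions for illustration, not the paper's exact schema.

```python
import sqlite3

# Sketch of the idea: split the VALUE part of KEY/VALUE records into
# attribute rows so a relational engine can query them with plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT, attr TEXT, val TEXT)")

def store(key, value_dict):
    """Decompose one unstructured VALUE into two-dimensional rows."""
    conn.executemany("INSERT INTO kv VALUES (?, ?, ?)",
                     [(key, a, str(v)) for a, v in value_dict.items()])

# Two records with different, schema-less attribute sets.
store("doc1", {"author": "Liu", "year": 2013, "topic": "storage"})
store("doc2", {"author": "Chen", "topic": "query"})

rows = conn.execute(
    "SELECT key FROM kv WHERE attr = 'topic' AND val = 'storage'").fetchall()
```

Because every attribute becomes a row rather than a column, records with differing attribute sets coexist in one table, and ordinary SQL predicates reach inside the former VALUE blobs.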
- Research Article
- 10.32985/ijeces.15.7.3
- Jul 12, 2024
- International journal of electrical and computer engineering systems
Every government in the world has multiple departments that must function and operate to address the various inquiries raised by the population. The government's diverse range of websites offers citizens a platform to submit inquiries, thereby facilitating the fulfilment of their requirements. Comprehending the subjects addressed in citizen queries is essential for government services. Unstructured query data are analyzed using text-mining techniques such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA); LSA outperforms other methods because of its low complexity and fast implementation. Concerns about limited data availability and privacy make research on decentralized learning techniques for natural language processing (NLP) necessary. Federated learning (FL) employs methods that enable different users to collectively train an integrated global model while keeping their information stored locally. Nevertheless, the current body of literature lacks a thorough examination and evaluation of FL techniques. Data federation is an approach to data integration that allows the government to access and query data from multiple diverse sources as if they were a single, unified repository; functioning as a form of data virtualization, it facilitates the creation of a comprehensive representation of data, thereby enhancing operational efficiency and the accuracy of decision-making. FedEx utilizes Federated Learning to apply topic-modelling techniques to common NLP tasks. The proposed structure integrates the FL methodology with Latent Semantic Analysis to deliver outcomes for intelligent data analysis and management.
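The federated step can be sketched minimally. Here each "client model" is just a normalized term-frequency table, a deliberately simple stand-in for real topic-model parameters, and only these parameters, never the raw queries, reach the averaging server; the department queries are invented.

```python
def local_term_weights(queries):
    """Per-client 'model': normalized term frequencies from local query text."""
    counts, total = {}, 0
    for q in queries:
        for term in q.lower().split():
            counts[term] = counts.get(term, 0) + 1
            total += 1
    return {t: c / total for t, c in counts.items()}

def federated_average(client_models):
    """Server step: average parameters without ever seeing raw queries."""
    terms = set().union(*client_models)
    n = len(client_models)
    return {t: sum(m.get(t, 0.0) for m in client_models) / n for t in terms}

# Two hypothetical departments train locally on their own citizen queries.
dept_a = local_term_weights(["road repair delay", "road tax query"])
dept_b = local_term_weights(["water supply query", "road flooding"])
global_model = federated_average([dept_a, dept_b])
```

The same averaging pattern applies when the per-client parameters are LSA term-topic weights instead of raw frequencies; the privacy property comes from exchanging parameters rather than data.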
- Research Article
- 10.4028/www.scientific.net/amm.713-715.2423
- Jan 1, 2015
- Applied Mechanics and Materials
With the rapid development of Internet of Things technology, IoT terminal equipment collects large amounts of data, so preprocessing and storage become a big challenge. In this paper we present a universal preprocessing and storage architecture for IoT data in a cloud environment. In the data preprocessing module, because sensor equipment is not stable and produces many erroneous and missing readings, we propose an imputation algorithm based on clustering. For the data storage module, because much of the data is unstructured or semi-structured, we present a storage architecture for heterogeneous data in a cloud environment. Experiments show that our architecture can effectively complete data preprocessing and storage, and provides good support for subsequent work such as data querying.
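The clustering-based imputation can be sketched as follows. The sensor readings and the 1-D k-means grouping on humidity are illustrative assumptions, not the paper's algorithm: complete records are clustered, and a missing temperature is filled with the mean of its nearest cluster.

```python
# Hypothetical sensor records; one temperature reading was lost in transit.
readings = [
    {"temp": 21.0, "hum": 40}, {"temp": 22.0, "hum": 42},
    {"temp": 30.0, "hum": 80}, {"temp": 31.0, "hum": 78},
    {"temp": None, "hum": 41},   # faulty sensor: temperature missing
]

def impute(records, k=2, rounds=5):
    complete = [r for r in records if r["temp"] is not None]
    centroids = [complete[0]["hum"], complete[-1]["hum"]]    # crude seeding
    for _ in range(rounds):                                   # 1-D k-means
        groups = [[] for _ in centroids]
        for r in complete:
            i = min(range(k), key=lambda i: abs(r["hum"] - centroids[i]))
            groups[i].append(r)
        centroids = [sum(r["hum"] for r in g) / len(g) for g in groups]
    for r in records:
        if r["temp"] is None:
            i = min(range(k), key=lambda i: abs(r["hum"] - centroids[i]))
            g = groups[i]
            r["temp"] = sum(x["temp"] for x in g) / len(g)    # cluster mean
    return records

filled = impute(readings)
```

Cluster-mean imputation beats a global mean here because the low-humidity and high-humidity regimes have very different temperatures; the missing reading is filled from its own regime only.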
- Conference Article
2
- 10.1145/3372938.3372947
- Oct 23, 2019
Data partitioning is a well-known technique for optimizing decision-support query performance. In this paper, we present a horizontal data partitioning approach tailored to a large data warehouse interrogated by a high number of queries. The idea behind our approach is to horizontally partition only the large fact table, based on partitioning predicates elected from the set of selection predicates used by the analytic queries. The election of partitioning predicates depends on their numbers of occurrences, their access frequencies, and their selectivities. With the Star Schema Benchmark under Oracle 12c, we demonstrate that our partitioning technique reduces both query response time and the number of fact partitions, the latter being the major drawback of existing partitioning techniques. We also show that our partitioning algorithm is around 66% faster than the primary and derived partitioning techniques based on the genetic algorithm.
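The predicate-election step can be sketched as a scoring pass over the workload's selection predicates. The combined score below (occurrences × access frequency × a selectivity reward) is an assumed formula for illustration, not the paper's exact criterion, and the predicate statistics are invented in the style of the Star Schema Benchmark.

```python
# (predicate, occurrences in workload, access frequency, selectivity)
predicates = [
    ("d_year = 1997",                 14, 0.30, 0.14),
    ("lo_discount BETWEEN 1 AND 3",    6, 0.20, 0.43),
    ("s_region = 'ASIA'",              9, 0.25, 0.20),
    ("p_category = 'MFGR#12'",         3, 0.05, 0.04),
]

def elect(preds, top=2):
    """Keep the `top` predicates with the best combined score."""
    def score(p):
        _, occ, freq, sel = p
        return occ * freq * (1 - sel)   # reward frequent, selective predicates
    return [p[0] for p in sorted(preds, key=score, reverse=True)[:top]]

chosen = elect(predicates)
```

Capping the number of elected predicates is what bounds the number of fact-table partitions, since each additional partitioning predicate multiplies the fragment count.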
- Book Chapter
7
- 10.1007/978-81-322-2550-8_24
- Jan 1, 2015
The evolution of Web 2.0 has rapidly contributed to the volume and variety of data. Semi-structured and unstructured data are among the varieties generated by different sources in Web 2.0, and the challenge is to handle data that has no consistent format. Handling semi-structured data, where formats vary, calls for a DBMS that is less restrictive about the structure of the stored data. This paper discusses the features, data models, and query models of NoSQL databases, which are competent to handle semi-structured data. The document-oriented NoSQL database MongoDB is compared with the relational database MySQL by evaluating query response time, presented as a case study on a news dataset. News items are collected from various news channels in the form of RSS feeds, which generate data in varying formats and thus essentially exhibit the property of being semi-structured. Handling RSS feeds with a relational database requires defining a schema and preprocessing the feeds; data generated by heterogeneous sources, by contrast, can be handled efficiently by NoSQL without any preprocessing. The comparison of the NoSQL database MongoDB with the relational database MySQL shows that NoSQL databases are better than relational databases for semi-structured data, both in fabricating the structure of the database and in query response time.
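The schema-flexibility contrast can be sketched without either DBMS: `sqlite3` plays the relational side, and a JSON column imitates a MongoDB-style document collection. Both the feed items and the table layouts are invented for illustration.

```python
import json
import sqlite3

# Two RSS items with different shapes: one has geo data, the other tags.
feed_items = [
    {"title": "Flood update", "channel": "BBC", "geo": {"lat": 51.5, "lon": 0.1}},
    {"title": "Match report", "channel": "ESPN", "tags": ["cricket", "final"]},
]

conn = sqlite3.connect(":memory:")

# Relational route: only fields the schema anticipated survive preprocessing.
conn.execute("CREATE TABLE news (title TEXT, channel TEXT)")
conn.executemany("INSERT INTO news VALUES (?, ?)",
                 [(i["title"], i["channel"]) for i in feed_items])

# Document route: each item is stored whole, varying structure and all.
conn.execute("CREATE TABLE docs (body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)",
                 [(json.dumps(i),) for i in feed_items])

docs = [json.loads(b) for (b,)
        in conn.execute("SELECT body FROM docs ORDER BY rowid")]
```

The relational table silently dropped `geo` and `tags` because the schema was fixed up front, while the document route preserves whatever shape each feed item arrived in, which is the property the paper exploits for RSS data.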
- Dissertation
- 10.22215/etd/2023-15426
- Apr 3, 2023
Research on the effective use of Machine Learning (ML) and Natural Language Processing (NLP) techniques is taken up to mitigate the problem of extracting information from the huge volumes of unstructured data available on the Internet without losing valuable information. Constructing a Knowledge Graph (KG) is one such application for querying and extracting unstructured data. The data is passed through a coreference resolution module using Neuralcoref, a named entity linking module using the Wikifier API, and a relationship extraction module using two models, OpenNRE and REBEL, and the results are stored as a KG in Neo4j with the corresponding entities and relationships. Experiments were conducted on an unstructured text dataset (the BBC news dataset) to analyze the results obtained from the pipeline. The results of the relationship extraction stage were analyzed for evaluation purposes and achieved 61.4% and 87% accuracy through the OpenNRE and REBEL models, respectively.
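The storage-and-query end of such a pipeline can be sketched with an in-memory adjacency map standing in for Neo4j. The triples below are invented examples of what a relationship-extraction stage might emit, not actual pipeline output.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples from relation extraction.
triples = [
    ("BBC", "headquartered_in", "London"),
    ("BBC", "founded_in", "1922"),
    ("London", "capital_of", "United Kingdom"),
]

graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))    # one directed edge per relationship

def query(entity, relation):
    """Return objects linked to `entity` by `relation`."""
    return [o for r, o in graph[entity] if r == relation]

answer = query("BBC", "headquartered_in")
```

In the dissertation's setup the same triples would be written to Neo4j and queried with Cypher; the adjacency map only shows why the triple representation makes unstructured text queryable at all.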