Analyzing related raw data files through dataflows
Summary: Computer simulations may ingest and generate large numbers of raw data files. Most of these files follow a de facto standard format established by the application domain, for example, the Flexible Image Transport System (FITS) for astronomy. Although these formats are supported by a variety of programming languages, libraries, and programs, analyzing thousands or millions of files requires developing specific programs. Database management systems (DBMS) are not suited for this, because they require loading and structuring the raw data, which becomes heavy at large scale. Systems like NoDB, RAW, and FastBit have been proposed to index and query raw data files without the overhead of a DBMS. However, these solutions focus on analyzing one single large file rather than several related files. When related files are produced and required for analysis, the relationships among elements within file contents must be managed manually, with specific programs to access the raw data; such data management may be time-consuming and error-prone. When computer simulations are managed by a scientific workflow management system (SWfMS), they can take advantage of provenance data to relate and analyze raw data files produced during workflow execution. However, SWfMSs register provenance at a coarse grain, with limited analysis of elements from raw data files. When the SWfMS is dataflow-aware, it can register provenance data and the relationships among elements of raw data files together in a database, which is useful for accessing the contents of a large number of files. In this paper, we propose a dataflow approach for analyzing element data from several related raw data files. Our approach is complementary to existing single-file raw data analysis approaches. We use the Montage workflow from astronomy and a workflow from the oil and gas domain as data-intensive case studies. Our experimental results for the Montage workflow explore different types of raw dataflows, such as showing all linear transformations involved in projection simulation programs, considering specific mosaic elements from input repositories. The cost of raw data extraction is approximately 3.7% of the total application execution time. Copyright © 2015 John Wiley & Sons, Ltd.
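As an illustrative aside (not taken from the paper): once a dataflow-aware SWfMS has registered file relationships and extracted elements in a database, cross-file analyses reduce to queries. The sketch below assumes a hypothetical SQLite provenance schema (tables `task`, `raw_file`, `element`); the table and column names are invented for illustration, while the CD keywords are the standard FITS WCS linear-transformation terms referenced by the Montage use case.

```python
# Hedged sketch: querying element data across related raw data files through a
# provenance database. The schema (task, raw_file, element) is hypothetical;
# only the FITS CD matrix keywords (linear-transformation terms) are standard.
import sqlite3

conn = sqlite3.connect("provenance.db")  # assumed dataflow-aware SWfMS database

query = """
SELECT t.task_id,
       fin.path  AS input_fits,
       fout.path AS output_fits,
       e.name    AS element,
       e.value   AS value
FROM task t
JOIN raw_file fin  ON fin.file_id  = t.input_file_id
JOIN raw_file fout ON fout.file_id = t.output_file_id
JOIN element e     ON e.file_id    = fout.file_id
WHERE t.program = 'mProjectPP'                         -- Montage reprojection
  AND e.name IN ('CD1_1', 'CD1_2', 'CD2_1', 'CD2_2')   -- linear transform terms
ORDER BY t.task_id
"""
for row in conn.execute(query):
    print(row)
conn.close()
```

A query of this shape replaces the per-file parsing programs the abstract mentions: the relationships among files are resolved by joins rather than by hand-written file traversal.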
- Research Article
21
- 10.1016/j.future.2017.01.016
- Jan 11, 2017
- Future Generation Computer Systems
Raw data queries during data-intensive parallel workflow execution
- Conference Article
5
- 10.1109/sbac-padw.2014.32
- Oct 1, 2014
Scientific applications generate raw data files at very large scale. Most of these files follow a standard format established by the application domain, such as HDF5, NetCDF, and FITS. These formats are supported by a variety of programming languages, libraries, and programs, but analyzing files at this scale requires writing specific programs. Generic data analysis systems like database management systems (DBMS) are not suited because of the cost of data loading and data transformation at large scale. Recently there have been several proposals for indexing and querying raw data files without the overhead of using a DBMS, such as NoDB, RAW, and FastBit. Their goal is to offer query support over a raw data file after a scientific program has generated it. However, these solutions focus on the analysis of one single large file. When a large number of files are related and all required for the evaluation of one scientific hypothesis, the relationships must be managed manually or by writing specific programs. The proposed approach takes advantage of existing provenance data support from Scientific Workflow Management Systems (SWfMS). When scientific applications are managed by an SWfMS, the data is registered in the provenance database at runtime, so this provenance data may act as a description of these files. When the SWfMS is dataflow-aware, it registers domain data all in the same database. The resulting database becomes an important access method to the large number of files generated by the scientific workflow execution, complementing single-file raw data analysis support. In this work, we present our dataflow approach for analyzing data from several raw data files and evaluate it with the Montage application from the astronomy domain.
- Research Article
- 10.5075/epfl-thesis-6644
- Jan 1, 2015
Nowadays, business and scientific applications accumulate data at an increasing pace. This growth of information has already started to outgrow the capabilities of database management systems (DBMS). In a typical DBMS usage scenario, the user must define a schema, load the data, and tune the system for an expected workload before submitting any queries. Copying data into a database is a significant investment in terms of time and resources, and in many cases unnecessary or even no longer feasible in practice due to the explosive data growth. Additionally, the way a DBMS stores and organizes data during data loading defines how data will be accessed for a given workload and thus the maximum performance. Selecting the underlying data layout (row-store or column-store) is a critical first tuning decision which cannot be changed later. Nevertheless, query workloads today are not static; they evolve as queries change. Hence, static design decisions can be suboptimal. In this thesis, we advocate in situ query processing as the principal way to manage data in a database. We reconsider the data loading phase and redesign traditional query processing architectures to work efficiently over raw data files, addressing the heavy initialization cost that comes with data loading. We present adaptive data loading as an alternative to traditional full a priori data loading. We explore the potential of in situ query processing in the context of current DBMS architectures. We identify performance bottlenecks specific to in situ processing and introduce an adaptive indexing mechanism (the positional map) that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure and techniques for collecting statistics over raw data files. Moreover, we design a flexible query engine that is not built around a single storage layout but can exploit different storage layouts and data execution strategies in a single engine. It decides during query processing which design fits the input queries and properly adapts the underlying data storage. By applying code generation techniques, we dynamically generate access operators tailored for specific classes of queries. This thesis revises the traditional paradigm of loading, tuning, and then querying by using in situ query processing as the principal way to minimize data-to-query time. We show that raw data files should not be considered "outside" the DBMS and that full data loading should not be a requirement to exploit database technology. On the contrary, proper techniques specifically tailored to overcome the limitations of accessing raw data files can eliminate the data loading overhead, thereby making raw data files first-class citizens, fully integrated with the query engine. The proposed roadmap can provide guidance on how to convert any traditional DBMS into an efficient in situ query engine.
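To make the positional-map idea concrete, here is a minimal sketch, assuming a plain comma-separated raw file: the first point query scans the file once and records the byte offset of every field, so subsequent queries seek directly to the requested value instead of re-tokenizing lines. The mechanism described in the thesis is adaptive and far more sophisticated; this only illustrates the principle.

```python
# Minimal sketch of a positional map for in situ raw-file access (illustrative
# simplification, not the thesis implementation): cache the byte offset of
# each CSV field during a first scan so later point queries can seek() there.
from typing import Dict, Tuple

class PositionalMap:
    def __init__(self, path: str):
        self.path = path
        self.offsets: Dict[Tuple[int, int], int] = {}  # (row, col) -> offset

    def get_field(self, row: int, col: int) -> str:
        with open(self.path, "rb") as f:
            if not self.offsets:                  # first access: build the map
                pos = 0
                for r, line in enumerate(f):      # single sequential scan
                    off = pos
                    for c, field in enumerate(line.rstrip(b"\n").split(b",")):
                        self.offsets[(r, c)] = off
                        off += len(field) + 1     # +1 for the delimiter
                    pos += len(line)
            f.seek(self.offsets[(row, col)])      # later accesses: direct seek
            return f.readline().split(b",")[0].decode().rstrip()

# Usage: a long-lived PositionalMap("data.csv") parses the file once and then
# answers point queries such as .get_field(10, 2) with a single seek each.
```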
- Conference Article
30
- 10.1145/1646468.1646470
- Nov 16, 2009
One of the main advantages of using a scientific workflow management system (SWfMS) to orchestrate data flows among scientific activities is to control and register the whole workflow execution. Executing activities within a workflow on high performance computing (HPC) resources presents challenges for SWfMS execution control. Current solutions leave the scheduling to the HPC queue system. Since the workflow execution engine does not run on the remote clusters, the SWfMS is not aware of the parallel strategy of the workflow execution. Consequently, remote execution control and provenance registration of the parallel activities is very limited from the SWfMS side. This work presents a set of components to be included in the workflow specification of any SWfMS to control parallelization of activities as many-task computing (MTC). In addition, these components can gather provenance data during remote workflow execution. Through these MTC components, the parallelization strategy can be registered and reused, and provenance data can be uniformly queried. We have evaluated our approach by performing parameter sweep parallelization in solving the incompressible 3D Navier-Stokes equations. Experimental results show performance gains with the additional benefit of distributed provenance support.
- Research Article
19
- 10.1002/mp.12128
- Mar 14, 2017
- Medical Physics
Lung cancer screening with low-dose CT has recently been approved for reimbursement, heralding the arrival of such screening services worldwide. Computer-aided detection (CAD) tools offer the potential to assist radiologists in detecting nodules in these screening exams. In lung screening, as in all CT exams, there is interest in further reducing radiation dose. However, the effects of continued dose reduction on CAD performance are not fully understood. In this work, we investigated the effect of reducing radiation dose on CAD lung nodule detection performance in a screening population. The raw projection data files were collected from 481 patients who underwent low-dose screening CT exams at our institution as part of the National Lung Screening Trial (NLST). All scans were performed on a multidetector scanner (Sensation 64, Siemens Healthcare, Forchheim Germany) according to the NLST protocol, which called for a fixed tube current scan of 25 effective mAs for standard-sized patients and 40 effective mAs for larger patients. The raw projection data were input to a reduced-dose simulation software to create simulated reduced-dose scans corresponding to 50% and 25% of the original protocols. All raw data files were reconstructed at the scanner with 1 mm slice thickness and B50 kernel. The lungs were segmented semi-automatically, and all images and segmentations were input to an in-house CAD algorithm trained on higher dose scans (75-300 mAs). CAD findings were compared to a reference standard generated by an experienced reader. Nodule- and patient-level sensitivities were calculated along with false positives per scan, all of which were evaluated in terms of the relative change with respect to dose. Nodules were subdivided based on size and solidity into categories analogous to the LungRADS assessment categories, and sub-analyses were performed. From the 481 patients in this study, 82 had at least one nodule (prevalence of 17%) and 399 did not (83%). A total of 118 nodules were identified. Twenty-seven nodules (23%) corresponded to LungRADS category 4 based on size and composition, while 18 (15%) corresponded to LungRADS category 3 and 73 (61%) corresponded to LungRADS category 2. For solid nodules ≥8 mm, patient-level median sensitivities were 100% at all three dose levels, and mean sensitivities were 72%, 63%, and 63% at original, 50%, and 25% dose, respectively. Overall mean patient-level sensitivities for nodules ranging from 3 to 45 mm were 38%, 37%, and 38% at original, 50%, and 25% dose due to the prevalence of smaller nodules and nonsolid nodules in our reference standard. The mean false-positive rates were 3, 5, and 13 per case. CAD sensitivity decreased very slightly for larger nodules as dose was reduced, indicating that reducing the dose to 50% of original levels may be investigated further for use in CT screening. However, the effect of dose was small relative to the effect of the nodule size and solidity characteristics. The number of false positives per scan increased substantially at 25% dose, illustrating the importance of tuning CAD algorithms to very challenging, high-noise screening exams.
- Book Chapter
7
- 10.1007/978-3-642-17819-1_28
- Jan 1, 2010
Scientific experiments present several advantages when modeled at high abstraction levels, independent from Scientific Workflow Management System (SWfMS) specification languages. For example, the scientist can define the scientific hypothesis in terms of algorithms and methods. Then, this high-level experiment can be mapped into different scientific workflow instances. These instances can be executed by an SWfMS and take advantage of its provenance records. However, each workflow execution is often treated by the SWfMS as an independent instance. There are no tools that allow modeling the conceptual experiment and linking it to the diverse workflow execution instances. This work presents GExpLine, a tool for supporting experiment composition through provenance. In an analogy to software development, it can be seen as a CASE tool, while an SWfMS can be seen as an IDE. It provides a conceptual representation of the scientific experiment and automatically associates workflow executions with the concept of the experiment. By using prospective provenance from the experiment, GExpLine generates corresponding workflows that can be executed by an SWfMS. This paper also presents a real experiment use case that reinforces the importance of GExpLine and its prospective provenance support.
- Research Article
13
- 10.1007/s10586-019-02920-6
- Mar 9, 2019
- Cluster Computing
Scientific workflows are abstractions composed of activities, data, and dependencies that model a computer simulation and are managed by complex engines named scientific workflow management systems (SWfMS). Many workflows demand substantial computational resources, since their executions may involve a number of different programs processing a massive volume of data. Thus, the use of high-performance computing (HPC) and data-intensive scalable computing environments, allied to parallelization techniques, provides the necessary support for the execution of such workflows. Clouds are environments that already offer HPC capabilities, and workflows can exploit them. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment. Thus, existing SWfMS must be fault-tolerant. There are several types of fault tolerance techniques used in SWfMS, such as checkpoint/restart, re-execution, and over-provisioning, but it is far from trivial to choose a fault tolerance technique that will not jeopardize the parallel execution. The major problem is that the suitable technique may be different for each workflow, activity, or activation, since the programs associated with activities may present different behaviors. This article analyzes several fault-tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user's workflow activities and activations using machine learning techniques and provenance data, aiming at improving resiliency.
- Conference Article
1
- 10.1109/naecon.1995.522026
- May 22, 1995
Joint Modeling and Simulation System (J-MASS) was specified to possess an open-systems-based architecture to support Department of Defense modeling and simulation needs well into the future. Its open systems architectural design is based on a backplane-and-agents concept. One of the most important agents is the Modeling Library, which provides a repository for user-developed model components, configuration data, simulations, scenario files, output data, and postprocessing results. The Modeling Library will also store modeling and simulation tools, related data files, and J-MASS system source code. What technology will enable the J-MASS Modeling Library to assist users in organizing their data and the program office in establishing a Test Process Archive for its systems? During the past few years, several studies have been performed to review the fast-changing area of data management technology. These studies have looked at the mature technology of relational database management systems (RDBMSs) and the emerging technologies of object-oriented database management systems (OODBMSs) and object servers. This paper will provide an overview of this work. The paper will first identify the J-MASS requirements and then proceed with a review of the various technologies and the evaluations that were performed against representative implementations of the emerging technologies. The paper will conclude with the technology recommendation for J-MASS data management.
- Conference Article
- 10.1109/sisy.2016.7601477
- Aug 1, 2016
With the increasing capacity and power of distributed computing infrastructures, in silico experiments have gained widespread popularity. Different scientific communities (physics, earthquake science, biology, etc.) have developed their own Scientific Workflow Management Systems (SWfMS) to provide dynamic execution to scientists. These SWfMSs differ from each other due to divergent requirements. In spite of their diversity, however, they agree on the strongest need that has not yet been completely fulfilled: runtime user steering and adaptive dynamic execution. Additionally, when provenance data is collected during execution, provenance-based steering also emerges as a big challenge. To support scientists with a special interaction mechanism during runtime, we have introduced so-called iPoints, special intervention points where the scientist or the system can take over control and manipulate workflow execution based on provenance and intermediary data. In our current work we specified these iPoints in the IWIR language, which was designed to provide interoperability among four existing well-known SWfMSs within the framework of the SHIWA project.
- Research Article
3
- 10.1145/3457145
- May 27, 2021
- Proceedings of the ACM on Human-Computer Interaction
To process a large amount of data sequentially and systematically, proper management of workflow components (i.e., modules, data, configurations, associations among ports and links) in a Scientific Workflow Management System (SWfMS) is indispensable. Managing data with provenance in an SWfMS to support reusability of workflows, modules, and data is not a simple task. Handling such components is even more burdensome for frequently assembled and executed complex workflows that investigate large datasets with different technologies (i.e., various learning algorithms or models). A great many studies propose techniques and technologies for managing and recommending services in an SWfMS, but only very few consider the management of data in an SWfMS for efficient storage and for facilitating workflow executions. Furthermore, no study has inquired into the effectiveness and efficiency of such data management in an SWfMS from a user perspective. In this paper, we present and evaluate a GUI version of such a novel approach to intermediate data management with two use cases (plant phenotyping and bioinformatics). The technique, which we call GUI-RISPTS (Recommending Intermediate States from Pipelines Considering Tool-States), can facilitate executions of workflows with processed data (i.e., intermediate outcomes of modules in a workflow) and can thus reduce the computational time of some modules in an SWfMS. We integrated GUI-RISPTS with an existing workflow management system called SciWorCS. In SciWorCS, we present an interface that users use for selecting the recommendation of intermediate states (i.e., modules' outcomes). We investigated GUI-RISPTS's effectiveness from users' perspectives along with measuring its overhead in terms of storage and efficiency in workflow execution.
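As a rough illustration of the underlying idea (intermediate-state reuse in general, not GUI-RISPTS itself): a module's output can be cached under a hash of its inputs and parameters, so a re-assembled workflow skips recomputing unchanged modules. All names below are invented for the sketch, and inputs are assumed JSON-serializable.

```python
# Hedged sketch of intermediate-data reuse in a workflow system: cache each
# module's outcome under a hash of its inputs/parameters and reuse it when the
# same module is re-executed. Not GUI-RISPTS itself; names are illustrative.
import hashlib
import json
import os
import pickle

CACHE_DIR = ".intermediate_states"  # assumed location for stored states

def cached_run(module, inputs: dict, params: dict):
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(json.dumps(
        {"module": module.__name__, "inputs": inputs, "params": params},
        sort_keys=True).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):              # intermediate state already stored
        with open(path, "rb") as f:
            return pickle.load(f)
    result = module(**inputs, **params)   # compute once, then persist
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```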
- Conference Article
5
- 10.1145/3035918.3058743
- May 9, 2017
Ever-growing data collections create the need for brief explorations of the available data, to extract relevant information before decision making becomes necessary. In this context of data exploration, current data analysis solutions struggle to quickly pinpoint useful information in data collections. One major reason is that loading data into a DBMS without knowing which part of it will actually be useful is a major bottleneck. To remove this bottleneck, state-of-the-art approaches perform queries in situ, thus avoiding the loading overhead. In situ query engines, however, are index-oblivious and lack sophisticated techniques to reduce the amount of data to be accessed. Furthermore, applications constantly generate fresh data and update the existing raw data files, whereas state-of-the-art in situ approaches support only append-like workloads. In this demonstration, we showcase the efficiency of adaptive indexing and partitioning techniques for analytical queries in the presence of updates. We demonstrate an online partitioning and indexing tuner for in situ querying that plugs into a query engine and offers support for fast queries over raw data files. We present Alpine, our prototype implementation, which combines the tuner with a query executor incorporating in situ query techniques to provide efficient raw data access. We visually demonstrate how Alpine incrementally and adaptively builds auxiliary data structures and indexes over raw data files, and how it adapts its behavior as a side effect of updates to the raw data files.
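A conceptual sketch of the simplest form this idea can take (not the Alpine implementation): build an index over a queried column lazily, as a side effect of the first query, and discard it when the raw file's modification time changes.

```python
# Conceptual sketch, not the Alpine system: an in situ column index built
# lazily on first access and invalidated whenever the raw file is updated.
import csv
import os
from collections import defaultdict

class AdaptiveColumnIndex:
    def __init__(self, path: str, col: int):
        self.path, self.col = path, col
        self.mtime = None      # file state the index was built against
        self.index = None      # value -> list of matching row numbers

    def rows_matching(self, value: str):
        mtime = os.path.getmtime(self.path)
        if self.index is None or mtime != self.mtime:  # stale/absent: rebuild
            self.index = defaultdict(list)
            with open(self.path, newline="") as f:
                for r, record in enumerate(csv.reader(f)):
                    self.index[record[self.col]].append(r)
            self.mtime = mtime
        return self.index[value]
```

A real system would repair the index incrementally rather than rebuild it, and would index only the value ranges that queries actually touch; the sketch shows only the query-driven, update-aware behavior.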
- Conference Article
54
- 10.1145/2457317.2457365
- Mar 18, 2013
Scientific workflows are commonly used to model and execute large-scale scientific experiments. They represent key resources for scientists and are enacted and managed by Scientific Workflow Management Systems (SWfMS). Each SWfMS has its particular approach to executing workflows and to capturing and managing their provenance data. Due to the large scale of experiments, it may be unviable to analyze provenance data only after the end of the execution: a single experiment may demand weeks to run, even in high performance computing environments. Scientists therefore need to monitor the experiment during its execution, and this can be done through provenance data. Runtime provenance analysis allows scientists to monitor workflow execution and to take actions before the end of it (i.e., workflow steering). This provenance data can also be used to fine-tune the parallel execution of the workflow dynamically. We use the PROV data model as a basic framework for modeling and providing runtime provenance as a database that can be queried even during the execution. This database is agnostic of the SWfMS and workflow engine. We show the benefits of representing and sharing runtime provenance data for improving experiment management as well as the analysis of the scientific data.
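For illustration only (the schema below is invented, not the paper's PROV-based one): with runtime provenance in a queryable database, monitoring becomes a polling query over still-running activities.

```python
# Hedged sketch: monitoring a running workflow by querying a runtime
# provenance database. The SQLite schema (task_execution with activity,
# start_time, end_time columns) is assumed for illustration.
import sqlite3
import time

conn = sqlite3.connect("runtime_prov.db")  # hypothetical provenance store
while True:
    rows = conn.execute("""
        SELECT activity,
               COUNT(*) AS running,
               AVG(strftime('%s','now') - strftime('%s', start_time)) AS avg_s
        FROM task_execution
        WHERE end_time IS NULL              -- activations still executing
        GROUP BY activity
    """).fetchall()
    for activity, running, avg_s in rows:
        print(f"{activity}: {running} running, {avg_s:.0f}s average elapsed")
    if not rows:
        break                               # nothing running: workflow done
    time.sleep(30)                          # poll during execution
conn.close()
```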
- Research Article
7
- 10.1177/019459989411000409
- Apr 1, 1994
- Otolaryngology–Head and Neck Surgery
Interlaboratory variability of rotational chair test results
- Dataset
- 10.21421/d2/hdeuku
- Jun 29, 2020
The VDSA panel dataset (vdsa.icrisat.ac.in) was generated by the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT) in partnership with Indian Council of Agricultural Research (ICAR) institutes and the International Rice Research Institute (IRRI). The VDSA operated over a total period of 40 years, from 1975 to 2015, but with discrete periods of data collection. In the most recent period (2009-2014), the period used for this analysis, data were collected for a larger number of households and with vastly increased survey efforts, focusing on detailed data collection covering production information, GPS-measured plots, and three-weekly household visits to record input and output data for each plot owned/leased by participants. The resultant dataset covers the period between 2009 and 2015, with 1,129 households participating from 30 villages in 9 states of India (vdsa.icrisat.ac.in/vdsa-map/vdsa-location-map.html). Study sites were selected using a stepwise purposive sampling strategy in order to cover the agro-ecological diversity of the region. The current dataset, based on the VDSA raw data, has been compiled to assess the relationship between farm size and agricultural productivity. The STATA program file (.do file) is shared along with the data. This program imports raw VDSA data and, with the necessary processing, develops the variables needed to run the models that study the relationship between agricultural productivity and plot size. The raw data files for the different modules can be downloaded from this dataset or generated from vdsakb.icrisat.ac.in (raw data option, selecting all available Indian states).
- Single Report
- 10.2172/782428
- Jun 23, 1999
A project to improve the Hanford Site's corrosion monitoring strategy was started in 1995. The project is designed to integrate EN-based corrosion monitoring into the site's corrosion monitoring strategy. In order to monitor multiple tanks, a major focus of this project has been to automate the data collection and analysis process. Data collection and analysis from the early EN corrosion monitoring equipment (241-AZ-101 and 241-AN-107) was primarily performed manually by a trained operator skilled in the analysis of EN data. Thousands of raw data files were collected, manually sorted, and stored. Further statistical analysis of these files was performed by manually stripping out data from thousands of raw data files and calculating statistics in a spreadsheet format. Plotting and other graphical display analyses were performed by manually exporting data from the data files or spreadsheet into another plotting or presentation software package. In 1999, an Amulet/PRP system was procured and employed on the 241-AN-102 corrosion monitoring system. A duplicate system was purchased for use on the upcoming 241-AN-105 system. A third system has been procured and will eventually be used to upgrade the 241-AN-107 system. The Amulet software has greatly improved the automation of waste tank EN data analysis. In contrast with previous systems, the Amulet operator no longer has to manually collect, sort, store, and analyze thousands of raw EN data files. Amulet writes all data to a single database. Statistical analysis, uniform corrosion rate, and other derived parameters are automatically calculated in Amulet from the raw data while the raw data are being collected. Other improvements in plotting and presentation make inspection of the data a much quicker and relatively easy task. These and other advances have greatly increased the speed at which EN data can be analyzed, in addition to improving the quality of the final interpretation. The increase in data automation offered by the Amulet software is necessary if multiple tanks are to be instrumented and analyzed at the Hanford Site. Although advances in the automation of data analysis have been great, Hanford EN data analysis still demands a highly trained corrosion expert. Neural networks could de-skill the post-data-collection analysis procedure and broaden the range of users able to understand and interpret corrosion data. Ultimately, the ability to de-skill the data analysis process will make or break the use of EN as a plant monitoring tool on a wide scale.