Data Pipeline Research Articles

Context: Many engineering organizations are reimplementing and extending deep neural networks from the research community. We describe this process as deep learning model reengineering. Deep learning model reengineering (reusing, replicating, adapting, and enhancing state-of-the-art deep learning approaches) is challenging for reasons including under-documented reference models, changing requirements, and the cost of implementation and testing.

Objective: Prior work has characterized the challenges of deep learning model development, but as yet we know little about the deep learning model reengineering process and its common challenges. Prior work has examined DL systems from a “product” view, examining defects from projects regardless of the engineers’ purpose. Our study takes a “process” view, focusing on engineers specifically engaged in reengineering.

Method: Our goal is to understand the characteristics and challenges of deep learning model reengineering. We conducted a mixed-methods case study of this phenomenon, focusing on the context of computer vision. Our results draw from two data sources: defects reported in open-source reengineering projects, and interviews conducted with practitioners and the leaders of a reengineering team. From the defect data source, we analyzed 348 defects from 27 open-source deep learning projects. Meanwhile, our reengineering team replicated 7 deep learning models over two years; we interviewed 2 open-source contributors, 4 practitioners, and 6 reengineering team leaders to understand their experiences.

Results: Our results describe how deep learning-based computer vision techniques are reengineered, quantitatively analyze the distribution of defects in this process, and qualitatively discuss challenges and practices. We found that most defects (58%) are reported by re-users, and that 68% of reproducibility-related defects are discovered during training. Our analysis shows that most environment defects (88%) are interface defects, and that nearly half of environment defects (46%) are caused by API defects. We found that training defects have diverse symptoms and root causes. We identified four main challenges in the DL reengineering process: model operationalization, performance debugging, portability of DL operations, and customized data pipelines. Integrating our quantitative and qualitative data, we propose a novel reengineering workflow.

Conclusions: Our findings point to several directions: standardizing model reengineering practices, developing validation tools to support model reengineering, providing automated support beyond manual model reengineering, and measuring additional unknown aspects of model reengineering.

Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). A lack of transparency in the data preparation process is a significant obstacle to developing reliable AI systems and can lead to problems with reproducibility, model debugging, bias and fairness, and regulatory compliance. We introduce a formal data preparation pipeline specification, with a focus on traceability, to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications. We propose a declarative language to define the extraction of AI-ready datasets from health data adhering to a common data model, particularly data conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We use FHIR profiling to develop a common data model tailored to an AI use case, which enables the explicit declaration of the needed information such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health record data, represented as irregularly sampled time series, to a flat structure by defining a target population, feature groups, and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations.

We implement a scalable, high-performance feature repository to execute the data preparation pipeline definitions. This software not only provides reliable, fault-tolerant distributed processing to produce AI-ready datasets together with their metadata and associated statistics, but also serves as a pluggable component of a decision support application based on a trained AI model, automatically preparing the feature values of individual entities during online prediction. We deployed and tested the proposed methodology and its implementation in three different research projects. We present the developed FHIR profiles as a common data model, along with the feature group definitions and feature definitions within a data preparation pipeline, for training an AI model for "predicting complications after cardiac surgeries". Implementation across several pilot use cases demonstrated that the framework is broad and flexible enough to define a diverse array of features, each tailored to specific temporal and contextual criteria.
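The flattening step described in this abstract (target population, feature definitions, final dataset) can be illustrated with a minimal sketch. This is not the declarative language or the feature repository the authors propose; the names `Observation`, `Feature`, and `build_dataset`, the observation codes, and the sample values are all hypothetical, and the sketch assumes each patient has a single reference time (for example, the start of surgery) before which observations are aggregated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Observation:
    """One irregularly sampled measurement (hypothetical, FHIR-like shape)."""
    patient_id: str
    code: str
    time: datetime
    value: float


@dataclass(frozen=True)
class Feature:
    """A feature: aggregate one observation code over a look-back window."""
    name: str
    code: str
    window: timedelta
    aggregate: Callable[[List[float]], float]


def build_dataset(
    population: Dict[str, datetime],
    observations: List[Observation],
    features: List[Feature],
) -> List[Dict[str, Optional[float]]]:
    """Return one flat row per patient in the target population.

    `population` maps patient id to a reference time; only observations in
    [reference - window, reference) contribute to a feature. A feature with
    no matching observations is recorded as None (missing).
    """
    rows = []
    for pid, ref_time in population.items():
        row = {"patient_id": pid}
        for f in features:
            values = [
                o.value
                for o in observations
                if o.patient_id == pid
                and o.code == f.code
                and ref_time - f.window <= o.time < ref_time
            ]
            row[f.name] = f.aggregate(values) if values else None
        rows.append(row)
    return rows


# Hypothetical sample data: two patients with unevenly spaced measurements.
population = {
    "p1": datetime(2024, 1, 10, 8, 0),
    "p2": datetime(2024, 1, 11, 9, 0),
}
observations = [
    Observation("p1", "creatinine", datetime(2024, 1, 9, 8, 0), 1.0),
    Observation("p1", "creatinine", datetime(2024, 1, 10, 6, 0), 1.5),
    Observation("p1", "creatinine", datetime(2024, 1, 10, 10, 0), 2.0),  # after reference time
    Observation("p1", "heart_rate", datetime(2024, 1, 9, 20, 0), 120.0),  # outside 6 h window
    Observation("p1", "heart_rate", datetime(2024, 1, 10, 7, 30), 80.0),
    Observation("p2", "heart_rate", datetime(2024, 1, 11, 8, 0), 70.0),
    Observation("p2", "heart_rate", datetime(2024, 1, 11, 8, 30), 90.0),
]
features = [
    Feature("creatinine_mean_48h", "creatinine", timedelta(hours=48), mean),
    Feature("heart_rate_max_6h", "heart_rate", timedelta(hours=6), max),
]

rows = build_dataset(population, observations, features)
for row in rows:
    print(row)
```

Keeping the feature definitions as plain data, separate from the code that evaluates them, is what makes each column of the resulting dataset traceable back to an explicit code, window, and aggregation; a declarative specification of the kind the abstract describes would presumably serve the same purpose at larger scale.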

Articles published on Data Pipeline

Challenges and practices of deep learning model reengineering: A case study on computer vision

An image processing pipeline for electron cryo-tomography in RELION-5.

A Comprehensive Online Analytical System Coupled with Standardized Data Analysis for the Electrochemical Reduction of CO2

Common sleep data pipeline for combined data sets.

Strategic Data Pipeline Design: Enhancing Operational Efficiency from Oracle to Single Store using Airflow S3 Data Pipelines

Towards proactive corrosion management: A predictive modeling approach in pipeline industrial applications

Embedded FPGA developments in 130 nm and 28 nm CMOS for machine learning in particle detector readout

A scalable and transparent data pipeline for AI-enabled health data ecosystems.

Distributed edge analytics in edge‐fog‐cloud continuum

Optimizing Data Pipeline Efficiency with Machine Learning Techniques

SmartSPIM Pipeline: A Scalable Cloud-Based Image Processing Pipeline for Light-sheet Microscopy Data

A Nursing Clinical Care Services Platform to Support Leadership Decision Making.

L-PBF High-Throughput Data Pipeline Approach for Multi-modal Integration

Enhancing transparency in public procurement: A data-driven analytics approach

BioPipeline Creator—a user-friendly Java-based GUI for managing and customizing biological data pipelines

High-Resolution Mass Spectrometry for Human Exposomics: Expanding Chemical Space Coverage.

Consensus for Operating Room Multimodal Data Management: Identifying Research Priorities for Data-Driven Surgery.

Ontology‐Based Data Acquisition, Refinement, and Utilization in the Development of a Multilayer Ferrite Inductor

Evaluating the Reliability of a Social Presence Composite Construct for Online Computer Science Degree Programmes

PHANGS-JWST: Data-processing Pipeline and First Full Public Data Release
