Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning (DL) training jobs are dominated by model execution times, the most important factor in DLRM training performance is often online data ingestion. In this paper, we study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impact of online ingestion and to identify shortfalls in existing data pipeline optimizers. Our studies lead us to design a new solution for data pipeline optimization, InTuneX. InTuneX is designed for production-scale, multi-node recommender data pipelines. It unifies and tackles the challenges of both intra- and inter-node pipeline optimization. We achieve this with a multi-agent reinforcement learning (RL) design that simultaneously optimizes node assignments at the cluster level and CPU assignments within nodes. Our experiments show that InTuneX can build optimized data pipeline configurations within minutes. We apply InTuneX to our cluster and find that it increases single-node data ingestion throughput by as much as 2.29X versus state-of-the-art optimizers, while improving the cost-efficiency of multi-node pipelines by 15-25%.
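To make the two-level design described above more concrete, the following is a minimal, self-contained Python sketch: one agent picks how many data-loading nodes to use (the inter-node decision), a second agent picks how a node's CPU cores are split across pipeline stages (the intra-node decision), and both learn from a shared, cost-adjusted throughput reward. Everything here is an illustrative assumption rather than InTuneX's actual algorithm or interface: the stage names, the toy throughput and cost models, and the simple epsilon-greedy bandit update standing in for the paper's multi-agent RL method.

```python
import random

random.seed(0)

STAGES = ["read", "decode", "shuffle", "batch"]          # example pipeline stages (assumption)
RATE = {"read": 40.0, "decode": 15.0, "shuffle": 30.0, "batch": 50.0}  # toy samples/sec per core
NODE_CHOICES = [1, 2, 4, 8]                              # candidate node counts
CPU_BUDGET = 16                                          # cores available per node


def simulated_throughput(nodes, alloc):
    """Toy stand-in for measured ingestion throughput: the slowest stage bottlenecks each node."""
    per_node = min(alloc[s] * RATE[s] for s in STAGES)
    return nodes * per_node


def cost_penalty(nodes):
    """Toy cost model: each node adds a fixed cost."""
    return 50.0 * nodes


def cpu_splits(budget, stages):
    """Enumerate candidate CPU splits: each candidate gives one stage a double share of cores."""
    splits = []
    for heavy in stages:
        weights = {s: (2 if s == heavy else 1) for s in stages}
        total = sum(weights.values())
        alloc = {s: max(1, budget * w // total) for s, w in weights.items()}
        splits.append(tuple(sorted(alloc.items())))  # hashable action
    return splits


class EpsilonGreedyAgent:
    """Keeps a running value estimate per discrete action; explores with probability epsilon."""

    def __init__(self, actions, epsilon=0.2):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.values = {a: 0.0 for a in self.actions}
        self.counts = {a: 0 for a in self.actions}

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


cluster_agent = EpsilonGreedyAgent(NODE_CHOICES)                  # inter-node decision
node_agent = EpsilonGreedyAgent(cpu_splits(CPU_BUDGET, STAGES))   # intra-node decision

for step in range(500):
    nodes = cluster_agent.act()
    split = node_agent.act()
    # Both agents share one reward: cost-adjusted ingestion throughput.
    reward = simulated_throughput(nodes, dict(split)) - cost_penalty(nodes)
    cluster_agent.update(nodes, reward)
    node_agent.update(split, reward)

print("chosen node count:", max(cluster_agent.values, key=cluster_agent.values.get))
print("chosen CPU split:", dict(max(node_agent.values, key=node_agent.values.get)))
```

The shared reward is what ties the two levels together in this sketch: the cluster-level choice only pays off when the intra-node CPU split keeps per-node throughput high, which mirrors the abstract's claim that inter- and intra-node optimization must be handled jointly.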