Distributed File System Research Articles

Introduction & BackgroundRADAR-Pipeline is an open-source Python framework designed to simplify and enhance mobile health data analysis. It has been designed to efficiently read and process the large amount of data generated through the RADAR-Base platform. RADAR-base is a scalable, real-time streaming and analytics open-source platform to facilitate research access and customisation requirements. Studies using the Radar-base platform have collected fine-grained longitudinal data from wearables and phones. The data can potentially create multitudes of digital biomarkers, which can be used to inform us greatly about the disease condition. Due to the sheer size of the data, it can be difficult for researchers to read and process those data -- a common task is identifying useful features and common data processing/analysis steps previously used by the community. Up to now, these have been hand-crafted by individual data scientists, often lacking the capability to be easily reused by the community without author-specific knowledge. Furthermore, generating variables based on already established research on a larger scale can be challenging and could hinder replication. Hence, we have designed RADAR-Pipeline to help researchers overcome these challenges. It empowers them to create and share their data analysis and visualisation pipelines, fostering collaboration and knowledge sharing within the research community. Objectives & ApproachThe primary objective of RADAR-Pipeline is to offer researchers a user-friendly and powerful platform to develop and share their research. Researchers can build reusable analysis and visualisation pipelines to ensure consistent and reliable results. It simplifies big data analysis by leveraging Apache Spark to handle large and complex mobile health datasets efficiently. Researchers can also save time and effort by reusing and extending existing pipelines built by others. Finally, the RADAR-Pipeline promotes collaboration and recognition by allowing researchers to share their work through the RADAR-base Analytics Catalogue, making their pipelines citable and accessible to the wider research community. Whilst Radar-pipeline has been designed to read data from Radar-base, it can also be used to read data from any dataset which uses Hadoop Distributed File System (HDFS) file system namespace. Relevance to Digital FootprintsMobile health data is rich and valuable for understanding human behaviour and health. RADAR-Pipeline addresses the challenges associated with analysing large and complex mobile health datasets, enabling researchers to extract valuable insights that can be used to (1) Improve public health: By enabling efficient analysis of large-scale mobile health data, RADAR-Pipeline can contribute to research efforts aimed at improving population health outcomes and developing effective interventions; (2) Personalised healthcare: By facilitating the extraction of individual-level features from mobile health data, RADAR-Pipeline can seamlessly be integrated with Kafka data streams and machine learning pipelines to process the data in real-time, which can then be utilised to create more effective and targeted real-time interventions. (3) Promote reproducible research: The framework's emphasis on transparency and reproducibility in research aligns with the conference's focus on the responsible use of digital mobile health data. Conclusions & ImplicationsRADAR-Pipeline is a valuable tool for researchers, offering them the means to harness the potential of mobile health data. By adopting this framework, researchers can achieve efficient and scalable data analysis, thereby streamlining the extracting insights from digital footprints. This efficiency enables researchers to delve deeper into the data and uncover valuable patterns and trends. Furthermore, RADAR-Pipeline promotes collaboration and knowledge sharing within the research community. By providing a standardised framework for data analysis, RADAR-Pipeline facilitates collaboration among researchers, leading to the sharing of best practices and the dissemination of knowledge.

Read full abstract

Anomaly Detection (AD) is an important area to reliably detect malicious behavior and attacks on computer systems. Log data is a rich source of information about systems and thus provides a suitable input for AD. With the sheer amount of log data available today, for years Machine Learning (ML) and more recently Deep Learning (DL) have been applied to create models for AD. Especially when processing complex log data, DL has shown some promising results in recent research to spot anomalies. It is necessary to group these log lines into log-event sequences, to detect anomalous patterns that span over multiple log lines. This work uses a centralized approach using a Long Short-Term Memory (LSTM) model for AD as its basis which is one of the most important approaches to represent long-range temporal dependencies in log-event sequences of arbitrary length. Therefore, we use past information to predict whether future events are normal or anomalous. For the LSTM model we adapt a state of the art open source implementation called LogDeep. For the evaluation, we use a Hadoop Distributed File System (HDFS) data set, which is well studied in current research. In this paper we show that without padding, which is a commonly used preprocessing step that strongly influences the AD process and artificially improves detection results and thus accuracy in lab testing, it is not possible to achieve the same high quality of results shown in literature. With the large quantity of log data, issues arise with the transfer of log data to a central entity where model computation can be done. Federated Learning (FL) tries to overcome this problem, by learning local models simultaneously on edge devices and overcome biases due to a lack of heterogeneity in training data through exchange of model parameters and finally arrive at a converging global model. Processing log data locally takes privacy and legal concerns into account, which could improve coordination and collaboration between researchers, cyber security companies, etc., in the future. Currently, there are only few scientific publications on log-based AD which use FL. Implementing FL gives the advantage of converging models even if the log data are heterogeneously distributed among participants as our results show. Furthermore, by varying individual LSTM model parameters, the results can be greatly improved. Further scientific research will be necessary to optimize FL approaches.

Read full abstract

Distributed File System Research Articles

Related Topics

Articles published on Distributed File System

An archive‐based method for efficiently handling small file problems in HDFS

KGFabric: A Scalable Knowledge Graph Warehouse for Enterprise Data Interconnection

IoT Forensic Cyber Activities Detection and Prevention with Automated Machine Learning Model

RADAR-Pipeline: Scalable Feature Generation for Mobile Health Data

An Efficient Approach to Extract and Store Big Semantic Web Data Using Hadoop and Apache Spark GraphX

Enhancing Hadoop distributed storage efficiency using multi-agent systems

Data Lake Conceptualized Web Platform for Food Research Data Collection

Fractional social optimization-based migration and replica management algorithm for load balancing in distributed file system for cloud computing

Aguhyper: a hyperledger-based electronic health record management framework

Kafka‐Shield: Kafka Streams‐based distributed detection scheme for IoT traffic‐based DDoS attacks

Tele-Trafficking of Virtual Data Storage Obtained from Smart Grid by Replicated Gluster in Syntose Environment

Survey on MapReduce Scheduler Algorithms in Hadoop Framework

Anomaly detection in log-event sequences: A federated deep learning approach and open challenges

A distributed data processing scheme based on Hadoop for synchrotron radiation experiments.

KS-SDN-DDoS: A Kafka streams-based real-time DDoS attack classification approach for SDN environment

Distributed File Systems for Collaborative Data Access and Scalability

Distributed matrix computing system for big data

DATA FINDING, SHARING AND DUPLICATION REMOVAL IN THE CLOUD

Design of storage benchmark kit framework for supporting the file storage retrieval

Strengthening data integrity in academic document recording with blockchain and InterPlanetary file system

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Distributed File System Research Articles

Related Topics

Articles published on Distributed File System

An archive‐based method for efficiently handling small file problems in HDFS

KGFabric: A Scalable Knowledge Graph Warehouse for Enterprise Data Interconnection

IoT Forensic Cyber Activities Detection and Prevention with Automated Machine Learning Model

RADAR-Pipeline: Scalable Feature Generation for Mobile Health Data

An Efficient Approach to Extract and Store Big Semantic Web Data Using Hadoop and Apache Spark GraphX

Enhancing Hadoop distributed storage efficiency using multi-agent systems

Data Lake Conceptualized Web Platform for Food Research Data Collection

Fractional social optimization-based migration and replica management algorithm for load balancing in distributed file system for cloud computing

Aguhyper: a hyperledger-based electronic health record management framework

Kafka‐Shield: Kafka Streams‐based distributed detection scheme for IoT traffic‐based DDoS attacks

Tele-Trafficking of Virtual Data Storage Obtained from Smart Grid by Replicated Gluster in Syntose Environment

Survey on MapReduce Scheduler Algorithms in Hadoop Framework

Anomaly detection in log-event sequences: A federated deep learning approach and open challenges

A distributed data processing scheme based on Hadoop for synchrotron radiation experiments.

KS-SDN-DDoS: A Kafka streams-based real-time DDoS attack classification approach for SDN environment

Distributed File Systems for Collaborative Data Access and Scalability

Distributed matrix computing system for big data

DATA FINDING, SHARING AND DUPLICATION REMOVAL IN THE CLOUD

Design of storage benchmark kit framework for supporting the file storage retrieval

Strengthening data integrity in academic document recording with blockchain and InterPlanetary file system