Large-scale Data Analysis Research Articles

A typical hypothesis testing problem in statistical analysis is determining whether a pattern is significantly associated with a specific class label. In big data mining this usually becomes a highly challenging multiple-hypothesis testing problem, because the millions or billions of hypothesis tests in large-scale exploratory data analysis can produce a large number of false positives. The permutation testing-based family-wise error rate (FWER) control method (PFWER) is theoretically effective for multiple hypothesis testing. In practice, however, it faces a serious computational efficiency problem: computing an appropriate FWER false positive control threshold with PFWER takes an extremely long time, and is almost impossible to complete in a reasonable amount of time on medium- or large-scale data. Although some methods for speeding up the threshold calculation have been proposed, most of them are stand-alone, and there is still considerable room for improvement. To address this problem, this paper proposes a distributed PFWER false positive threshold calculation method for large-scale data, whose computational efficiency is significantly higher than that of current approaches. The FP-growth algorithm is first used for pattern mining; the mining process reduces computation on invalid patterns through pruning operations and an index optimization for merging patterns with index transactions. Distributed computing is then introduced on this basis: the constructed FP tree is decomposed into a set of subtrees, each corresponding to a subtask, and all subtrees (subtasks) are distributed to different computing nodes. Each node independently calculates the local significance threshold for its designated subtasks. Finally, all local results are aggregated to compute the FWER false positive control threshold, which is completely consistent with the theoretical result. Experiments on 11 real-world datasets demonstrate that the proposed distributed algorithm significantly improves the computational efficiency of PFWER while preserving its theoretical accuracy.
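The split-then-aggregate idea in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it skips FP-growth mining and the FP-tree decomposition, and it assumes a standard min-p permutation scheme, a one-sided Fisher exact test, and patterns given as tuples of transaction indices. The function names (fisher_p, local_min_pvalues, aggregate_threshold) and the shared-seed convention are hypothetical choices made for illustration.

```python
import random
from math import comb


def fisher_p(a, support, n_pos, n):
    """One-sided Fisher exact p-value P(X >= a), X ~ Hypergeometric(n, n_pos, support)."""
    upper = min(support, n_pos)
    tail = sum(comb(n_pos, k) * comb(n - n_pos, support - k)
               for k in range(a, upper + 1))
    return tail / comb(n, support)


def local_min_pvalues(patterns, labels, n_perm, seed):
    """Work done by one node (assumed): for each label permutation, the
    smallest p-value among the patterns assigned to this subtask."""
    rng = random.Random(seed)  # same seed on every node, so permutations match
    n, n_pos = len(labels), sum(labels)
    perm = list(labels)
    minima = []
    for _ in range(n_perm):
        rng.shuffle(perm)
        minima.append(min(
            fisher_p(sum(perm[t] for t in tids), len(tids), n_pos, n)
            for tids in patterns))
    return minima


def aggregate_threshold(local_minima, alpha):
    """Combine node results: elementwise minimum across nodes, then the
    value at rank floor(alpha * n_perm) of the sorted global minima.
    Assumes 0 <= alpha < 1."""
    global_min = sorted(min(vals) for vals in zip(*local_minima))
    return global_min[int(alpha * len(global_min))]


if __name__ == "__main__":
    # Toy data (made up): 20 transactions, the first 10 labelled positive.
    labels = [1] * 10 + [0] * 10
    patterns = [(0, 1, 2, 3), (5, 6, 12), (10, 11, 12, 13, 14), (0, 10, 15)]
    node_a = local_min_pvalues(patterns[:2], labels, n_perm=200, seed=7)
    node_b = local_min_pvalues(patterns[2:], labels, n_perm=200, seed=7)
    print(aggregate_threshold([node_a, node_b], alpha=0.05))
```

Because the minimum over all patterns equals the minimum of the per-subtask minima, the aggregated threshold in this sketch matches what a single machine would compute over the full pattern set with the same permutations. Patterns with a p-value strictly below the returned value would be reported as significant.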

Since the outset of the COVID-19 pandemic, substantial public attention has focused on the role of seasonality in impacting transmission. Misconceptions have relied on seasonal mediation of respiratory diseases driven solely by environmental variables. However, seasonality is expected to be driven by host social behavior, particularly in highly susceptible populations. A key gap in understanding the role of social behavior in respiratory disease seasonality is our incomplete understanding of the seasonality of indoor human activity. We leverage a novel data stream on human mobility to characterize activity in indoor versus outdoor environments in the United States. We use an observational mobile app-based location dataset encompassing over 5 million locations nationally. We classify locations as primarily indoor (e.g. stores, offices) or outdoor (e.g. playgrounds, farmers markets), disentangling location-specific visits into indoor and outdoor, to arrive at a fine-scale measure of indoor to outdoor human activity across time and space. We find the proportion of indoor to outdoor activity during a baseline year is seasonal, peaking in winter months. The measure displays a latitudinal gradient with stronger seasonality at northern latitudes and an additional summer peak in southern latitudes. We statistically fit this baseline indoor-outdoor activity measure to inform the incorporation of this complex empirical pattern into infectious disease dynamic models. However, we find that the disruption of the COVID-19 pandemic caused these patterns to shift significantly from baseline and the empirical patterns are necessary to predict spatiotemporal heterogeneity in disease dynamics. Our work empirically characterizes, for the first time, the seasonality of human social behavior at a large scale with a high spatiotemporal resolution and provides a parsimonious parameterization of seasonal behavior that can be included in infectious disease dynamics models.
We provide critical evidence and methods necessary to inform the public health of seasonal and pandemic respiratory pathogens and improve our understanding of the relationship between the physical environment and infection risk in the context of global change. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM123007.

Articles published on Large-scale Data Analysis

Large-Scale Estimation and Analysis of Web Users' Mood from Web Search Query and Mobile Sensor Data.

Middle Meningeal Artery Embolization in Adjunction to Surgical Evacuation for Treatment of Subdural Hematomas: A Nationwide Comparison of Outcomes With Isolated Surgical Evacuation.

Large-scale real-world data analyses of cancer risks among patients with rheumatoid arthritis.

Performance models of data parallel DAG workflows for large scale data analytics

A mathematical programming approach for resource allocation of data analysis workflows on heterogeneous clusters

Accurate Label-Free Quantification by directLFQ to Compare Unlimited Numbers of Proteomes

Accurate drusen segmentation in optical coherence tomography via order-constrained regression of retinal layer heights

Comparative transcriptome analysis reveals the core molecular network in pattern-triggered immunity in Sorghum bicolor

Fathers in Europe: Policies, constructions and practices. Introduction to the Special Collection

Massive Parallel Alignment of RNA-seq Reads in Serverless Computing

Curating Training Data for Reliable Large-Scale Visual Data Analysis: Lessons from Identifying Trash in Street View Imagery

Organizational Agility and Communicative Actions for Responsible Innovation: Evidence from manufacturing firms in South Korea

PLDH: Pseudo-Labels Based Deep Hashing

An empirical analysis of the impact of semiconductor engineer characteristics on outflows and inflows: evidence from six major semiconductor countries

PANGEA: a new gene set enrichment tool for Drosophila and common research organisms.

Approaches and tools for user-driven provenance and data quality information in spatial data infrastructures

A national transgender health survey from China assessing gender identity conversion practice, mental health, substance use and suicidality

Achieving high yield and nitrogen agronomic efficiency by coupling wheat varieties with soil fertility

Efficient False Positive Control Algorithms in Big Data Mining

Disentangling the rhythms of human activity in the built environment for airborne transmission risk: An analysis of large-scale mobility data.
