Data-intensive Activities Research Articles

Differential privacy is at a turning point. Implementations have been successfully leveraged in private industry, the public sector, and academia in a wide variety of applications, allowing scientists, engineers, and researchers to learn about populations of interest without learning about specific individuals. Because differential privacy allows us to quantify cumulative privacy loss, these differentially private systems will, for the first time, allow us to measure and compare the total privacy loss due to these personal data-intensive activities. Appropriately leveraged, this could be a watershed moment for privacy. As with other technologies and techniques that allow for a range of instantiations, implementation details matter. When meaningfully implemented, differential privacy supports deep data-driven insights with minimal worst-case privacy loss. When not meaningfully implemented, differential privacy delivers privacy mostly in name. Using differential privacy to maximize learning while providing a meaningful degree of privacy requires judicious choices with respect to the privacy parameter epsilon, among other factors. However, there is little understanding of what the optimal value of epsilon is for a given system, or for classes of systems, purposes, or data, or of how to go about determining it. To understand current differential privacy implementations and how organizations make these key choices in practice, we conducted interviews with practitioners to learn from their experiences of implementing differential privacy. We found no clear consensus on how to choose epsilon, nor agreement on how to approach this and other key implementation decisions. Given the importance of these implementation details, there is a need for shared learning amongst the differential privacy community.
To serve these purposes, we propose the creation of the Epsilon Registry, a publicly available communal body of knowledge about differential privacy implementations that various stakeholders can use to drive the identification and adoption of judicious differentially private implementations.
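
The role of epsilon and of cumulative privacy loss described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper. The choice of the Laplace mechanism, the names `private_count` and `PrivacyLedger`, the example count of 1000, and the epsilon values are all assumptions made for illustration, and the ledger assumes basic sequential composition, under which the epsilons of successive releases on the same data simply add up.

```python
import random


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count using the Laplace mechanism (illustrative only).

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two independent Exponential(rate=epsilon) draws
    # is Laplace-distributed with mean 0 and scale 1 / epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


class PrivacyLedger:
    """Hypothetical helper that tracks cumulative privacy loss.

    Assumes basic sequential composition: the epsilons of successive
    releases on the same data add up.
    """

    def __init__(self) -> None:
        self.spent = 0.0

    def charge(self, epsilon: float) -> float:
        """Record one release and return the total epsilon spent so far."""
        self.spent += epsilon
        return self.spent


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    ledger = PrivacyLedger()
    for eps in (0.1, 1.0):  # smaller epsilon: more noise, less privacy loss
        noisy = private_count(1000, eps, rng)
        total = ledger.charge(eps)
        print(f"epsilon={eps}: noisy count={noisy:.1f}, cumulative epsilon={total:.1f}")
```

Smaller values of epsilon give noisier answers and lower privacy loss per release, which is the trade-off behind the "judicious choices" the abstract refers to. The sketch uses ordinary floating-point sampling and is not suitable for production use, where a vetted differential privacy library would be the safer choice.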

Biopharmaceutical R&D organizations characterize drug candidate target effects and modes of action and create molecular models of target diseases. These data-intensive activities are informed by vast data resources including publicly available data, internally generated data and partnered private data collections. However, rapid evolution in computing, data management tools, and analytical and visualization methods, together with the complexity of data types and the data volumes that must be accommodated, presents significant technical and logistic hurdles. It is particularly difficult for a geographically dispersed R&D organization to make data resources easily available to scientists for search, visualization and exploration. Nevertheless, this is required for R&D scientists to gain insight into disease and drug mechanisms and to capture the knowledge needed to sustain the scientific enterprise. Standardized commercial solutions to R&D data challenges are unattractive since they require significant resource investment in platform configuration, user training and system maintenance. This strategy necessarily delays the adoption of newly emerging technologies and, because of the investment in existing systems, creates an incentive not to adopt alternatives. In contrast, our solution to R&D data demands was to build a cloud-deployed data platform using state-of-the-art tools developed and maintained by the open source software community at the Apache Software Foundation. Partnering with academic data scientists, we selected the best available tools to fit our specific needs. We integrated them into a platform accessible to our federated R&D scientific community while allowing the system to be freely modified and updated on demand to meet evolving user requirements.
Priorities for our data platform are to ingest, secure and index R&D source data of all types, make these indexed data assets available to computational scientists for analysis, and provide faceted search capability based on a comprehensive metadata model. Three products, LabKey server, Apache OODT and ISATools, have been combined into a scientific data management system to provide a unified data resource enhanced by a search platform powered by Apache Solr. The platform supports both internally generated data and data imported from public, contracted or partnered sources. All data are available for interactive exploration by our R&D community, accessed via integrated search, analysis and visualization tools. Deployment of this system to our R&D organization has been met with enthusiastic adoption. Feedback for improvement and requests for system enhancements and additional capabilities are rapidly addressed in this open source environment, leading to further adoption among the R&D scientists and providing the basis for accessible, stable institutional knowledge collections.

Citation Format: Lauren Intagliata, Selina Chu, Garth McGrath, Giuseppe Totaro, Daniel Civello, Nipurn Doshi, Shivika Thapar, Michael Livstone, Chris Mattmann, Paul Ramirez, Maureen Cronin. A cloud-enabled open source data management platform supporting a federated research and development organization. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 5282.

Articles published on Data-intensive Activities

Dynamic pricing on maximum concurrency for heterogeneous instances using hyperparameter optimization in dueling deep reinforcement learning in a multi-cloud scenario

Tensions in Data Journey Activities: Mobilising, Processing, Producing, and Re-purposing Data in Environmental Assessment Practice

Differential Privacy in Practice: Expose your Epsilons!

Big data and risk management in business processes: implications for corporate real estate

Abstract 5282: A cloud-enabled open source data management platform supporting a federated research and development organization

Common motifs in scientific workflows: An empirical analysis

Scripting for large-scale sequencing based on Hadoop

DZero data-intensive computing on the Open Science Grid

The Special Case of Pesticides: Science and Regulation

Mining a large database with a parallel database server

Research perspectives for time series management systems
