Lightweight data management with dtool.

Tjelvar S G Olsson, Matthew Hartley

doi:10.7717/peerj.6562

Abstract

The explosion in volumes and types of data has led to substantial challenges in data management. These challenges are often faced by front-line researchers who are already dealing with rapidly changing technologies and have limited time to devote to data management. There are good high-level guidelines for managing and processing scientific data. However, there is a lack of simple, practical tools to implement these guidelines. This is particularly problematic in a highly distributed research environment where needs differ substantially from group to group and centralised solutions are difficult to implement and storage technologies change rapidly. To meet these challenges we have developed dtool, a command line tool for managing data. The tool packages data and metadata into a unified whole, which we call a dataset. The dataset provides consistency checking and the ability to access metadata for both the whole dataset and individual files. The tool can store these datasets on several different storage systems, including a traditional file system, object store (S3 and Azure) and iRODS. It includes an application programming interface that can be used to incorporate it into existing pipelines and workflows. The tool has provided substantial process, cost, and peace-of-mind benefits to our data management practices and we want to share these benefits. The tool is open source and available freely online at http://dtool.readthedocs.io.
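The idea of a dataset that bundles data with metadata and supports consistency checking can be illustrated with a short sketch. This is not dtool's implementation, API, or on-disk format. The directory layout (`data/`, `metadata.json`, `manifest.json`), the function names (`freeze`, `verify`, `build_manifest`, `demo`) and the choice of SHA-256 are all assumptions made for illustration, using only the Python standard library.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Record the size and hash of every file under data_dir (per-file metadata)."""
    items = {}
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        relpath = path.relative_to(data_dir).as_posix()
        items[relpath] = {
            "size_in_bytes": path.stat().st_size,
            "hash": file_hash(path),
        }
    return {"hash_function": "sha256", "items": items}


def freeze(dataset_dir: Path, descriptive_metadata: dict) -> None:
    """Write dataset-level metadata and a manifest next to the data (hypothetical layout)."""
    (dataset_dir / "metadata.json").write_text(json.dumps(descriptive_metadata, indent=2))
    manifest = build_manifest(dataset_dir / "data")
    (dataset_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))


def verify(dataset_dir: Path) -> list:
    """Compare the data on disk with the stored manifest; an empty list means consistent."""
    recorded = json.loads((dataset_dir / "manifest.json").read_text())["items"]
    current = build_manifest(dataset_dir / "data")["items"]
    problems = []
    for relpath in sorted(set(recorded) | set(current)):
        if relpath not in current:
            problems.append(f"missing: {relpath}")
        elif relpath not in recorded:
            problems.append(f"unexpected: {relpath}")
        elif recorded[relpath] != current[relpath]:
            problems.append(f"altered: {relpath}")
    return problems


def demo() -> tuple:
    """Freeze a toy dataset, then alter one file and verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset_dir = Path(tmp) / "my_dataset"
        (dataset_dir / "data").mkdir(parents=True)
        reads = dataset_dir / "data" / "reads.txt"
        reads.write_text("ACGT\n")
        freeze(dataset_dir, {"description": "toy example", "owner": "hypothetical"})
        before = verify(dataset_dir)
        reads.write_text("ACGA\n")  # same size, different content
        after = verify(dataset_dir)
    return before, after


if __name__ == "__main__":
    print(demo())
```

Keeping the manifest separate from the descriptive metadata means the per-file records can be regenerated and compared mechanically, while the dataset-level description stays human-editable. A real tool would also need to handle storage backends other than a local file system, which this sketch does not attempt.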

Highlights

Science is an empirical discipline and requires careful data management.
Advances in our ability to capture and store data have resulted in a "big data explosion". This is true in biology and has resulted in data management becoming one of the big challenges faced by the biological sciences (Howe et al., 2008; Stephens et al., 2015; Cook et al., 2018).
dtool consists of a command line tool and an application programming interface (API) for packaging and interacting with data.

Summary

Introduction

Science is an empirical discipline and requires careful data management. Advances in our ability to capture and store data have resulted in a "big data explosion". This is true in biology and has resulted in data management becoming one of the big challenges faced by the biological sciences (Howe et al., 2008; Stephens et al., 2015; Cook et al., 2018). Funders and the research community care about data being trusted, shared and reusable (Vision, 2010; Wilkinson et al., 2016; Waard, Cousijn & Aalbersberg, 2018; Leek, 2018). At the ground level, individual researchers need to think about how to structure their data into files, how these data files are to be organised and how to associate metadata with these data files (Hart et al., 2016; Wickham, 2014; Leek, 2018).

Journal: PeerJ	Publication Date: Mar 7, 2019
Citations: 7	License type: CC BY 4.0
