Abstract

Background: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources.

Results: We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations.

Conclusions: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as an open-source tool named pfff, as a command-line tool as well as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.
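The abstract describes the core idea of PFFF: derive a fingerprint from a small pseudorandom sample of file bytes rather than from the whole file. The sketch below is a minimal illustration of that idea only, not the pfff implementation itself; the sample count, the xorshift PRNG, and the FNV-1a mixing step are assumptions chosen for brevity.

    #include <stdio.h>
    #include <stdint.h>

    /* Minimal sketch of a sampling-based file fingerprint.
     * NOT the actual pfff algorithm: the number of samples, the PRNG
     * (xorshift64 seeded with a shared key), and the FNV-1a accumulator
     * are illustrative assumptions only. */

    static uint64_t xorshift64(uint64_t *state) {
        uint64_t x = *state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return *state = x;
    }

    uint64_t sample_fingerprint(const char *path, uint64_t key, size_t nsamples) {
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        uint64_t state = key ? key : 0x9E3779B97F4A7C15ULL; /* xorshift needs a nonzero seed */
        uint64_t hash = 1469598103934665603ULL;             /* FNV-1a offset basis */
        for (size_t i = 0; i < nsamples && size > 0; i++) {
            /* Sample a pseudorandom offset; only these bytes are ever read. */
            long offset = (long)(xorshift64(&state) % (uint64_t)size);
            fseek(f, offset, SEEK_SET);
            int byte = fgetc(f);
            if (byte == EOF) break;
            hash ^= (uint64_t)(unsigned char)byte;
            hash *= 1099511628211ULL;                       /* FNV-1a prime */
        }
        fclose(f);
        return hash;
    }

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s FILE...\n", argv[0]);
            return 1;
        }
        /* Two sites using the same key and sample count obtain
         * comparable fingerprints without transferring the files. */
        for (int i = 1; i < argc; i++)
            printf("%016llx  %s\n",
                   (unsigned long long)sample_fingerprint(argv[i], 42, 1024),
                   argv[i]);
        return 0;
    }

Because only a fixed number of bytes are read regardless of file size, the cost of this scheme is flat, which is the performance characteristic the abstract attributes to PFFF.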

Highlights

  • Biological data acquisition is raising new challenges, both in data analysis and handling

  • This hash value is a number, typically sized between 64 and 2048 bits, such that the chances of obtaining the same hash value for two distinct files - a situation referred to as a collision - are negligibly small, on the order of 2^-64 to 2^-2048. In this way, hash values act as a fingerprinting technique, allowing files at two sites to be compared without transferring them

  • We offer a new hashing algorithm, Probabilistic Fast File Fingerprinting (PFFF), that computes file fingerprints by sampling only a few bytes from the file in a pseudorandom fashion


Summary

Introduction

Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. A standard way to compare or verify files without transferring them is to compute a hash value of each file's contents. This hash value is a number, typically sized between 64 and 2048 bits, such that the chances of obtaining the same hash value for two distinct files - a situation referred to as a collision - are negligibly small, on the order of 2^-64 to 2^-2048. In this way, hash values act as a fingerprinting technique, allowing files at two sites to be compared without transferring them.
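To make "negligibly small" concrete: while any single pair of distinct files collides with probability about 2^-b for a b-bit hash, the standard birthday bound estimates the chance of any collision among n files as roughly n(n-1)/2 * 2^-b. The snippet below is an illustrative calculation of that bound, not part of pfff; the choice of one billion files and a 64-bit fingerprint is an assumption for the example.

    #include <stdio.h>
    #include <math.h>

    /* Birthday-bound estimate: probability that at least one pair among
     * n distinct files collides under a b-bit hash is roughly
     * n*(n-1)/2 * 2^-b. Illustrative only; not part of pfff. */
    int main(void) {
        double n = 1e9;   /* one billion files (assumed for the example) */
        double b = 64.0;  /* fingerprint width in bits */
        double p = n * (n - 1.0) / 2.0 * pow(2.0, -b);
        printf("P(any collision) ~ %.3g\n", p); /* ~2.7e-2 for these values */
        return 0;
    }

Even across a billion files, a 64-bit fingerprint yields an aggregate collision probability of only a few percent, and wider fingerprints in the 128-2048 bit range drive it toward zero.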

