Abstract

Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free.” We develop and evaluate multiple data models for representing versioned data, as well as a lightweight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average $$10^3\times $$ faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to $$20\times $$ relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by $$10\times $$ on average.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call