Abstract

Machine learning benchmark data sets come in all shapes and sizes, whereas classification algorithms assume sanitized input, such as (x, y) pairs with vector-valued input x and integer class label y. Researchers and practitioners know all too well how tedious it can be to get from the URL of a new data set to a NumPy ndarray suitable for e.g. pandas or sklearn. The SkData library handles that work for a growing number of benchmark data sets (small and large) so that one-off in-house scripts for downloading and parsing data sets can be replaced with library code that is reliable, community-tested, and documented. The SkData library also introduces an open-ended formalization of training and testing protocols that facilitates direct comparison with published research. This paper describes the usage and architecture of the SkData library.

Index Terms—machine learning, cross validation, reproducibility

While the neatness of these mathematical abstractions is reflected in the organization of machine learning libraries such as [sklearn], we believe there is a gap in Python's machine learning stack between raw data sets and such neat, abstract interfaces. Data, even when it is provided specifically to test classification algorithms, is seldom provided as (feature, label) pairs. Guidelines regarding standard experiment protocols (e.g. which data to use for training) are expressed informally in web page text, if at all. The SkData library consolidates the myriad little details of idiosyncratic data processing required to run experiments on standard data sets, and packages them as a library of reusable code. It serves both as a gateway to a growing list of standard public data sets and as a framework for expressing precise evaluation protocols that correspond to standard ways of using those data sets.

This paper introduces the SkData library [SkData] for accessing data sets in Python. SkData provides two levels of interface:

1. It provides low-level, idiosyncratic logic for acquiring, unpacking, and parsing standard data sets so that they can be loaded into sensible Python data structures.

2. It provides high-level logic for evaluating machine learning algorithms using strictly controlled experiment protocols, so that it is easy to make direct, valid model comparisons.
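To make the first, low-level interface concrete, here is a small self-contained sketch. The `FooDigits` class below is hypothetical (it is not part of SkData, and the raw payload is invented); it only illustrates the kind of logic a dataset module encapsulates: acquiring a raw file, parsing it, and exposing the examples as NumPy arrays.

```python
import numpy as np

# Hypothetical raw payload, standing in for an archive that a real
# dataset module would download and cache on disk.
RAW = b"""5.1,3.5,0
4.9,3.0,0
6.2,3.4,2
5.9,3.0,2
"""

class FooDigits:
    """Sketch of a low-level dataset class: acquire, parse, expose."""

    def fetch(self):
        # A real module would download and unpack an archive here;
        # this sketch just "acquires" the in-memory payload.
        self._raw = RAW

    def build_arrays(self):
        # Parse the raw text into the (x, y) arrays that
        # classification algorithms actually expect.
        rows = [line.split(b",") for line in self._raw.splitlines() if line]
        x = np.array([[float(v) for v in row[:-1]] for row in rows])
        y = np.array([int(row[-1]) for row in rows])
        return x, y

ds = FooDigits()
ds.fetch()
x, y = ds.build_arrays()
print(x.shape, y.shape)  # (4, 2) (4,)
```

The value of packaging such logic as a library is that the fiddly parsing details are written, tested, and documented once, rather than re-implemented in every lab's one-off scripts.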


Introduction

There is nothing standard about data sets for machine learning. The nature of data sets varies widely, from physical measurements of flower petals ([Iris]), to pixel values of tiny public domain images ([CIFAR-10]), to the movie watching habits of NetFlix users ([Netflix]). SkData therefore provides its two levels of interface (low-level data access and high-level experiment protocols) on a data-set-by-data-set basis. By convention, the low-level logic for a data set (e.g. foo) is written in a Python module called skdata.foo.dataset. Users who want a head start in getting Python access to downloaded data are well served by these low-level dataset modules, but users who want a framework that helps them reproduce previous machine learning results by following specific experiment protocols will be more interested in SkData's higher-level view interface. The next few sections describe the high-level protocol abstractions provided by SkData's various data-set-specific view modules.
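The idea behind a protocol abstraction can be sketched in a few lines. The classes below are hypothetical, not SkData's actual API: a protocol object, rather than the user, decides which examples are used for training and which for testing, so every algorithm evaluated under it makes a direct, valid comparison. The toy nearest-centroid learner exists only to exercise the protocol.

```python
import numpy as np

class TrainTestProtocol:
    """Sketch of an experiment protocol: it fixes the train/test split
    and drives any learning algorithm through the same steps."""

    def __init__(self, x, y, n_train):
        self.x, self.y, self.n_train = x, y, n_train

    def protocol(self, algo):
        n = self.n_train
        model = algo.fit(self.x[:n], self.y[:n])           # training phase
        return algo.error_rate(model, self.x[n:], self.y[n:])  # evaluation

class NearestCentroid:
    """Toy learning algorithm used only to exercise the protocol."""

    def fit(self, x, y):
        # One centroid per class.
        return {c: x[y == c].mean(axis=0) for c in np.unique(y)}

    def error_rate(self, model, x, y):
        classes = sorted(model)
        dists = np.array([[np.linalg.norm(p - model[c]) for c in classes]
                          for p in x])
        pred = np.array(classes)[dists.argmin(axis=1)]
        return float((pred != y).mean())

x = np.array([[0.0], [1.0], [9.0], [10.0], [0.5], [9.5]])
y = np.array([0, 0, 1, 1, 0, 1])
err = TrainTestProtocol(x, y, n_train=4).protocol(NearestCentroid())
print(err)  # 0.0
```

Because the split lives in the protocol object rather than in user code, two researchers running the same protocol on different algorithms are guaranteed to have trained and tested on exactly the same examples.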

