Abstract
A standard assumption in the theory of machine learning is that the data are generated from a fixed but unknown probability distribution. Although this assumption rests on the foundations of probability theory, in practice most learning pipelines randomly shuffle the original dataset, for example when randomly splitting it into training and test sets before training, so that the assumption is satisfied; the shuffled training set is then used to fit the model. In real-life applications, however, data pairs are observed batch by batch in their original order, and shuffling them in advance is not always possible or necessary. From a mathematical point of view, we test whether random shuffling has a non-negligible influence on the generalization of learning machines. We reduce the question of random shuffling to the problem of distribution-shift detection. This paper is devoted to testing the null hypothesis that random shuffling does not affect the generalization of learning machines, and it introduces a distribution-free martingale method against this hypothesis. We report experimental results on five real-life benchmarks using Support Vector Machines and a multi-layer perceptron. The results show that distribution shift within the data is an inescapable reality when machine learning algorithms are built on data in its original order.
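The martingale idea can be made concrete. Below is a minimal sketch, in Python, of a conformal test martingale in the spirit of Vovk-style exchangeability martingales: the nonconformity measure (absolute deviation from the running mean) and the betting function (a power martingale with epsilon = 0.92) are illustrative assumptions, not necessarily the paper's exact construction. Under the null hypothesis the conformal p-values are i.i.d. uniform and the martingale stays small; a distribution shift in the original data order makes it grow.

```python
# Sketch of a conformal test martingale for detecting distribution shift in
# sequentially observed data. Assumptions (not from the paper): nonconformity
# score = |z - running mean|, betting function = power martingale, eps = 0.92.
import numpy as np

def conformal_p_values(scores, rng):
    """Smoothed conformal p-values: i.i.d. U(0,1) under exchangeability."""
    p = np.empty(len(scores))
    for n in range(1, len(scores) + 1):
        past = scores[:n]          # includes the current score scores[n-1]
        gt = np.sum(past > scores[n - 1])
        eq = np.sum(past == scores[n - 1])
        p[n - 1] = (gt + rng.uniform() * eq) / n
    return p

def power_martingale(p_values, epsilon=0.92):
    """S_n = prod_i epsilon * p_i^(epsilon-1); large values reject exchangeability."""
    return np.cumprod(epsilon * p_values ** (epsilon - 1.0))

rng = np.random.default_rng(0)
# Synthetic stream in its "original order": the distribution shifts halfway.
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])

# Nonconformity score of each point against everything seen so far.
scores = np.array([abs(z - stream[: i + 1].mean()) for i, z in enumerate(stream)])

martingale = power_martingale(conformal_p_values(scores, rng))
print(f"final martingale value: {martingale[-1]:.3g}")  # grows large after the shift
```

Shuffling the stream before computing the scores would keep the martingale near 1, which is exactly the contrast the hypothesis test exploits.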