The principal support vector machines method (Li et al., 2011) is a powerful tool for sufficient dimension reduction that replaces the original predictors with their low-dimensional linear combinations while preserving the information relevant to regression and classification. However, its computational burden constrains its use for massive data. To address this issue, we propose a naive and a refined distributed estimation algorithm for fast implementation when the sample size is large. Both distributed sufficient dimension reduction estimators achieve the same statistical efficiency as the estimator computed on the merged data, which provides rigorous statistical guarantees for their application to large-scale datasets; the refined method requires smaller batch sample sizes and is therefore more advantageous when the distributed machines have limited memory. The two distributed algorithms are further adapted to principal weighted support vector machines (Shin et al., 2017) for sufficient dimension reduction in binary classification. The statistical accuracy and computational complexity of the proposed methods are examined through comprehensive simulation studies and a real data application with more than 600,000 samples.
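To make the divide-and-conquer idea behind the naive algorithm concrete, the sketch below combines the standard principal support vector machines construction of Li et al. (2011), which fits linear SVMs to dichotomies of a sliced response and eigendecomposes the aggregated normal vectors, with simple averaging of the batch-level candidate matrices. All function names, the batching scheme, and the tuning choices here are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a naive divide-and-conquer estimator in the spirit of
# principal support vector machines (Li et al., 2011). Names and tuning
# choices are illustrative assumptions, not the paper's actual algorithm.
import numpy as np
from sklearn.svm import LinearSVC

def psvm_candidate_matrix(X, y, n_slices=5, C=1.0):
    """Local PSVM step: for each dichotomy of the sliced response, fit a
    linear SVM and accumulate the outer products of its normal vectors."""
    n, p = X.shape
    # Standardize predictors so the recovered directions are comparable.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    cuts = np.quantile(y, np.linspace(0, 1, n_slices + 1)[1:-1])
    M = np.zeros((p, p))
    for c in cuts:
        labels = (y > c).astype(int)
        svm = LinearSVC(C=C, dual=False).fit(Xc, labels)
        w = svm.coef_.ravel()
        M += np.outer(w, w)
    return M

def naive_distributed_sdr(X, y, n_machines=10, d=2, **kwargs):
    """Naive distributed estimator (assumed form): average the candidate
    matrices computed on disjoint batches, then take the top-d eigenvectors
    as the estimated basis of the dimension reduction subspace."""
    batches = np.array_split(np.arange(len(y)), n_machines)
    M_bar = np.mean(
        [psvm_candidate_matrix(X[idx], y[idx], **kwargs) for idx in batches],
        axis=0,
    )
    eigvals, eigvecs = np.linalg.eigh(M_bar)
    return eigvecs[:, ::-1][:, :d]  # leading eigenvectors span the estimate

# Toy usage: y depends on X only through two linear combinations.
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=20000)
B_hat = naive_distributed_sdr(X, y, n_machines=10, d=2)
print(B_hat.shape)  # (10, 2)
```

In this reading, each machine only ever touches its own batch, so the per-machine cost of the SVM fits shrinks with the number of batches while the final averaging and eigendecomposition are cheap; the refined variant described in the abstract would modify the batch-level step, but the abstract does not give enough detail to sketch it here.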