Abstract

An algorithm is proposed for computing the minimum-volume enclosing ellipsoid (MVEE) of a large dataset stored in an external database, a setting in which existing algorithms run out of memory or become prohibitively slow. The focus is on tall datasets, i.e., datasets consisting of a huge number of data points of moderate dimensionality. The proposed Big Index Batching algorithm works in an optimization-deletion-adaptation cycle: an existing MVEE algorithm is applied to a smaller batch of data; the vector of data-point indices is pruned by removing points that are guaranteed not to lie on the boundary of the MVEE; and the choice of the next batch is adapted efficiently. The algorithm is provably convergent and simple to describe and implement. Because reading tall data from the database is very time-consuming, the amount of reading performed during an MVEE computation should be kept as small as possible. Numerical examples show that Big Index Batching tends to find the MVEE after reading all data points only two or three times, so the proposed algorithm usually converges to the MVEE reasonably fast. Its usefulness in robust statistics and anomaly detection is demonstrated by identifying potential outliers in a large dataset via so-called ellipsoidal trimming.
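The abstract only outlines the optimization-deletion-adaptation cycle. The Python sketch below is a minimal, hypothetical illustration of that general batching idea, not the authors' Big Index Batching algorithm: it uses the classical Khachiyan solver as the "existing algorithm" applied to each batch, and the rule for discarding points that fall well inside the current ellipsoid is a simple heuristic placeholder rather than the paper's provably safe pruning criterion; the batch size, tolerance, and trimming threshold are likewise illustrative choices.

```python
import numpy as np

def mvee(points, tol=1e-4):
    """Khachiyan's algorithm for the minimum-volume enclosing ellipsoid.

    points: (m, d) array.  Returns (c, A) such that the ellipsoid is
    {x : (x - c)^T A (x - c) <= 1}.
    """
    m, d = points.shape
    Q = np.vstack([points.T, np.ones(m)])           # lifted (d+1, m) points
    u = np.full(m, 1.0 / m)                         # uniform initial weights
    err = tol + 1.0
    while err > tol:
        X = (Q * u) @ Q.T                           # (d+1, d+1) moment matrix
        M = np.einsum('ji,jk,ki->i', Q, np.linalg.inv(X), Q)
        j = int(np.argmax(M))                       # most "violating" point
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step
        err = np.linalg.norm(new_u - u)
        u = new_u
    c = points.T @ u                                # ellipsoid center
    A = np.linalg.inv(points.T @ (points * u[:, None]) - np.outer(c, c)) / d
    return c, A

def batched_mvee(data, batch_size=5000, tol=1e-4, keep_above=0.99):
    """Conceptual batching loop: solve on one batch at a time, keep only
    candidate boundary points, and merge them with the next batch.

    CAUTION: dropping points with (x - c)^T A (x - c) <= keep_above is a
    heuristic placeholder, not a provably safe deletion rule.
    """
    d = data.shape[1]
    candidates = np.empty((0, d))
    for start in range(0, len(data), batch_size):
        batch = np.vstack([candidates, data[start:start + batch_size]])
        c, A = mvee(batch, tol)
        vals = np.einsum('ij,jk,ik->i', batch - c, A, batch - c)
        candidates = batch[vals > keep_above]       # near-boundary / outside points
    return mvee(candidates, tol)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((50_000, 5))         # tall synthetic dataset
    c, A = batched_mvee(data)
    # Ellipsoidal trimming (illustrative): flag points lying far from the
    # center in the metric of the ellipsoid as potential outliers.
    vals = np.einsum('ij,jk,ik->i', data - c, A, data - c)
    print("potential outliers:", int(np.sum(vals > 0.95)))
```

In this sketch only the points surviving the inside test are carried over between batches, which mimics the idea of reading all data a small number of times while solving the full MVEE problem only on a much smaller candidate set.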
