Abstract

An algorithm is provided for calculating the minimum-volume enclosing ellipsoid (MVEE) of a large dataset stored in a separate database, a setting in which existing algorithms run out of memory or become prohibitively slow. The focus is on tall datasets, i.e., those consisting of very large numbers of data points of moderate dimensionality. The proposed Big Index Batching algorithm works in an optimization-deletion-adaptation cycle: applying an existing MVEE algorithm to a smaller batch of data; pruning the vector of data-point indices by removing points that are guaranteed not to lie on the boundary of the MVEE; and efficiently adapting the choice of the next batch. The algorithm is provably convergent and simple to describe and implement. Reading tall data from the database is very time-consuming, so the amount of reading during an MVEE computation should be kept as small as possible. It is shown through examples that Big Index Batching tends to find the MVEE after reading all data points only two or three times; as a consequence, the proposed algorithm usually converges to the MVEE reasonably fast. Its usefulness in robust statistics and anomaly detection is demonstrated by finding potential outliers in a large dataset using so-called ellipsoidal trimming.
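To make the optimization-deletion-adaptation cycle described above concrete, the following Python fragment is a minimal sketch of the general batch-solve / prune / re-batch pattern; it is not the authors' implementation. The function names (khachiyan_mvee, batched_mvee), the parameters (batch_size, prune_below), and in particular the deletion test are assumptions introduced here for illustration: the sketch uses a classical Khachiyan-type solver as the "existing algorithm" applied to each batch, and a simple distance threshold as a placeholder for the paper's provable non-boundary criterion.

import numpy as np

def khachiyan_mvee(P, tol=1e-7, max_iter=10000):
    """MVEE of the rows of P via the classical Khachiyan iteration
    (standing in for whatever in-memory solver is applied to each batch).
    Returns (A, c) describing the ellipsoid {x : (x - c)^T A (x - c) <= 1}."""
    n, d = P.shape
    Q = np.column_stack([P, np.ones(n)])          # lift points to R^{d+1}
    u = np.full(n, 1.0 / n)                       # weights on the points
    for _ in range(max_iter):
        X = Q.T @ (Q * u[:, None])                # (d+1) x (d+1) moment matrix
        M = np.einsum("ij,jk,ik->i", Q, np.linalg.inv(X), Q)
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        if step < tol:                            # duality gap small enough
            break
        u *= 1.0 - step                           # rank-one weight update
        u[j] += step
    c = P.T @ u                                   # center of the ellipsoid
    A = np.linalg.inv(P.T @ (P * u[:, None]) - np.outer(c, c)) / d
    return A, c

def batched_mvee(P, batch_size=2000, tol=1e-6, prune_below=0.5):
    """Illustrative optimization-deletion-adaptation cycle (NOT the paper's
    exact Big Index Batching scheme; the deletion test is a placeholder)."""
    n, _ = P.shape
    candidates = np.arange(n)                     # index vector kept in memory
    batch = candidates[: min(batch_size, n)]      # initial batch of indices
    while True:
        # optimization: run the in-memory solver on the current batch only
        A, c = khachiyan_mvee(P[batch], tol)
        diff = P[candidates] - c
        vals = np.einsum("ij,jk,ik->i", diff, A, diff)   # (x-c)^T A (x-c)

        # deletion: drop candidates lying well inside the current ellipsoid
        # (simplified test; the paper derives a provable criterion for this step)
        keep = vals > prune_below
        candidates, vals = candidates[keep], vals[keep]

        violators = candidates[vals > 1.0 + tol]
        if violators.size == 0:                   # every remaining point enclosed
            return A, c

        # adaptation: next batch = near-boundary points of the old batch
        # plus the worst violators among the remaining candidates
        bdiff = P[batch] - c
        bvals = np.einsum("ij,jk,ik->i", bdiff, A, bdiff)
        support = batch[bvals > 1.0 - 1e-3]
        order = np.argsort(-vals[vals > 1.0 + tol])
        batch = np.unique(np.concatenate([support, violators[order][:batch_size]]))

In the out-of-memory setting targeted by the paper, the scan producing vals would be performed by streaming the remaining candidate points from the database in batches rather than indexing an in-memory array, which is why keeping the candidate index vector short translates directly into fewer full reads of the data.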
