Abstract
Most machine learning algorithms need to handle large data sets, which often imposes limitations on processing time and memory. Expectation-Maximization (EM) is one such algorithm; it is used to train one of the most commonly used parametric statistical models, the Gaussian Mixture Model (GMM). All steps of the algorithm are potentially parallelizable, since they iterate over the entire data set. In this study, we propose a parallel implementation of EM for training GMMs using CUDA. Experiments are performed with a UCI dataset, and the results show a speedup of 7 compared to the sequential version. We have also modified the code to provide better access to global memory and to make use of shared memory. We achieved an occupancy of up to 56.4%, regardless of the number of Gaussians considered in the set of experiments.
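To make the parallelization concrete, the sketch below shows one possible E-step kernel in CUDA-C, with one thread computing the responsibilities of all K Gaussians for a single sample. It is a minimal illustration under assumed conventions (diagonal covariances, row-major layout), not the authors' code; identifiers such as eStepKernel, d_data and d_resp are hypothetical.

__global__ void eStepKernel(const float *d_data,   // N x D samples, row-major
                            const float *d_mean,   // K x D component means
                            const float *d_invVar, // K x D inverse variances (diagonal covariances)
                            const float *d_logW,   // K log mixture weights
                            float *d_resp,         // N x K responsibilities (output)
                            int N, int D, int K)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per sample
    if (n >= N) return;

    // log w_k + log N(x_n | mu_k, Sigma_k), up to a constant common to all k
    float maxLog = -1e30f;
    for (int k = 0; k < K; ++k) {
        float logp = d_logW[k];
        for (int d = 0; d < D; ++d) {
            float iv   = d_invVar[k * D + d];              // 1 / sigma^2_{k,d}
            float diff = d_data[n * D + d] - d_mean[k * D + d];
            logp += 0.5f * (logf(iv) - diff * diff * iv);
        }
        d_resp[n * K + k] = logp;
        maxLog = fmaxf(maxLog, logp);
    }

    // Normalize with the log-sum-exp trick so the responsibilities sum to 1.
    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
        sum += expf(d_resp[n * K + k] - maxLog);
    for (int k = 0; k < K; ++k)
        d_resp[n * K + k] = expf(d_resp[n * K + k] - maxLog) / sum;
}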
Highlights
The clear advantage of using Graphical Processing Units (GPUs) is their small cost compared to clusters or supercomputers. Machine Learning (ML) algorithms are often costly, since learning is a task that requires a large amount of knowledge and its constant refinement, demanding massive data computation
In contrast to the approaches of (Machlica et al., 2011) and (Kumar et al., 2009), in our proposal the main loop of the algorithm is implemented sequentially and different CUDA kernels are in charge of running the different steps of the algorithm (see the driver sketch after these highlights)
We have used the Arabic Spoken Digit dataset from the UCI Repository to test the algorithm implementation. This dataset consists of instances with 13 Mel Frequency Cepstral Coefficients (MFCCs), widely used to represent audio signals in speech processing systems, which commonly use Gaussian Mixture Models (GMMs) to model the distribution of phones in the language
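As a complement to the second highlight, the host-side sketch below illustrates one way a sequential main loop can drive separate CUDA kernels for the E-step, the M-step and the convergence test. The kernel names, the logLikelihoodReduce helper and the stopping rule are assumptions for illustration, not the paper's actual interface.

#include <cmath>

// Hypothetical kernel and helper declarations; bodies omitted.
__global__ void eStepKernel(const float *d_data, const float *d_mean,
                            const float *d_invVar, const float *d_logW,
                            float *d_resp, int N, int D, int K);
__global__ void mStepKernel(const float *d_data, const float *d_resp,
                            float *d_mean, float *d_invVar, float *d_logW,
                            int N, int D, int K);
float logLikelihoodReduce(const float *d_resp, int N, int K);  // host-side reduction

// Sequential EM loop on the CPU; each step of an iteration runs as a CUDA kernel.
void trainGMM(const float *d_data, float *d_mean, float *d_invVar,
              float *d_logW, float *d_resp,
              int N, int D, int K, int maxIters, float tol)
{
    dim3 block(256);
    dim3 grid((N + block.x - 1) / block.x);

    float prevLogLik = -1e30f;
    for (int it = 0; it < maxIters; ++it) {
        // E-step: responsibilities of every Gaussian for every sample.
        eStepKernel<<<grid, block>>>(d_data, d_mean, d_invVar, d_logW,
                                     d_resp, N, D, K);
        // M-step: re-estimate weights, means and (diagonal) covariances,
        // e.g. with one block per Gaussian performing a reduction over samples.
        mStepKernel<<<K, block>>>(d_data, d_resp, d_mean, d_invVar, d_logW,
                                  N, D, K);
        // Stop when the log-likelihood no longer improves significantly.
        float logLik = logLikelihoodReduce(d_resp, N, K);
        if (fabsf(logLik - prevLogLik) < tol) break;
        prevLogLik = logLik;
    }
}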
Summary
The clear advantage of using GPUs is their small cost compared to clusters or supercomputers. Machine Learning (ML) algorithms are often costly, since learning is a task that requires a large amount of knowledge and its constant refinement, demanding massive data computation. A major problem of massive computing is the limited processing power of mainstream multi-core processors. Even an older graphics card such as the NVIDIA GeForce 8400 GS, for instance, is able to run up to 32 threads in parallel per clock cycle, under some restrictions. This limitation can be overcome through parallel processing of data on newer architectures, and applications in several domains can be observed in the recent scientific literature. One of these recent architectures is NVIDIA's CUDA. It is possible to use the CUDA-C programming language, for instance, to write parallelized source code.
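As a concrete illustration of the kind of CUDA-C memory optimization mentioned in the abstract (better global-memory access and shared-memory usage), the sketch below stages the component means in shared memory once per block, so that every thread reads them from fast on-chip memory instead of repeatedly from global memory. It is an assumed example, not the paper's code; distancesKernel, d_dist and the other identifiers are hypothetical.

__global__ void distancesKernel(const float *d_data,  // N x D samples
                                const float *d_mean,  // K x D component means
                                float *d_dist,        // N x K squared distances (output)
                                int N, int D, int K)
{
    extern __shared__ float s_mean[];                 // K * D floats, sized at launch

    // Cooperative, coalesced copy of all means into shared memory.
    for (int i = threadIdx.x; i < K * D; i += blockDim.x)
        s_mean[i] = d_mean[i];
    __syncthreads();

    int n = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per sample
    if (n >= N) return;

    for (int k = 0; k < K; ++k) {
        float acc = 0.0f;
        for (int d = 0; d < D; ++d) {
            float diff = d_data[n * D + d] - s_mean[k * D + d];
            acc += diff * diff;
        }
        d_dist[n * K + k] = acc;
    }
}

// Launch example: the dynamic shared-memory size is the third launch parameter.
// distancesKernel<<<grid, 256, K * D * sizeof(float)>>>(d_data, d_mean, d_dist, N, D, K);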