Abstract

Most machine learning algorithms need to handle large data sets, which often imposes limits on processing time and memory. Expectation-Maximization (EM) is one such algorithm, used to train one of the most widely used parametric statistical models, the Gaussian Mixture Model (GMM). All steps of the algorithm are potentially parallelizable, since they iterate over the entire data set. In this study, we propose a parallel implementation of EM for training GMMs using CUDA. Experiments are performed with a UCI dataset, and the results show a speedup of 7 compared to the sequential version. We have also modified the code to provide better global memory access and shared memory usage, reaching up to 56.4% achieved occupancy regardless of the number of Gaussians considered in the set of experiments.
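The kernels themselves are not reproduced in this summary. As a rough, hedged illustration of how the E-step can be parallelized over the data set, the sketch below assigns one CUDA thread per sample and computes normalized responsibilities for a diagonal-covariance GMM; all identifiers (`estep_kernel`, the array layouts, and so on) are assumptions made for illustration, not the authors' code.

```cuda
// Hedged sketch (not the paper's code): E-step of EM for a diagonal-covariance
// GMM, one thread per data point. Parameters are read from global memory; the
// paper's tuned version additionally exploits shared memory.
#include <cuda_runtime.h>
#include <math.h>

#define TWO_PI 6.28318530718f

__global__ void estep_kernel(const float *data,     // N x D samples, row-major
                             const float *means,    // K x D component means
                             const float *vars,     // K x D diagonal variances
                             const float *weights,  // K mixture weights
                             float *resp,           // N x K responsibilities (output)
                             int N, int D, int K)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;

    // log w_k + log N(x_n | mu_k, Sigma_k) for every component k.
    float maxlog = -INFINITY;
    for (int k = 0; k < K; ++k) {
        float logp = logf(weights[k]);
        for (int d = 0; d < D; ++d) {
            float diff = data[n * D + d] - means[k * D + d];
            float v    = vars[k * D + d];
            logp -= 0.5f * (diff * diff / v + logf(TWO_PI * v));
        }
        resp[n * K + k] = logp;
        maxlog = fmaxf(maxlog, logp);
    }

    // Log-sum-exp normalization so the K responsibilities of each sample sum to 1.
    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
        sum += expf(resp[n * K + k] - maxlog);
    for (int k = 0; k < K; ++k)
        resp[n * K + k] = expf(resp[n * K + k] - maxlog) / sum;
}
```

The abstract's remarks on memory access suggest that a tuned version would stage the per-component parameters in shared memory and coalesce global reads; the sketch keeps everything in global memory for brevity.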

Highlights

  • The clear advantage of using Graphics Processing Units (GPUs) is their low cost compared to clusters or supercomputers. Machine Learning (ML) algorithms are often costly, since learning is a task that requires a large amount of knowledge and its constant refinement, demanding massive data computation

  • Compared with the approaches of Machlica et al. (2011) and Kumar et al. (2009), in our proposal the main loop of the algorithm is implemented sequentially and different CUDA kernels are in charge of running the different steps of the algorithm; see the host-side sketch after this list

  • We have used the Arabic Spoken Digit dataset from the UCI Repository to test the algorithm implementation. This dataset consists of instances with 13 Mel Frequency Cepstral Coefficients (MFCCs), which are widely used to represent audio signals in speech processing systems; such systems commonly use Gaussian Mixture Models (GMMs) to model the distribution of phones in the language
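As a hedged sketch of the control flow described in the second highlight (a sequential main loop with a separate kernel per EM step), the host code below launches an assumed E-step kernel and an assumed M-step accumulation kernel once per iteration; the kernel names, signatures, and block size are illustrative assumptions, not the authors' implementation.

```cuda
// Hedged sketch of the host-side control flow described in the highlights:
// the main EM loop runs sequentially on the CPU and each step of the
// algorithm is delegated to its own kernel launch. All names are assumptions.
#include <cuda_runtime.h>

// Assumed kernels (an E-step sketch appears after the Abstract, an M-step
// accumulation sketch after the Introduction).
__global__ void estep_kernel(const float *data, const float *means,
                             const float *vars, const float *weights,
                             float *resp, int N, int D, int K);
__global__ void mstep_accumulate_kernel(const float *data, const float *resp,
                                        float *sum_resp, float *sum_x,
                                        float *sum_xx, int N, int D, int K);

void train_gmm(const float *d_data, float *d_means, float *d_vars,
               float *d_weights, float *d_resp,
               float *d_sum_resp, float *d_sum_x, float *d_sum_xx,
               int N, int D, int K, int max_iters)
{
    const int threads = 256;
    const int blocks  = (N + threads - 1) / threads;

    for (int it = 0; it < max_iters; ++it) {
        // E-step: one thread per sample fills the N x K responsibility matrix.
        estep_kernel<<<blocks, threads>>>(d_data, d_means, d_vars,
                                          d_weights, d_resp, N, D, K);

        // M-step, part 1: accumulate sufficient statistics over the data set.
        cudaMemset(d_sum_resp, 0, K * sizeof(float));
        cudaMemset(d_sum_x,    0, (size_t)K * D * sizeof(float));
        cudaMemset(d_sum_xx,   0, (size_t)K * D * sizeof(float));
        mstep_accumulate_kernel<<<blocks, threads>>>(d_data, d_resp, d_sum_resp,
                                                     d_sum_x, d_sum_xx, N, D, K);

        // M-step, part 2: the parameter update from the accumulated sums is
        // cheap (only K x D values) and could run on the GPU or the host.
        cudaDeviceSynchronize();
        // A convergence test on the log-likelihood would normally follow here.
    }
}
```

Keeping the loop on the host matches the highlight: only the data-parallel work that iterates over all N samples is offloaded, while the cheap per-iteration bookkeeping and convergence test stay sequential.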


Summary

Introduction

The clear advantage of using GPUs is their low cost compared to clusters or supercomputers, while Machine Learning (ML) algorithms are often costly, since learning is a task that requires a large amount of knowledge and its constant refinement, demanding massive data computation. A major problem of massive computing is the limitation of mainstream processing power compared to multi-core processors. Even the early NVIDIA™ GeForce™ 8400 GS graphics card, for instance, is able to run up to 32 threads in parallel per clock cycle, under some restrictions. Such limitations can be overcome using parallel processing of data on newer architectures, and applications in several domains can be observed in the recent scientific literature. One of these recent architectures is NVIDIA™ CUDA. It is possible to use the CUDA-C programming language, for instance, to write parallelized source code.
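To make the CUDA-C remark concrete, the sketch below shows one way the M-step accumulation assumed in the loop above could be written as a data-parallel kernel, with one thread per sample adding its responsibility-weighted contributions via atomicAdd; this is an assumed, simplest-correct variant, not the paper's implementation, which reports further tuning of global and shared memory access.

```cuda
// Hedged sketch (not the paper's code): M-step accumulation for a
// diagonal-covariance GMM, one thread per sample, atomicAdd into global
// accumulators. A shared-memory reduction would reduce atomic contention.
#include <cuda_runtime.h>

__global__ void mstep_accumulate_kernel(const float *data,  // N x D samples
                                        const float *resp,  // N x K responsibilities
                                        float *sum_resp,    // K     (output)
                                        float *sum_x,       // K x D (output)
                                        float *sum_xx,      // K x D (output)
                                        int N, int D, int K)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;

    for (int k = 0; k < K; ++k) {
        float r = resp[n * K + k];
        atomicAdd(&sum_resp[k], r);
        for (int d = 0; d < D; ++d) {
            float x = data[n * D + d];
            atomicAdd(&sum_x[k * D + d],  r * x);       // weighted sum of x
            atomicAdd(&sum_xx[k * D + d], r * x * x);   // weighted sum of x^2
        }
    }
}
```

From the accumulated sums, the updated parameters follow directly: w_k = sum_resp[k] / N, mu_k = sum_x[k] / sum_resp[k], and the diagonal variances sum_xx[k] / sum_resp[k] - mu_k^2.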

