Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications

doi:10.1145/2002945

Abstract

With advances in data collection and storage technologies, large data sources have become ubiquitous. Today, organizations routinely collect terabytes of data every day with the intent of gleaning non-trivial insights into their business processes. To benefit from these advances, data mining and machine learning techniques must scale to such proportions. Such scaling can be achieved through the design of new and faster algorithms, through the use of parallelism, or both. Furthermore, emerging and future processor architectures (such as multi-core processors) will rely on user-specified parallelism to provide any performance gains. Unfortunately, achieving such scaling is non-trivial, and only a handful of research efforts in the data mining and machine learning communities have attempted to address these scales. At the other end of the spectrum, the past few years have witnessed the emergence of several platforms for the implementation and deployment of large-scale analytics. Examples of such platforms include Hadoop (Apache) and Dryad (Microsoft). These platforms have been developed by the large-scale distributed processing community and can not only simplify implementation but also support execution on the cloud, making large-scale machine learning and data mining both affordable and available to all. Today, there is a large gap between the data mining/machine learning community and the large-scale distributed processing community. To make advances in large-scale analytics, it is imperative that these two communities work hand in hand. The intent of this workshop is to further research efforts on large-scale data mining and to encourage researchers and practitioners to share their studies and experiences on the implementation and deployment of scalable data mining and machine learning algorithms.
