Large-scale hierarchical classification with rare categories and inconsistencies

Azad Naik, Huzefa Rangwala

doi:10.1145/2911172.2911182

Abstract

Structuring data is crucial for managing the massive amounts of data now available. A hierarchy (taxonomy) provides a natural and convenient way to organize information. Hierarchies have been used extensively in several domains, such as gene taxonomies for organizing gene sequences, the international patent hierarchy for easy browsing and retrieval of patent documents, the DMOZ taxonomy for web-page categorization, and the ImageNet database for indexing millions of images according to the WordNet hierarchy. Given a hierarchy containing thousands of classes (categories) and millions of instances (examples), there is an essential need for efficient, automated approaches to categorizing unlabeled test instances. This problem is referred to as the Hierarchical Classification (HC) task. HC is an important machine learning problem that has been researched extensively in the past few years (Silla Jr & Freitas, 2011). The popularity of the large-scale HC problem is evident from the HC competitions organized in recent years, such as LSHTC, BioASQ, and ILSVRC. HC poses several challenges: (i) data imbalance, with a large number of classes having very few positive training examples (rare categories); (ii) multi-label classification; (iii) feature selection; (iv) inconsistent hierarchies resulting from manual design by domain experts; and (v) scalability, due to the large number of examples, features, and classes. Several approaches that address these issues individually (or several of them together) have been developed over the years (Gopal & Yang, 2013; Babbar et al., 2013); however, existing methods leave considerable room for improvement. Specifically, we have developed methods for handling the rare-category and inconsistent-hierarchy problems.

Full Text