Abstract

Most machine learning methods assume that classes have a roughly balanced number of instances. However, in many real-life problems some types of instances appear far more frequently than others, which biases the classifier towards the majority class during training. The problem becomes even more challenging with multiple classes, where the relationships between classes are not easily defined. Learning from multi-class imbalanced data has not been widely considered in the context of big data mining, even though this learning difficulty appears frequently in that domain. In this paper, we address this challenge by proposing a comprehensive ensemble-based framework. We analyze each class to extract instance-level characteristics that describe the difficulty of its instances, and we embed this information into the existing UnderBagging framework. Our ensemble samples instances with probabilities proportional to their difficulty levels. This focuses the learning process on the most difficult instances and better captures the properties of multi-class imbalanced problems. We implemented our framework on Apache Spark to allow for high-performance computing over big data sets. Our experimental study shows that taking instance-level difficulty into account leads to significantly more accurate ensembles.
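The sampling idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: it runs on a single machine rather than on Apache Spark, and it assumes that a per-instance difficulty score in [0, 1] is already available, since the abstract does not say how difficulty is computed. The function name `difficulty_weighted_bag`, the `smoothing` parameter, sampling with replacement, and balancing every class to the size of the smallest class are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def difficulty_weighted_bag(labels, difficulty, rng, smoothing=1e-3):
    """Draw one class-balanced bag of instance indices (hypothetical helper).

    Every class contributes as many draws as the smallest class has instances.
    Within a class, an instance is drawn with replacement, with probability
    proportional to its difficulty score plus a small smoothing term.
    """
    # Group instance indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Assumption: undersample every class down to the smallest class size.
    bag_size = min(len(members) for members in by_class.values())

    bag = []
    for members in by_class.values():
        # Smoothing keeps the weights positive even if all scores are zero.
        weights = [difficulty[i] + smoothing for i in members]
        bag.extend(rng.choices(members, weights=weights, k=bag_size))
    return bag


# Toy data: three classes with 6, 3, and 2 instances and made-up difficulty scores.
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
difficulty = [0.0, 0.1, 0.9, 0.2, 0.8, 0.0, 0.5, 0.5, 0.1, 0.3, 0.7]

rng = random.Random(0)
# One bag per ensemble member; each bag would train one base classifier.
bags = [difficulty_weighted_bag(labels, difficulty, rng) for _ in range(5)]
```

Each bag holds the same number of indices from every class, and the harder instances of a class tend to appear in more bags than the easy ones.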
