Extracting Features from Textual Data in Class Imbalance Problems

Sarang Aravamuthan, Prasad Jogalekar, Jonghae Lee

doi:10.4995/jclr.2022.18200

Open Access

Abstract

We address class imbalance problems: classification problems in which the target variable is binary and one class dominates the other. A central objective in these problems is to identify features that yield models with high precision and recall, the standard yardsticks for assessing such models. Our features are extracted from the textual data inherent in such problems. We use n-gram frequencies as features and introduce a discrepancy score that measures how effectively an n-gram highlights the minority class. The frequency counts of the n-grams with the highest discrepancy scores are used as features to construct models with the desired metrics. Under the best practices followed by the services industry, many customer support tickets are audited and tagged as “contract-compliant”, whereas some are tagged as “over-delivered”. Using in-field data, we train a random forest classifier and perform a randomized grid search over the model hyperparameters. Models are scored using a scoring function. Our objective is to minimize follow-up costs by optimizing the recall score while maintaining a base-level precision score. The final optimized model achieves an acceptable recall score while staying above the target precision. We validate our feature selection method by comparing our model with one constructed from frequency counts of randomly chosen n-grams. We propose extensions of our feature extraction method to general classification (binary and multi-class) and regression problems. The discrepancy score is one measure of the dissimilarity of distributions; other, more general measures that we formulate could potentially yield more effective models.
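
The feature-selection idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the discrepancy score, so the formula used here (an n-gram's relative frequency in the minority class minus its relative frequency in the majority class) is an assumption made purely for illustration and may differ from the paper's definition. The function names, the whitespace tokenizer, and the toy tickets are likewise hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(texts, n):
    """Relative frequency of each n-gram over a collection of texts."""
    counts = Counter(g for t in texts for g in ngrams(t, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def discrepancy_scores(texts, labels, n=1):
    """ASSUMED score, not the paper's definition:
    minority-class relative frequency minus majority-class relative frequency.
    Label 1 marks the minority class, label 0 the majority class."""
    minority = [t for t, y in zip(texts, labels) if y == 1]
    majority = [t for t, y in zip(texts, labels) if y == 0]
    f_min = relative_frequencies(minority, n)
    f_maj = relative_frequencies(majority, n)
    return {g: f_min.get(g, 0.0) - f_maj.get(g, 0.0)
            for g in set(f_min) | set(f_maj)}


def top_ngrams(scores, k):
    """The k n-grams with the highest scores (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [g for g, _ in ranked[:k]]


def count_features(text, selected, n=1):
    """Frequency counts of the selected n-grams in one text."""
    counts = Counter(ngrams(text, n))
    return [counts[g] for g in selected]


# Hypothetical toy tickets: 0 = "contract-compliant", 1 = "over-delivered".
texts = [
    "reset password for user",
    "reset password again",
    "install printer driver",
    "extra onsite visit requested",
]
labels = [0, 0, 0, 1]

scores = discrepancy_scores(texts, labels, n=1)
selected = top_ngrams(scores, k=2)
features = [count_features(t, selected) for t in texts]
```

The resulting count vectors could then be passed to a classifier such as the random forest mentioned in the abstract; that step is omitted to keep the sketch dependency-free.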

Journal: Journal of Computer-Assisted Linguistic Research
Publication Date: Nov 23, 2022
License type: CC BY-NC-ND 4.0

Similar Papers

A Novel Modelling Technique for Early Recognition and Classification of Alzheimer’s disease
Dinu A J ... Manju R
13 May 2021

An intelligent system for accelerating parallel SVM classification problems on large datasets using GPU
Qi Li ... Vojislav Kecman
01 Nov 2010

Adaptive application of machine learning models on separate segments of a data sample in regression and classification problems
Iliya Lebedev
Информационно-управляющие системы
24 Jun 2022

Power Equipment Defects Prediction Based on the Joint Solution of Classification and Regression Problems Using Machine Learning Methods
Ivan Shcherbatov ... Marek Dvořák
Electronics | VOL. 10
17 Dec 2021
