Abstract

In machine learning, a natural way to represent an instance is by using a feature vector. However, several studies have shown that this representation may not accurately characterize an object. For classification problems, the dissimilarity paradigm has been proposed as an alternative to the standard feature-based approach. Encoding each object by its pairwise dissimilarities has been shown to improve data quality because it mitigates complexities such as class overlap, small disjuncts, and small sample size. However, its suitability and performance when applied to regression problems have not been fully explored. This study redefines the dissimilarity representation for regression. To this end, we carried out an extensive experimental evaluation on 34 datasets using two linear regression models. The results show that the dissimilarity approach decreases the prediction errors of both traditional linear regression and the linear model with elastic net regularization, and it also reduces the complexity of most regression datasets.
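To make the pipeline concrete, the following is a minimal sketch of the dissimilarity approach for regression. The excerpt does not specify the dissimilarity measure or the prototype-selection strategy, so the sketch assumes Euclidean distance and random prototype selection, and uses scikit-learn's diabetes data as a stand-in for the paper's 34 benchmarks:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in dataset; the paper evaluates 34 benchmark datasets.
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Representation set R: a random subset of training objects serves as
# prototypes here (one of several possible selection strategies).
rng = np.random.default_rng(0)
R = X_tr[rng.choice(len(X_tr), size=50, replace=False)]

# Dissimilarity representation: each object is re-encoded as its vector
# of Euclidean distances to the prototypes in R.
D_tr, D_te = cdist(X_tr, R), cdist(X_te, R)

for name, make in [("linear regression", LinearRegression),
                   ("elastic net", lambda: ElasticNet(alpha=0.1))]:
    feat = make().fit(X_tr, y_tr)   # model on the feature representation
    diss = make().fit(D_tr, y_tr)   # same model on the dissimilarity representation
    print(f"{name}: feature MSE = "
          f"{mean_squared_error(y_te, feat.predict(X_te)):.1f}, "
          f"dissimilarity MSE = "
          f"{mean_squared_error(y_te, diss.predict(D_te)):.1f}")
```

The key point is that only the input encoding changes: the same two linear models are fitted on the original features and on the distance-to-prototype columns, and their test errors are compared.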

Highlights

  • An underlying step in machine learning and pattern recognition is the characterization of objects, where an ideally good representation ensures the building of accurate learning algorithms [1]

  • Taking into account that the ultimate goal of this work is to investigate the benefits of the dissimilarity representation over the feature representation in the context of regression, we performed a systematic experimental study using two linear regression models and a pool of gold-standard datasets

  • We analyzed the effect of selecting different representation set sizes on the performance of the regression models
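A minimal sketch of that size analysis, under the same illustrative assumptions as above (random prototype selection, Euclidean distance, diabetes data as a stand-in), could look like this:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)

# Sweep the representation set size k and track cross-validated error.
# (In a rigorous setup, R would be selected inside each training fold.)
for k in (10, 25, 50, 100):
    R = X[rng.choice(len(X), size=k, replace=False)]  # k random prototypes
    D = cdist(X, R)                                   # n x k dissimilarity space
    mse = -cross_val_score(LinearRegression(), D, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    print(f"k = {k:>3}: CV MSE = {mse:.1f}")
```

Because each prototype contributes one dimension, the representation set size directly controls the dimensionality of the dissimilarity space and thus the capacity of the linear model fitted on it.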

Summary

Introduction

An underlying step in machine learning and pattern recognition is the characterization of objects, where an ideally good representation ensures the building of accurate learning algorithms [1]. Traditionally, an object is described by a feature vector x = [x1, x2, …, xn]T ∈ Rn, where each xi is a numeric attribute (feature) whose values are obtained through observation or as samples of the data (e.g., pixels of an image) [3], [4]. This representation may not capture the internal structure of objects that have an intrinsic and detectable organization [5]–[7]. Moreover, it is often difficult to obtain an appropriate feature-based characterization, leading to a high-dimensional representation with class overlap or to a mixture of continuous and categorical features [5], [8]. As an alternative, each object can instead be encoded by its pairwise dissimilarities to a set of representative objects, yielding the so-called dissimilarity representation. Several studies have demonstrated that this alternative representation offers practical advantages over the feature representation: i) it allows the use of a simple linear prediction model [10], ii) it yields good separability between classes [11], iii) all dimensions in the dissimilarity space are relevant [11], and iv) the small disjunct problem is reduced [12].
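Concretely, the dissimilarity space is built by fixing a representation set of prototype objects and re-encoding every object as its vector of dissimilarities to those prototypes. The standard formulation is sketched below; the choice of the measure d (e.g., Euclidean distance) is an illustrative assumption, as the excerpt does not fix it:

```latex
% Dissimilarity-space embedding of an object x given a representation
% set R = {p_1, ..., p_k}. The measure d (e.g., Euclidean distance) is
% an illustrative assumption; the excerpt does not specify it.
\[
  D(\mathbf{x}, R) =
  \bigl[\, d(\mathbf{x}, p_1),\; d(\mathbf{x}, p_2),\; \ldots,\; d(\mathbf{x}, p_k) \,\bigr]^{\mathsf{T}}
  \in \mathbb{R}^{k}
\]
```

A linear regressor is then fitted on D(x, R) instead of x; by construction, every dimension of the new space measures the dissimilarity to one prototype, which is why all dimensions are relevant (point iii above).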
