Imbalanced data describes the situation where the least represented observations in a dataset are the ones of main interest. In some contexts, these observations are labeled “outliers,” which is rather dangerous: once branded as outliers, they tend to be excluded or removed from the data, which ultimately leads to an unrepresentative analysis and misleading results.
In today’s world of extreme events such as wildfires, pandemics, and economic crises, it is easy to spot fields of application for imbalanced data: meteorological and ecological catastrophes, bank fraud, high-risk insurance portfolios, electricity pilferage, etc.
Addressing classification problems for imbalanced data is well covered in many papers and articles, while regression problems are almost always overlooked, even though handling them is significantly different. In fact, most classification problems originally come from continuous variables that were discretized into categories for classification analysis. In that transformation, patterns and dependencies are lost due to the change in data type.
Overview
This article covers:
General intuition and techniques for dealing with imbalanced data for regression
Data preprocessing techniques
Model processing using the UBR technique
Evaluation metrics for imbalanced regression
Application on UBR using imbalanced data
Conclusion
| Techniques for Imbalanced Data in Regression
Existing machine learning models for regression are mainly built on balanced, or almost balanced, data, which leads to misleading results and very poor performance when such models face imbalanced data. To use these models on imbalanced data, you have one of two options: first, increase the representation of the observations of interest relative to the other observations (or vice versa); second, adapt the model itself by tuning its parameters against customized criteria. I am going to discuss these two main strategies for dealing with imbalanced data, namely data preprocessing and model processing. There are two other strategies, post-processing and hybrid approaches, but I am not addressing them in this article.
| Preprocessing Techniques
Preprocessing techniques mainly apply oversampling, undersampling, or a mixture of the two to the data before fitting a traditional machine learning regression model.
“Preprocessing techniques force the model to learn about the rare observations of interest in the data.”
For classification, preprocessing is a much easier task because of the clear segregation between classes, which is defined from the start. For a continuous target variable, nearly all references rely on the so-called “relevance function” to handle this difficult mission. The relevance function takes a value between 0 and 1; the closer the relevance of a target value is to 1, the rarer and more important that value is.
Heads up! In the continuous case, imbalanced data are more often referred to as skewed data. The definition of the relevance function varies with the data and with the available information about its distribution; later I will give an example of one commonly used definition.
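To make the idea concrete, here is a minimal, illustrative relevance function. This is an assumption for the sketch, not the canonical definition: established implementations (e.g. the sigmoid/spline-based relevance of Torgo and Ribeiro) are fitted from user-supplied control points. The function name `relevance` and the shape parameter `k` are hypothetical.

```python
import numpy as np

def relevance(y, k=0.5):
    """Illustrative relevance function (an assumption, not the canonical
    definition). Maps each target value to [0, 1), assigning values far
    from the median a relevance close to 1. `k` controls how quickly
    relevance saturates."""
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    q25, q75 = np.percentile(y, [25, 75])
    iqr = q75 - q25
    if iqr == 0:          # guard against a degenerate spread
        iqr = 1.0
    z = np.abs(y - med) / iqr   # robust distance from the "normal" region
    return z / (z + k)          # smooth map into [0, 1); -> 1 for extremes
```

On a right-skewed sample such as `[1, 2, 3, 4, 5, 100]`, the extreme value 100 receives the highest relevance, exactly the behavior a relevance function should exhibit.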
There are a few preprocessing techniques for continuous targets, such as random undersampling, SMOTE for regression, Gaussian noise, and the SMOGN algorithm. Briefly, SMOTE for regression is an oversampling technique, while Gaussian noise and SMOGN combine both under- and oversampling.
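The simplest of these, random undersampling, can be sketched as follows. This is a minimal illustration, assuming a precomputed relevance score `phi` in [0, 1] for each observation (for example, from a relevance function as discussed above); the function name `random_undersample` and its parameters are assumptions for this sketch, not a library API.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_undersample(X, y, phi, threshold=0.8, keep_frac=0.5):
    """Random undersampling for regression (a minimal sketch).
    Keeps every rare case (phi >= threshold) and retains only a random
    fraction `keep_frac` of the common, low-relevance cases."""
    X, y, phi = np.asarray(X), np.asarray(y), np.asarray(phi)
    rare = phi >= threshold
    common_idx = np.flatnonzero(~rare)
    n_keep = int(len(common_idx) * keep_frac)
    kept_common = rng.choice(common_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([np.flatnonzero(rare), kept_common]))
    return X[keep], y[keep]
```

By discarding part of the common cases, the rare observations make up a larger share of the resampled training set, which is what forces the downstream regression model to pay attention to them.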