Some Remedies for Severe Class Imbalance

Dr. Juan Orduz
satRday Berlin - 15.06.2019

Content

1. Data Set Description

2. Data Preparation

3. Some Model Performance Metrics

4. Machine Learning Models

  • PLS
  • GBM

5. Experiments & Results

  • Max Accuracy
  • Max Sensitivity
  • Alternative Cutoffs
  • Sampling Methods

6. Other Techniques

7. References & Contact

Data Set

data(AdultUCI, package = "arules")
raw_data <- AdultUCI

glimpse(raw_data, width = 60)
Observations: 48,842
Variables: 15
$ age              <int> 39, 50, 38, 53, 28, 37, 49, 52, …
$ workclass        <fct> State-gov, Self-emp-not-inc, Pri…
$ fnlwgt           <int> 77516, 83311, 215646, 234721, 33…
$ education        <ord> Bachelors, Bachelors, HS-grad, 1…
$ `education-num`  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, …
$ `marital-status` <fct> Never-married, Married-civ-spous…
$ occupation       <fct> Adm-clerical, Exec-managerial, H…
$ relationship     <fct> Not-in-family, Husband, Not-in-f…
$ race             <fct> White, White, White, Black, Blac…
$ sex              <fct> Male, Male, Male, Male, Female, …
$ `capital-gain`   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084…
$ `capital-loss`   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `hours-per-week` <int> 40, 13, 40, 40, 40, 40, 16, 45, …
$ `native-country` <fct> United-States, United-States, Un…
$ income           <ord> small, small, small, small, smal…
data_df <- model_list$functions$format_raw_data(df = raw_data)

Income Variable

(Figure: distribution of the income variable.)
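A quick way to quantify the imbalance directly (a sketch; the trivial model's 0.751 test accuracy reported later implies roughly 75% small vs. 25% large):

# Count classes and compute their shares.
data_df %>% 
  count(income) %>% 
  mutate(share = n / sum(n))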

Exploratory Data Analysis - Visualization

(Figure: exploratory data visualization.)

Feature Engineering

\[ x \mapsto \log(x + 1) \]

(Figure: variable distributions after the \( \log(x + 1) \) transformation.)

Data Preparation

df <- data_df %>% 
  mutate(capital_gain_log = log(`capital-gain` + 1), 
         capital_loss_log = log(`capital-loss` + 1)) %>% 
  select(- `capital-gain`, - `capital-loss`) %>% 
  drop_na()

# Define observation matrix and target vector. 
X <- df %>% select(- income)
y <- df %>% pull(income) %>% fct_rev()

# Add dummy variables. 
dummy_obj <- dummyVars("~ .", data = X, sep = "_")

X <- predict(object = dummy_obj, newdata = X) %>% as_tibble()

# Remove predictors with near zero variance. 
cols_to_rm <- colnames(X)[nearZeroVar(x = X, freqCut = 5000)]

X %<>% select(- all_of(cols_to_rm)) 

Data Split

# Split train - other
split_index_1 <- createDataPartition(y = y, p = 0.7)$Resample1

X_train <- X[split_index_1, ]
y_train <- y[split_index_1]

X_other <- X[- split_index_1, ]
y_other <- y[- split_index_1]

split_index_2 <- createDataPartition(y = y_other, 
                                     p = 1/3)$Resample1

# Split evaluation - test
X_eval <- X_other[split_index_2, ]
y_eval <- y_other[split_index_2]

X_test <- X_other[- split_index_2, ]
y_test <- y_other[- split_index_2]

Confusion Matrix

We consider income = large as the positive class.

                      Condition Positive   Condition Negative
Prediction Positive   TP                   FP
Prediction Negative   FN                   TN
  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative
  • N = TP + TN + FP + FN

Performance Metrics

  • Accuracy

\[ \text{acc} = \frac{TP + TN}{N} \]

  • Kappa

\[ \kappa = \frac{\text{acc} - p_e}{1 - p_e} \]

where \( p_e \) = Expected Accuracy (random chance).

The kappa metric can be thought of as a modification of the accuracy metric that accounts for the class proportions.
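For a binary problem, the expected accuracy \( p_e \) can be computed from the confusion matrix counts (a standard formula, added here for completeness):

\[ p_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{N^2} \]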

Performance Metrics

\[ \text{sens} = \frac{TP}{TP + FN} \]

\[ \text{spec} = \frac{TN}{TN + FP} \]

\[ \text{prec} = \frac{TP}{TP + FP} \]

\[ F_\beta = (1 + \beta^2)\frac{\text{prec}\times \text{recall}}{\beta^2\,\text{prec} + \text{recall}} \]

where recall = sens. The model summaries below report \( F_1 \), i.e. \( \beta = 1 \).
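As a sketch, these metrics can also be computed with caret helpers, assuming a factor of predictions y_pred (hypothetical name) with the same levels as y_test:

# "large" is the positive / relevant class.
sensitivity(data = y_pred, reference = y_test, positive = "large")
specificity(data = y_pred, reference = y_test, negative = "small")
precision(data = y_pred, reference = y_test, relevant = "large")
F_meas(data = y_pred, reference = y_test, relevant = "large", beta = 1)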

ROC Curve

The ROC curve is created by plotting the true positive rate (= sensitivity) against the false positive rate (= 1 − specificity) at various probability thresholds.

  • AUC : Area under the ROC curve.

Machine Learning Models

1. Trivial Model

Always predict the same class.

2. Partial Least Squares + Logistic Regression

Supervised dimensionality reduction.

3. Stochastic Gradient Boosting

Tree ensemble model.

Trivial Model

We always predict the same class, income = small.

y_pred_trivial <- map_chr(.x = y_test, .f = ~ "small") %>% 
  factor(levels = levels(y_test))

We compute the confusion matrix to get the metrics.

# Confusion Matrix. 
conf_matrix_trivial <-  confusionMatrix(data = y_pred_trivial, 
                                        reference =  y_test)
term         estimate
accuracy     0.751
kappa        0.000
sensitivity  0.000
specificity  1.000

Trivial Model - ROC

(Figure: ROC curve for the trivial model.)

We can use the pROC package.
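A minimal sketch of the pROC usage, assuming a vector prob_large (hypothetical name) of predicted probabilities for the positive class:

library(pROC)

roc_obj <- roc(response = y_test, 
               predictor = prob_large, 
               levels = c("small", "large"))  # controls, cases

auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # ROC curve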

Train Control + Train in Caret

# Summary function: combine ROC/Sens/Spec with Accuracy/Kappa.
five_stats <- function(...) {
  c(twoClassSummary(...), defaultSummary(...))
}

# Define cross-validation.
cv_num <- 7

train_control <- trainControl(method = "cv",
                              number = cv_num,
                              classProbs = TRUE,
                              summaryFunction = five_stats,
                              allowParallel = TRUE,
                              verboseIter = FALSE)

# Generic train call; `method` and `metric` vary across the experiments below.
model_obj <- train(x = X_train,
                   y = y_train,
                   method = method,
                   tuneLength = 10,
                   # For linear models we scale and center.
                   preProcess = c("scale", "center"),
                   trControl = train_control,
                   metric = metric)
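A hedged example of how these placeholders might be set (an assumption, not shown in the original deck):

method <- "pls"       # or "gbm"
metric <- "Accuracy"  # or "Kappa", "ROC", "Sens", "Spec" (all provided by five_stats)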

PLS Model - Max Accuracy

accuracy kappa sensitivity specificity
0.838 0.527 0.552 0.932

GBM Model - Max Accuracy

accuracy kappa sensitivity specificity
0.869 0.629 0.655 0.94

PLS Model - Max Sensitivity

accuracy kappa sensitivity specificity
0.84 0.535 0.561 0.932

GBM Model - Max Sensitivity

accuracy kappa sensitivity specificity
0.87 0.635 0.669 0.936

PLS Model - Alternative Cut-Off

accuracy kappa sensitivity specificity
0.8 0.534 0.82 0.793

GBM Model - Alternative Cut-Off

accuracy kappa sensitivity specificity
0.836 0.611 0.862 0.827

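A sketch of how such an alternative cut-off can be derived from the ROC curve with pROC::coords, assuming roc_obj and prob_large as in the pROC sketch above (the cut-off is typically chosen on the evaluation set, not the test set):

# Threshold maximizing sensitivity + specificity (Youden's J statistic).
best_cutoff <- coords(roc_obj, x = "best", 
                      best.method = "youden", 
                      ret = "threshold")
best_cutoff <- unlist(best_cutoff)  # coords may return a one-row data frame

# Re-classify with the new cut-off instead of the default 0.5.
y_pred_alt <- ifelse(prob_large >= best_cutoff, "large", "small") %>% 
  factor(levels = levels(y_test))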

Sampling Methods - Up/Down Sampling

  • Up-sampling is any technique that simulates or imputes additional data points to improve balance across classes.

  • Down-sampling is any technique that reduces the number of samples to improve the balance across classes.

In caret:

df_upSample_train <- upSample(x = X_train, 
                              y = y_train, 
                              yname = "income")

X_upSample_train <- df_upSample_train %>% select(- income) 
y_upSample_train <- df_upSample_train %>% pull(income)
class  count  share
large  16148  0.5
small  16148  0.5
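Down-sampling works analogously with caret::downSample (a sketch; it randomly drops majority-class rows until the classes are balanced):

df_downSample_train <- downSample(x = X_train, 
                                  y = y_train, 
                                  yname = "income")

X_downSample_train <- df_downSample_train %>% select(- income) 
y_downSample_train <- df_downSample_train %>% pull(income)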

PLS Model - Up Sampling

accuracy kappa sensitivity specificity
0.791 0.527 0.854 0.77

GBM Model - Up Sampling

accuracy kappa sensitivity specificity
0.841 0.62 0.854 0.837

Sampling Methods - SMOTE

SMOTE is a data sampling procedure that uses both up-sampling and down-sampling. To up-sample for the minority class, it synthesizes new cases: a data point is randomly selected from the minority class and its K-nearest neighbors are determined. The new synthetic data point is a random combination of the predictors of the randomly selected data point and its neighbors.

We can use the DMwR package:

df_smote_train <- DMwR::SMOTE(
  form = income ~ ., 
  # perc.over = 200: generate two synthetic minority cases per original minority case.
  perc.over = 200, 
  # perc.under = 150: keep 1.5 majority cases per synthetic minority case generated.
  perc.under = 150, 
  data = as.data.frame(bind_cols(income = y_train, X_train))
)

X_smote_train <- df_smote_train  %>% select(- income) 
y_smote_train <- df_smote_train  %>% pull(income) 
class  count  share
large  16065  0.5
small  16065  0.5

PLS Model - SMOTE

accuracy kappa sensitivity specificity
0.799 0.52 0.774 0.807

GBM Model - SMOTE

accuracy kappa sensitivity specificity
0.864 0.624 0.678 0.925

Model Summary - PLS

Method  Tag          Sensitivity  Specificity  Precision  Recall  F1
pls     Accuracy     0.552        0.932        0.730      0.552   0.628
pls     Sens         0.561        0.932        0.733      0.561   0.636
pls     Alt Cutoff   0.820        0.793        0.568      0.820   0.671
pls     Up Sampling  0.854        0.770        0.552      0.854   0.670
pls     SMOTE        0.774        0.807        0.571      0.774   0.657

Model Summary - GBM

Method  Tag          Sensitivity  Specificity  Precision  Recall  F1
gbm     Accuracy     0.655        0.940        0.782      0.655   0.713
gbm     Sens         0.669        0.936        0.777      0.669   0.719
gbm     Alt Cutoff   0.862        0.827        0.624      0.862   0.724
gbm     Up Sampling  0.854        0.837        0.635      0.854   0.728
gbm     SMOTE        0.678        0.925        0.751      0.678   0.713

Other Techniques

  • Adjusting Prior Probabilities

  • Cost-Sensitive Training
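As a sketch of the cost-sensitive idea (not part of the experiments above), caret's train can pass a loss matrix to a CART model so that false negatives on the large class are penalized more heavily; the cost of 5 is an arbitrary illustration:

# rpart loss matrix: rows = true class, columns = predicted class.
# Levels of y_train are c("large", "small"), so entry [1, 2] is the cost of
# predicting "small" when the truth is "large" (a false negative).
cost_matrix <- matrix(c(0, 1, 5, 0), nrow = 2)

cart_cost <- train(x = X_train,
                   y = y_train,
                   method = "rpart",
                   tuneLength = 10,
                   parms = list(loss = cost_matrix),
                   trControl = train_control,
                   metric = "Kappa")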

References & Contact

Book:

Applied Predictive Modeling, by Max Kuhn and Kjell Johnson.

Blog Post:

https://juanitorduz.github.io/class_imbalance

Contact:

juanitorduz@gmail.com