Dr. Juan Orduz
satRday Berlin - 15.06.2019
library(tidyverse)  # dplyr, tidyr, purrr, forcats
library(magrittr)   # %<>% assignment pipe
library(caret)      # dummyVars, train, confusionMatrix, upSample, ...
data(AdultUCI, package = "arules")
raw_data <- AdultUCI
glimpse(raw_data, width = 60)
Observations: 48,842
Variables: 15
$ age <int> 39, 50, 38, 53, 28, 37, 49, 52, …
$ workclass <fct> State-gov, Self-emp-not-inc, Pri…
$ fnlwgt <int> 77516, 83311, 215646, 234721, 33…
$ education <ord> Bachelors, Bachelors, HS-grad, 1…
$ `education-num` <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, …
$ `marital-status` <fct> Never-married, Married-civ-spous…
$ occupation <fct> Adm-clerical, Exec-managerial, H…
$ relationship <fct> Not-in-family, Husband, Not-in-f…
$ race <fct> White, White, White, Black, Blac…
$ sex <fct> Male, Male, Male, Male, Female, …
$ `capital-gain` <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084…
$ `capital-loss` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `hours-per-week` <int> 40, 13, 40, 40, 40, 40, 16, 45, …
$ `native-country` <fct> United-States, United-States, Un…
$ income <ord> small, small, small, small, smal…
data_df <- model_list$functions$format_raw_data(df = raw_data)
We transform the `capital-gain` and `capital-loss` variables via the map
\[ x \mapsto \log(x + 1) \]
df <- data_df %>%
  mutate(capital_gain_log = log(`capital-gain` + 1),
         capital_loss_log = log(`capital-loss` + 1)) %>%
  select(- `capital-gain`, - `capital-loss`) %>%
  drop_na()
# Define observation matrix and target vector.
X <- df %>% select(- income)
# Reverse the factor levels so that `large` becomes the first (positive) class.
y <- df %>% pull(income) %>% fct_rev()
# Add dummy variables.
dummy_obj <- dummyVars("~ .", data = X, sep = "_")
X <- predict(object = dummy_obj, newdata = X) %>% as_tibble()
# Remove predictors with near zero variance.
cols_to_rm <- colnames(X)[nearZeroVar(x = X, freqCut = 5000)]
X %<>% select(- cols_to_rm)
# Split train - other.
split_index_1 <- createDataPartition(y = y, p = 0.7)$Resample1
X_train <- X[split_index_1, ]
y_train <- y[split_index_1]
X_other <- X[- split_index_1, ]
y_other <- y[- split_index_1]
# Split evaluation - test.
split_index_2 <- createDataPartition(y = y_other, p = 1/3)$Resample1
X_eval <- X_other[split_index_2, ]
y_eval <- y_other[split_index_2]
X_test <- X_other[- split_index_2, ]
y_test <- y_other[- split_index_2]
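Since createDataPartition performs stratified sampling on y, the income class proportions are (approximately) preserved in every split. A quick check, for instance:

# Class proportions in each split (all close to the original imbalance).
list(train = y_train, eval = y_eval, test = y_test) %>%
  map(~ prop.table(table(.x)))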
We consider the positive class to be income = large.
|  | Condition Positive | Condition Negative |
|---|---|---|
| Prediction Positive | TP | FP |
| Prediction Negative | FN | TN |
\[ \text{acc} = \frac{TP + TN}{N}, \quad \text{where } N = TP + FP + FN + TN \text{ is the total number of cases.} \]
\[ \kappa = \frac{\text{acc} - p_e}{1 - p_e} \]
where \( p_e \) is the expected accuracy under random chance agreement.
The kappa metric can be thought of as a modification of the accuracy metric that accounts for the class proportions.
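For a two-class problem the chance agreement can be written in terms of the confusion matrix counts:
\[ p_e = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{N^2}. \]
For a trivial classifier that always predicts the negative class, \( TP = FP = 0 \), so \( p_e = TN/N = \text{acc} \) and therefore \( \kappa = 0 \).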
\[ \text{sens} = \frac{TP}{TP + FN} \]
\[ \text{spec} = \frac{TN}{TN + FP} \]
\[ \text{prec} = \frac{TP}{TP + FP} \]
\[ F_\beta = (1 + \beta^2)\frac{\text{prec}\times \text{recall}}{\beta^2\text{prec} + \text{recall}} \]
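Here recall is the same as the sensitivity defined above. The summary tables below report \( F_1 \), the \( \beta = 1 \) special case:
\[ F_1 = 2\,\frac{\text{prec} \times \text{recall}}{\text{prec} + \text{recall}}. \]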
We compare three models:

- Trivial model: always predict the same class.
- PLS (partial least squares): supervised dimensionality reduction.
- GBM (gradient boosting machine): tree ensemble model.
We always predict the same class, income = small.
# Use the same (ordered) factor levels as y_test so that confusionMatrix works.
y_pred_trivial <- map_chr(.x = y_test, .f = ~ "small") %>%
  factor(levels = levels(y_test), ordered = TRUE)
We compute the confusion matrix to get the metrics.
# Confusion Matrix.
conf_matrix_trivial <- confusionMatrix(data = y_pred_trivial,
                                       reference = y_test)
| term | estimate |
|---|---|
| accuracy | 0.751 |
| kappa | 0.000 |
| sensitivity | 0.000 |
| specificity | 1.000 |

The accuracy of 0.751 is just the share of the majority class small in the test set, and \( \kappa = 0 \): despite its high accuracy, the trivial model carries no information about the minority class.
We can use the pROC package.
# Five metrics: ROC, Sens, Spec (twoClassSummary) + Accuracy, Kappa (defaultSummary).
five_stats <- function(...) {
  c(twoClassSummary(...), defaultSummary(...))
}
# Define cross validation.
cv_num <- 7
train_control <- trainControl(method = "cv",
                              number = cv_num,
                              classProbs = TRUE,
                              summaryFunction = five_stats,
                              allowParallel = TRUE,
                              verboseIter = FALSE)

model_obj <- train(x = X_train,
                   y = y_train,
                   method = method,
                   tuneLength = 10,
                   # For linear models we scale and center.
                   preProcess = c("scale", "center"),
                   trControl = train_control,
                   metric = metric)
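Here method and metric are left as variables so the same call can be reused for each model and optimization target. For example (illustrative assignments, not part of the original code):

# E.g. fit a gradient boosting machine, selecting hyperparameters by the
# cross-validated sensitivity ("Sens" comes from twoClassSummary via five_stats).
method <- "gbm"
metric <- "Sens"
# Other combinations used below: method = "pls" and metric = "Accuracy".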
pls (metric = Accuracy):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.838 | 0.527 | 0.552 | 0.932 |

gbm (metric = Accuracy):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.869 | 0.629 | 0.655 | 0.94 |
pls (metric = Sens):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.84 | 0.535 | 0.561 | 0.932 |

gbm (metric = Sens):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.87 | 0.635 | 0.669 | 0.936 |
pls (alternative cutoff):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.8 | 0.534 | 0.82 | 0.793 |

gbm (alternative cutoff):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.836 | 0.611 | 0.862 | 0.827 |
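The alternative-cutoff results keep the fitted models and only move the probability threshold used to call a case large. A minimal sketch of how such a cutoff can be chosen with the pROC package (the object names and the top-left criterion are illustrative choices, not taken from the original code):

library(pROC)
# Predicted probabilities of the positive class (large) on the evaluation set.
prob_eval <- predict(model_obj, newdata = X_eval, type = "prob")[, "large"]
roc_obj <- roc(response = y_eval, predictor = prob_eval, levels = c("small", "large"))
# Threshold closest to the top-left corner (spec = 1, sens = 1) of the ROC curve.
dist_topleft <- (1 - roc_obj$sensitivities)^2 + (1 - roc_obj$specificities)^2
alt_cutoff <- roc_obj$thresholds[which.min(dist_topleft)]
# Re-label the test predictions with the alternative cutoff.
prob_test <- predict(model_obj, newdata = X_test, type = "prob")[, "large"]
y_pred_alt <- factor(ifelse(prob_test > alt_cutoff, "large", "small"),
                     levels = levels(y_test), ordered = TRUE)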
Up-sampling is any technique that simulates or imputes additional data points to improve balance across classes.
Down-sampling is any technique that reduces the number of samples to improve the balance across classes.
In caret:
df_upSample_train <- upSample(x = X_train,
                              y = y_train,
                              yname = "income")

X_upSample_train <- df_upSample_train %>% select(- income)
y_upSample_train <- df_upSample_train %>% pull(income)
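We can check the resulting class balance (table below); caret also provides downSample with the same interface for down-sampling. For instance, assuming forcats is loaded:

# Class shares after up-sampling.
y_upSample_train %>%
  fct_count() %>%
  mutate(share = n / sum(n))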
| class | count | share |
|---|---|---|
| large | 16148 | 0.5 |
| small | 16148 | 0.5 |
pls (up-sampled training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.791 | 0.527 | 0.854 | 0.77 |

gbm (up-sampled training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.841 | 0.62 | 0.854 | 0.837 |
SMOTE is a data sampling procedure that uses both up-sampling and down-sampling. To up-sample for the minority class, it synthesizes new cases: a data point is randomly selected from the minority class and its K-nearest neighbors are determined. The new synthetic data point is a random combination of the predictors of the randomly selected data point and its neighbors.
We can use the DMwR package:
df_smote_train <- DMwR::SMOTE(
  form = income ~ .,
  # Generate 2 synthetic minority (large) cases per original minority case.
  perc.over = 200,
  # Keep 1.5 majority (small) cases for each synthetic minority case generated.
  perc.under = 150,
  data = as.data.frame(bind_cols(income = y_train, X_train))
)

X_smote_train <- df_smote_train %>% select(- income)
y_smote_train <- df_smote_train %>% pull(income)
| class | count | share |
|---|---|---|
| large | 16065 | 0.5 |
| small | 16065 | 0.5 |
pls (SMOTE training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.799 | 0.52 | 0.774 | 0.807 |

gbm (SMOTE training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.864 | 0.624 | 0.678 | 0.925 |
| Method | Tag | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| pls | Accuracy | 0.552 | 0.932 | 0.730 | 0.552 | 0.628 |
| pls | Sens | 0.561 | 0.932 | 0.733 | 0.561 | 0.636 |
| pls | Alt Cutoff | 0.820 | 0.793 | 0.568 | 0.820 | 0.671 |
| pls | Up Sampling | 0.854 | 0.770 | 0.552 | 0.854 | 0.670 |
| pls | SMOTE | 0.774 | 0.807 | 0.571 | 0.774 | 0.657 |
| Method | Tag | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| gbm | Accuracy | 0.655 | 0.940 | 0.782 | 0.655 | 0.713 |
| gbm | Sens | 0.669 | 0.936 | 0.777 | 0.669 | 0.719 |
| gbm | Alt Cutoff | 0.862 | 0.827 | 0.624 | 0.862 | 0.724 |
| gbm | Up Sampling | 0.854 | 0.837 | 0.635 | 0.854 | 0.728 |
| gbm | SMOTE | 0.678 | 0.925 | 0.751 | 0.678 | 0.713 |
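The precision, recall and F1 columns can be read off directly from caret's confusionMatrix by requesting the precision/recall mode; for example, for the trivial predictions computed earlier:

# Precision/recall view of the confusion matrix (positive class = "large").
confusionMatrix(data = y_pred_trivial,
                reference = y_test,
                positive = "large",
                mode = "prec_recall")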
Other strategies for class imbalance include:

- Adjusting Prior Probabilities
- Cost-Sensitive Training
- …
Applied Predictive Modeling, by Max Kuhn and Kjell Johnson.
https://juanitorduz.github.io/class_imbalance