Dr. Juan Orduz
satRday Berlin - 15.06.2019
library(tidyverse)  # dplyr, tidyr, purrr, forcats
library(magrittr)   # %<>% assignment pipe
library(caret)      # dummyVars, train, confusionMatrix, upSample, ...
data(AdultUCI, package = "arules")
raw_data <- AdultUCI
glimpse(raw_data, width = 60)
Observations: 48,842
Variables: 15
$ age <int> 39, 50, 38, 53, 28, 37, 49, 52, …
$ workclass <fct> State-gov, Self-emp-not-inc, Pri…
$ fnlwgt <int> 77516, 83311, 215646, 234721, 33…
$ education <ord> Bachelors, Bachelors, HS-grad, 1…
$ `education-num` <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, …
$ `marital-status` <fct> Never-married, Married-civ-spous…
$ occupation <fct> Adm-clerical, Exec-managerial, H…
$ relationship <fct> Not-in-family, Husband, Not-in-f…
$ race <fct> White, White, White, Black, Blac…
$ sex <fct> Male, Male, Male, Male, Female, …
$ `capital-gain` <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084…
$ `capital-loss` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `hours-per-week` <int> 40, 13, 40, 40, 40, 40, 16, 45, …
$ `native-country` <fct> United-States, United-States, Un…
$ income <ord> small, small, small, small, smal…
data_df <- model_list$functions$format_raw_data(df = raw_data)
We transform the `capital-gain` and `capital-loss` variables via the map
\[ x \mapsto \log(x + 1) \]
df <- data_df %>%
  mutate(capital_gain_log = log(`capital-gain` + 1),
         capital_loss_log = log(`capital-loss` + 1)) %>%
  select(- `capital-gain`, - `capital-loss`) %>%
  drop_na()
# Define observation matrix and target vector.
X <- df %>% select(- income)
# Reverse the factor levels so that `large` becomes the first (positive) class.
y <- df %>% pull(income) %>% fct_rev()
# Add dummy variables.
dummy_obj <- dummyVars("~ .", data = X, sep = "_")
X <- predict(object = dummy_obj, newdata = X) %>% as_tibble()
# Remove predictors with near zero variance.
cols_to_rm <- colnames(X)[nearZeroVar(x = X, freqCut = 5000)]
X %<>% select(- cols_to_rm)
# Split train - other.
split_index_1 <- createDataPartition(y = y, p = 0.7)$Resample1
X_train <- X[split_index_1, ]
y_train <- y[split_index_1]
X_other <- X[- split_index_1, ]
y_other <- y[- split_index_1]
# Split evaluation - test.
split_index_2 <- createDataPartition(y = y_other, p = 1/3)$Resample1
X_eval <- X_other[split_index_2, ]
y_eval <- y_other[split_index_2]
X_test <- X_other[- split_index_2, ]
y_test <- y_other[- split_index_2]
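Since createDataPartition performs stratified sampling on y, the income class proportions are (approximately) preserved in every split. A quick check, for instance:

# Class proportions in each split (all close to the original imbalance).
list(train = y_train, eval = y_eval, test = y_test) %>%
  map(~ prop.table(table(.x)))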
We consider the positive class to be income = large.
|  | Condition Positive | Condition Negative |
|---|---|---|
| Prediction Positive | TP | FP |
| Prediction Negative | FN | TN |
\[ \text{acc} = \frac{TP + TN}{N}, \quad \text{where } N = TP + FP + FN + TN \text{ is the total number of cases.} \]
\[ \kappa = \frac{\text{acc} - p_e}{1 - p_e} \]
where \( p_e \) is the expected accuracy under random chance agreement.
The kappa metric can be thought of as a modification of the accuracy metric that accounts for the class proportions.
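For a two-class problem the chance agreement can be written in terms of the confusion matrix counts:
\[ p_e = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{N^2}. \]
For a trivial classifier that always predicts the negative class, \( TP = FP = 0 \), so \( p_e = TN/N = \text{acc} \) and therefore \( \kappa = 0 \).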
\[ \text{sens} = \frac{TP}{TP + FN} \]
\[ \text{spec} = \frac{TN}{TN + FP} \]
\[ \text{prec} = \frac{TP}{TP + FP} \]
\[ F_\beta = (1 + \beta^2)\frac{\text{prec}\times \text{recall}}{\beta^2\text{prec} + \text{recall}} \]
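Here recall is the same as the sensitivity defined above. The summary tables below report \( F_1 \), the \( \beta = 1 \) special case:
\[ F_1 = 2\,\frac{\text{prec} \times \text{recall}}{\text{prec} + \text{recall}}. \]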
We compare three models:

- Trivial model: always predict the same class.
- PLS (partial least squares): supervised dimensionality reduction.
- GBM (gradient boosting machine): tree ensemble model.
We always predict the same class, income = small.
# Use the same (ordered) factor levels as y_test so that confusionMatrix works.
y_pred_trivial <- map_chr(.x = y_test, .f = ~ "small") %>%
  factor(levels = levels(y_test), ordered = TRUE)
We compute the confusion matrix to get the metrics.
# Confusion Matrix.
conf_matrix_trivial <- confusionMatrix(data = y_pred_trivial,
                                       reference = y_test)
| term | estimate |
|---|---|
| accuracy | 0.751 |
| kappa | 0.000 |
| sensitivity | 0.000 |
| specificity | 1.000 |

The accuracy of 0.751 is just the share of the majority class small in the test set, and \( \kappa = 0 \): despite its high accuracy, the trivial model carries no information about the minority class.
We can use the pROC package.
# Five metrics: ROC, Sens, Spec (twoClassSummary) + Accuracy, Kappa (defaultSummary).
five_stats <- function(...) {
  c(twoClassSummary(...), defaultSummary(...))
}
# Define cross validation.
cv_num <- 7
train_control <- trainControl(method = "cv",
                              number = cv_num,
                              classProbs = TRUE,
                              summaryFunction = five_stats,
                              allowParallel = TRUE,
                              verboseIter = FALSE)

model_obj <- train(x = X_train,
                   y = y_train,
                   method = method,
                   tuneLength = 10,
                   # For linear models we scale and center.
                   preProcess = c("scale", "center"),
                   trControl = train_control,
                   metric = metric)
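Here method and metric are left as variables so the same call can be reused for each model and optimization target. For example (illustrative assignments, not part of the original code):

# E.g. fit a gradient boosting machine, selecting hyperparameters by the
# cross-validated sensitivity ("Sens" comes from twoClassSummary via five_stats).
method <- "gbm"
metric <- "Sens"
# Other combinations used below: method = "pls" and metric = "Accuracy".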
pls (metric = Accuracy):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.838 | 0.527 | 0.552 | 0.932 |

gbm (metric = Accuracy):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.869 | 0.629 | 0.655 | 0.94 |
pls (metric = Sens):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.84 | 0.535 | 0.561 | 0.932 |

gbm (metric = Sens):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.87 | 0.635 | 0.669 | 0.936 |
pls (alternative cutoff):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.8 | 0.534 | 0.82 | 0.793 |

gbm (alternative cutoff):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.836 | 0.611 | 0.862 | 0.827 |
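The alternative-cutoff results keep the fitted models and only move the probability threshold used to call a case large. A minimal sketch of how such a cutoff can be chosen with the pROC package (the object names and the top-left criterion are illustrative choices, not taken from the original code):

library(pROC)
# Predicted probabilities of the positive class (large) on the evaluation set.
prob_eval <- predict(model_obj, newdata = X_eval, type = "prob")[, "large"]
roc_obj <- roc(response = y_eval, predictor = prob_eval, levels = c("small", "large"))
# Threshold closest to the top-left corner (spec = 1, sens = 1) of the ROC curve.
dist_topleft <- (1 - roc_obj$sensitivities)^2 + (1 - roc_obj$specificities)^2
alt_cutoff <- roc_obj$thresholds[which.min(dist_topleft)]
# Re-label the test predictions with the alternative cutoff.
prob_test <- predict(model_obj, newdata = X_test, type = "prob")[, "large"]
y_pred_alt <- factor(ifelse(prob_test > alt_cutoff, "large", "small"),
                     levels = levels(y_test), ordered = TRUE)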
Up-sampling is any technique that simulates or imputes additional data points to improve balance across classes.
Down-sampling is any technique that reduces the number of samples to improve the balance across classes.
In caret:
df_upSample_train <- upSample(x = X_train,
                              y = y_train,
                              yname = "income")

X_upSample_train <- df_upSample_train %>% select(- income)
y_upSample_train <- df_upSample_train %>% pull(income)
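We can check the resulting class balance (table below); caret also provides downSample with the same interface for down-sampling. For instance, assuming forcats is loaded:

# Class shares after up-sampling.
y_upSample_train %>%
  fct_count() %>%
  mutate(share = n / sum(n))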
| class | count | share |
|---|---|---|
| large | 16148 | 0.5 |
| small | 16148 | 0.5 |
pls (up-sampled training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.791 | 0.527 | 0.854 | 0.77 |

gbm (up-sampled training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.841 | 0.62 | 0.854 | 0.837 |
SMOTE is a data sampling procedure that uses both up-sampling and down-sampling. To up-sample for the minority class, it synthesizes new cases: a data point is randomly selected from the minority class and its K-nearest neighbors are determined. The new synthetic data point is a random combination of the predictors of the randomly selected data point and its neighbors.
We can use the DMwR package:
df_smote_train <- DMwR::SMOTE(
  form = income ~ .,
  # Generate 2 synthetic minority (large) cases per original minority case.
  perc.over = 200,
  # Keep 1.5 majority (small) cases for each synthetic minority case generated.
  perc.under = 150,
  data = as.data.frame(bind_cols(income = y_train, X_train))
)

X_smote_train <- df_smote_train %>% select(- income)
y_smote_train <- df_smote_train %>% pull(income)
| class | count | share |
|---|---|---|
| large | 16065 | 0.5 |
| small | 16065 | 0.5 |
pls (SMOTE training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.799 | 0.52 | 0.774 | 0.807 |

gbm (SMOTE training data):

| accuracy | kappa | sensitivity | specificity |
|---|---|---|---|
| 0.864 | 0.624 | 0.678 | 0.925 |
| Method | Tag | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| pls | Accuracy | 0.552 | 0.932 | 0.730 | 0.552 | 0.628 |
| pls | Sens | 0.561 | 0.932 | 0.733 | 0.561 | 0.636 |
| pls | Alt Cutoff | 0.820 | 0.793 | 0.568 | 0.820 | 0.671 |
| pls | Up Sampling | 0.854 | 0.770 | 0.552 | 0.854 | 0.670 |
| pls | SMOTE | 0.774 | 0.807 | 0.571 | 0.774 | 0.657 |
| Method | Tag | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| gbm | Accuracy | 0.655 | 0.940 | 0.782 | 0.655 | 0.713 |
| gbm | Sens | 0.669 | 0.936 | 0.777 | 0.669 | 0.719 |
| gbm | Alt Cutoff | 0.862 | 0.827 | 0.624 | 0.862 | 0.724 |
| gbm | Up Sampling | 0.854 | 0.837 | 0.635 | 0.854 | 0.728 |
| gbm | SMOTE | 0.678 | 0.925 | 0.751 | 0.678 | 0.713 |
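The precision, recall and F1 columns can be read off directly from caret's confusionMatrix by requesting the precision/recall mode; for example, for the trivial predictions computed earlier:

# Precision/recall view of the confusion matrix (positive class = "large").
confusionMatrix(data = y_pred_trivial,
                reference = y_test,
                positive = "large",
                mode = "prec_recall")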
Other strategies for class imbalance include:

- Adjusting Prior Probabilities
- Cost-Sensitive Training
- …
Applied Predictive Modeling, by Max Kuhn and Kjell Johnson.
https://juanitorduz.github.io/class_imbalance