# Exploring the Curse of Dimensionality - Part I.

In this post I want to present the notion of the curse of dimensionality by working through a suggested exercise (Chapter 4, Exercise 4) of the book *An Introduction to Statistical Learning*, written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

When the number of features $$p$$ is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when $$p$$ is large. We will now investigate this curse.

**(a)** Suppose that we have a set of observations, each with measurements on $$p = 1$$ feature, $$X$$. We assume that $$X$$ is uniformly (evenly) distributed on $$[0,1]$$. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within $$10$$% of the range of $$X$$ closest to that test observation. For instance, in order to predict the response for a test observation with $$X = 0.6$$ we will use observations in the range $$[0.55,0.65]$$. On average, what fraction of the available observations will we use to make the prediction?

Let us prepare the notebook.

```r
library(glue)
library(tidyverse)
```

Let $$\lambda := 0.1$$ denote the locality parameter, i.e. $$10$$%.

```r
lambda <- 0.1
```

Let us generate $$X$$, i.e. the sample observations.

```r
# Set dimension.
p <- 1

# Number of sample points.
n0 <- 100

# Generate observation matrix.
X <- runif(n = n0, min = 0, max = 1) %>% matrix(ncol = p)
```

Given $$z_0 \in [0,1]$$, let us define a function which returns the $$\lambda$$-interval. We need to consider separate cases, depending on whether $$z_0$$ is within distance $$\lambda/2$$ of one of the boundary points.

```r
GenerateInterval <- function (z0, lambda) {
  #' Generate lambda-interval.
  #'
  #' @param z0 point in the unit interval.
  #' @param lambda locality parameter.
  #' @return lambda-interval of z0.

  if (z0 < lambda/2) {
    # z0 is close to the left boundary.
    u.min <- 0
    u.max <- lambda

  } else if (z0 < 1 - lambda/2) {
    # z0 is in the interior.
    u.min <- z0 - lambda/2
    u.max <- z0 + lambda/2

  } else {
    # z0 is close to the right boundary.
    u.min <- 1 - lambda
    u.max <- 1
  }

  interval <- c(u.min, u.max)

  return(interval)
}
```

Let us consider some examples:

```r
z0 <- 0.6

GenerateInterval(z0 = z0, lambda = lambda)
## [1] 0.55 0.65

z0 <- 0

GenerateInterval(z0 = z0, lambda = lambda)
## [1] 0.0 0.1
```

Now let us write a function which checks whether a point $$x \in [0,1]$$ belongs to a given interval.

```r
IsInInterval <- function (x, interval) {
  #' Evaluate whether a given point belongs to the given interval.
  #'
  #' @param x point in the unit interval.
  #' @param interval vector of length two indicating the interval limits.
  #' @return boolean indicating whether x belongs to the given interval.

  ((x > interval[1]) & (x < interval[2]))
}
```

For example,

```r
z0 <- 0.6

interval <- GenerateInterval(z0 = z0, lambda = lambda)

IsInInterval(x = 0.5, interval = interval)
## [1] FALSE

IsInInterval(x = 0.57, interval = interval)
## [1] TRUE
```

Next, given any point and an observation matrix $$X$$, we want to count the number of sample points which belong to the $$\lambda$$-interval of that point.

```r
CountAvailableObservations <- function (z0, X, lambda) {
  #' Count observations in the lambda-interval of a point.
  #'
  #' @param z0 point in the unit interval.
  #' @param X observation matrix.
  #' @param lambda locality parameter.
  #' @return number of samples in the lambda-interval of z0.

  interval <- GenerateInterval(z0 = z0, lambda = lambda)

  condition <- IsInInterval(x = X, interval = interval)

  count.obs <- X[condition] %>% length

  return(count.obs)
}
```
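As a quick sanity check, using the observation matrix $$X$$ generated above, the resulting count divided by `n0` should fluctuate around $$\lambda$$ (the exact value depends on the random sample):

```r
# Fraction of the n0 sample points falling in the lambda-interval of z0 = 0.6.
# On average this should be close to lambda = 0.1.
CountAvailableObservations(z0 = 0.6, X = X, lambda = lambda) / n0
```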

Now we write a simulation function.

```r
RunSimulation1 <- function(lambda, n0, n1, N) {
  #' Run simulation for p = 1.
  #'
  #' @param lambda locality parameter.
  #' @param n0 number of samples in the observation matrix.
  #' @param n1 number of samples in the prediction grid.
  #' @param N number of times to run the count.
  #' @return vector of fraction counts.

  # Run over simulation iterations.
  counts <- 1:N %>% map_dbl(.f = function(k) {
    # Generate random observation samples.
    X <- runif(n = n0, min = 0, max = 1)
    # Generate prediction grid.
    sample.grid <- runif(n = n1, min = 0, max = 1)
    # Average count of available observations over the prediction grid.
    sample.grid %>%
      map_dbl(.f = ~ CountAvailableObservations(z0 = .x, X = X, lambda = lambda)) %>%
      mean
  })
  # Counts are out of the n0 available observations,
  # so divide by n0 to obtain fractions.
  return(counts / n0)
}
```

Let us plot the distribution.

```r
sim.1 <- RunSimulation1(lambda = 0.1,
                        n0 = 100,
                        n1 = 100,
                        N = 100)

mean.sim.1 <- mean(sim.1)

tibble(value = sim.1) %>%
  ggplot() +
  theme_light() +
  geom_histogram(mapping = aes(x = value),
                 fill = 'blue',
                 alpha = 0.7,
                 bins = 20) +
  geom_vline(xintercept = mean.sim.1) +
  ggtitle(label = 'Fraction Simulation Distribution for p = 1')
```

This result is not surprising. Since the samples are uniformly distributed on $$[0,1]$$, the fraction is, on average, equal to the length $$\lambda$$ of the interval, which in this case is just $$0.1$$.
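We can make this comparison explicit:

```r
# Compare the simulated mean fraction with the theoretical value lambda.
glue('Simulated mean: {mean.sim.1} vs. theoretical value: {lambda}')
```

Let us add the theoretical value as a (red) reference line to the histogram: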

```r
tibble(value = sim.1) %>%
  ggplot() +
  theme_light() +
  geom_histogram(mapping = aes(x = value),
                 fill = 'blue',
                 alpha = 0.7,
                 bins = 20) +
  geom_vline(xintercept = mean.sim.1) +
  geom_vline(xintercept = lambda, color = 'red') +
  ggtitle(label = 'Fraction Simulation Distribution for p = 1')
```

**(b)** Now suppose that we have a set of observations, each with measurements on $$p = 2$$ features, $$X_1$$ and $$X_2$$. We assume that $$(X_1,X_2)$$ are uniformly distributed on $$[0,1]\times [0,1]$$. We wish to predict a test observation’s response using only observations that are within $$10$$% of the range of $$X_1$$ and within $$10$$% of the range of $$X_2$$ closest to that test observation. For instance, in order to predict the response for a test observation with $$X_1 = 0.6$$ and $$X_2 = 0.35$$, we will use observations in the range $$[0.55, 0.65]$$ for $$X_1$$ and in the range $$[0.3, 0.4]$$ for $$X_2$$. On average, what fraction of the available observations will we use to make the prediction?

We argue as above, writing a simulation function that works for any dimension $$p \geq 1$$.

```r
RunSimulationP <- function(lambda, p, n0, n1, N) {
  #' Run simulation for arbitrary dimension p.
  #'
  #' @param lambda locality parameter.
  #' @param p dimension of the feature space.
  #' @param n0 number of samples in the observation matrix.
  #' @param n1 number of samples in the prediction grid.
  #' @param N number of times to run the count.
  #' @return vector of fraction counts.

  # Run over simulation iterations.
  1:N %>% map_dbl(.f = function(k) {
    # Generate random observation samples.
    X <- runif(n = p*n0, min = 0, max = 1) %>% matrix(ncol = p)
    # Generate prediction grid.
    sample.grid <- runif(n = p*n1, min = 0, max = 1) %>% matrix(ncol = p)

    # Loop over the n1 points of the prediction grid.
    fraction.vect <- sapply(X = 1:n1, FUN = function(i) {

      condition.matrix <- sapply(X = 1:p, FUN = function(j) {
        # Select a point on the grid.
        z0 <- sample.grid[i, j]
        # Compute its lambda-interval.
        interval <- GenerateInterval(z0 = z0, lambda = lambda)
        # Check which observations belong to the lambda-interval.
        IsInInterval(x = X[, j], interval = interval)
      })
      # For each observation, count in how many dimensions the condition holds.
      condition.vect <- rowSums(condition.matrix)
      # Select observations which belong to the lambda-hypercube
      # (i.e. the condition holds in all p dimensions) and compute the fraction.
      sum(condition.vect == p) / n0
    })

    return(mean(fraction.vect))
  })
}
```

Let us run the simulation.

```r
p <- 2

sim.2 <- RunSimulationP(lambda = lambda,
                        p = p,
                        n0 = 100,
                        n1 = 100,
                        N = 100)

mean.sim.2 <- mean(sim.2)

mean.sim.2
## [1] 0.010059
```

Again, since the samples are uniformly distributed on $$[0,1]\times [0,1]$$, this fraction is going to be, on average, equal to the area of a square of side length $$\lambda$$, which in this case is just $$0.1^2 = 0.01$$.
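As before, we can compare the simulated mean against the theoretical value:

```r
# Compare the simulated mean fraction with the theoretical value lambda^p.
glue('Simulated mean: {mean.sim.2} vs. theoretical value: {lambda^p}')
```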

Let us plot the distribution.

```r
tibble(value = sim.2) %>%
  ggplot() +
  theme_light() +
  geom_histogram(mapping = aes(x = value),
                 fill = 'blue',
                 alpha = 0.7,
                 bins = 20) +
  geom_vline(xintercept = mean.sim.2) +
  geom_vline(xintercept = lambda^p, color = 'red') +
  ggtitle(label = glue('Fraction Simulation Distribution for p = {p}'))
```

**(c)** Now suppose that we have a set of observations on $$p = 100$$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from $$0$$ to $$1$$. We wish to predict a test observation’s response using observations within the $$10$$% of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

We run the simulation first for $$p=5$$.

```r
p <- 5

sim.p <- RunSimulationP(lambda = lambda,
                        p = p,
                        n0 = 100,
                        n1 = 100,
                        N = 100)

mean.sim.p <- mean(sim.p)
```

Let us plot the distribution.

```r
tibble(value = sim.p) %>%
  ggplot() +
  theme_light() +
  geom_histogram(mapping = aes(x = value),
                 fill = 'blue',
                 alpha = 0.7,
                 bins = 20) +
  geom_vline(xintercept = mean.sim.p) +
  geom_vline(xintercept = lambda^p, color = 'red') +
  ggtitle(label = glue('Fraction Simulation Distribution for p = {p}'))
```

For $$p=100$$ the simulation would simply return zero: the expected fraction is $$0.1^{100} = 10^{-100}$$, so with only $$100$$ available observations essentially no sample point ever falls inside the $$\lambda$$-hypercube.
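A direct computation makes this concrete. The expected number of observations inside the hypercube is $$n_0 \lambda^{100}$$ (here with $$n_0 = 100$$, as in the simulations above):

```r
# Expected number of observations (out of n0 = 100) inside the
# lambda-hypercube when p = 100.
n0 <- 100
n0 * lambda^100
## [1] 1e-98
```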

**(d)** Using your answers to parts (a) - (c), argue that a drawback of KNN when $$p$$ is large is that there are very few training observations “near” any given test observation.

As we have seen in parts (a) - (c), on average, the fraction of available observations used to make the prediction equals the volume of a $$p$$-dimensional hypercube of side length $$\lambda$$, which is $$\lambda^p$$. Since $$0 < \lambda < 1$$, this volume satisfies

$$\lim_{p\rightarrow \infty} \lambda^p = 0,$$

so when $$p$$ is large there are very few training observations “near” any given test observation. This makes the prediction unstable.
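To see the speed of this decay explicitly, the following snippet (using the tibble machinery already loaded via tidyverse) tabulates the expected fraction $$\lambda^p$$ for a few values of $$p$$:

```r
# Expected fraction of available observations inside the lambda-hypercube.
tibble(Dimension = c(1, 2, 5, 10, 100)) %>%
  mutate(ExpectedFraction = lambda^Dimension)
```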

**(e)** Now suppose that we wish to make a prediction for a test observation by creating a $$p$$-dimensional hypercube centered around the test observation that contains, on average, $$10$$% of the training observations. For $$p = 1,2$$, and $$100$$, what is the length of each side of the hypercube?

In this case we want to solve for $$\lambda$$ in the equation $$\lambda^p=0.1$$, i.e. $$\lambda = 0.1^{1/p}$$.

- $$p=1$$

```r
p <- 1

0.1^(1/p)
## [1] 0.1
```

- $$p=2$$

```r
p <- 2

0.1^(1/p)
## [1] 0.3162278
```

- $$p=100$$

```r
p <- 100

0.1^(1/p)
## [1] 0.9772372
```

Let us see the values of $$\lambda$$ as a function of $$p$$.

```r
tibble(Dimension = 1:100) %>%
  mutate(Lambda = 0.1^(1/Dimension)) %>%
  ggplot() +
  theme_light() +
  geom_line(mapping = aes(x = Dimension, y = Lambda)) +
  geom_hline(yintercept = 1, color = 'red')
```

That is, the side of the hypercube must grow to almost $$1$$! This means that the “locality” of the prediction is lost.

Remark: It is interesting to see that, even though the volume of the $$\lambda$$-hypercube tends to zero when $$p$$ is large, the length of its diagonal (defined as the distance between two opposite corners) grows to $$+\infty$$. Indeed, by the Pythagorean theorem (or induction on $$p$$), the diagonal has length $$\sqrt{p\lambda^2} = \lambda\sqrt{p}$$.
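As a quick numerical sanity check of this formula, we can compute the distance between two opposite corners directly, say for $$p = 3$$:

```r
# Distance between two opposite corners of the lambda-hypercube in dimension p = 3.
p <- 3
corner.1 <- rep(0, p)
corner.2 <- rep(lambda, p)
# The Euclidean distance should equal lambda * sqrt(p).
sqrt(sum((corner.2 - corner.1)^2))
lambda * sqrt(p)
```

Finally, let us plot the diagonal length as a function of the dimension: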

```r
tibble(Dimension = 1:1000) %>%
  mutate(Diagonal = sqrt(Dimension) * lambda) %>%
  ggplot() +
  theme_light() +
  geom_line(mapping = aes(x = Dimension, y = Diagonal))
```