Open Data: Berlin Kitas - Dr. Juan Camilo Orduz

In this notebook I want to explore some data I found on the Berlin Open Data portal daten.berlin.de. The data source contains information of Kitas (Kindertagesstätte, i.e. kindergartens) in Berlin. This is a big topic as finding a spot in a Kita in Berlin is extremely difficult. We first provide an initial exploratory data analysis of the data set, then we merge it with population data to create some geo-location maps.

Disclaimer: In our way we do not expect to answer on how to find a Kita, but rather simpler questions about their location, distribution and sizes.

Prepare Notebook

import numpy as np
import pandas as pd
import geopandas as gpd
import mplleaflet
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
sns.set_style(
    style='darkgrid', 
    rc={'axes.facecolor': '.9', 'grid.color': '.8'}
)
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
%matplotlib inline
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100

Read Data

Let us load the data from the orignial Excel file (which I store locally).

kitas_raw_df = pd.read_excel(
    io='../Data/kitaliste_aktuell.xlsx', 
    skiprows=4, 
    dtype={'PLZ': str}
)

kitas_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2739 entries, 0 to 2738
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Betreuungsbezirk Nr  2739 non-null   int64  
 1   Betreuungsbezirk     2739 non-null   object 
 2   Einrichtungsnummer   2739 non-null   int64  
 3   Einrichtungsname     2739 non-null   object 
 4   Einrichtungsadresse  2739 non-null   object 
 5   PLZ                  2739 non-null   object 
 6   Ort                  2739 non-null   object 
 7   Telefon              2708 non-null   object 
 8   Anzahl Plätze        2736 non-null   float64
 9   Einrichtungsart      2739 non-null   object 
 10  Trägernummer         2739 non-null   int64  
 11  Trägername           2739 non-null   object 
 12  Trägerart            2739 non-null   object 
dtypes: float64(1), int64(3), object(9)
memory usage: 278.3+ KB

Format Data

Next, we format the data and prepare it for the analysis.

# We verify there is just one city.
kitas_raw_df['Ort'].unique()

array(['Berlin'], dtype=object)

# There is a 1-1 correspondece between `Betreuungsbezirk Nr` and `Betreuungsbezirk`.
kitas_raw_df[['Betreuungsbezirk Nr', 'Betreuungsbezirk']].drop_duplicates()

	Betreuungsbezirk Nr	Betreuungsbezirk
0	1	Mitte
329	2	Friedrichshain-Kreuzberg
620	3	Pankow
994	4	Charlottenburg-Wilmersdorf
1261	5	Spandau
1403	6	Steglitz-Zehlendorf
1602	7	Tempelhof-Schöneberg
1863	8	Neukölln
2096	9	Treptow-Köpenick
2287	10	Marzahn-Hellersdorf
2420	11	Lichtenberg
2581	12	Reinickendorf

# Rename columns from German to English.
rename_cols = {
    'Betreuungsbezirk': 'district',
    'Einrichtungsnummer': 'id',
    'Einrichtungsname': 'name',
    'Einrichtungsadresse': 'address',
    'PLZ': 'plz',                  
    'Telefon': 'telephone',       
    'Anzahl Plätze': 'spots'  ,
    'Einrichtungsart': 'type',
    'Trägernummer': 'carrier_number',
    'Trägername': 'carrier_name',  
    'Trägerart': 'carrier_type',
}

columns_to_drop = [
    'Betreuungsbezirk Nr', 
    'Ort',
]

# Format data: remove redundant columns, rename columns and add new features.
kitas_df = kitas_raw_df \
    .copy() \
    .drop(columns_to_drop, axis=1) \
    .rename(columns=rename_cols) \
    .assign(
        num_kitas_plz=lambda x: x.groupby(['district', 'plz'])['id'].transform('count'),
        spots_plz=lambda x: x.groupby(['district', 'plz'])['spots'].transform(np.sum)
    )
    
kitas_df.head()

	district	id	name	address	plz	telephone	spots	type	carrier_number	carrier_name	carrier_type	num_kitas_plz	spots_plz
0	Mitte	1010010	Kita F.A.I.R.play	Albrechtstr. 020	10117	281 64 73	69.0	Kindertagesstätte	1224	GFJ - gemeinnützige Gesellschaft für Familien-…	Sonstiger freier Träger	21	1550.0
1	Mitte	1010020	Kita Kinderwelt	An der Kolonnade 003-5	10117	2291378	155.0	Kindertagesstätte	1334	Forum Soziale Dienste GmbH	Sonstiger freier Träger	21	1550.0
2	Mitte	1010030	FRÖBEL Kindergarten Casa Fantasia	Anklamer Str. 038	10115	4498171	69.0	Kindertagesstätte	1227	Fröbel Bildung und Erziehung gGmbH	Sonstiger freier Träger	24	1807.0
3	Mitte	1010080	Kita Regenbogen	Fehrbelliner Str. 080	10119	449 32 38	91.0	Kindertagesstätte	1202	Pfefferwerk Stadtkultur gGmbH	Sonstiger freier Träger	11	1150.0
4	Mitte	1010100	FRÖBEL Kindergarten “Schatzinsel”	Fischerinsel 008	10179	201 37 88	241.0	Kindertagesstätte	1227	Fröbel Bildung und Erziehung gGmbH	Sonstiger freier Träger	10	939.0

Exploratory Data Analysis

One of the main objectives of this notebook is to do an exploratory data analysis to understand which questions this data set can answer. In addition, also determine its limitations. To begin, let us get the number of unique values per feature.

kitas_df.apply(lambda x: x.unique().size, axis=0)

    district            12
    id                2739
    name              2597
    address           2707
    plz                190
    telephone         2601
    spots              200
    type                 4
    carrier_number    1224
    carrier_name      1224
    carrier_type         8
    num_kitas_plz       37
    spots_plz          194
    dtype: int64

# Dimensions data set.
kitas_df.shape

(2739, 13)

We have 2739 Kitas in our data set and the id column is indeed an unique identifier of each Kita. The name, on the other hand, is not.

kitas_df \
    .groupby(['name']) \
    .agg(num_kitas=('id', 'count')) \
    .query('num_kitas > 1') \
    .sort_values('num_kitas', ascending=False) \
    .head()

	num_kitas
name
Kleiner Fratz	18
IB-Kita, Internationaler Bund	8
Kita/Kinder in Bewegung (KiB)	6
Kita Sonnenschein	5
Kita Kinder in Bewegung (KiB)	5

Kitas per District

We start by looking into the first layer of geo-location granularity: district (there are 12 districts in Berlin).

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        x='n', 
        y='district',
        color=sns_c[0],
        edgecolor=sns_c[0],
        ax=ax
    )
ax.set(
    title='Number of Kitas per District in Berlin', 
    xlabel='number of kitas', 
    ylabel='district'
);

We generate the same plot but showing the shares instead.

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district']) \
    .agg(n=('id', 'count')) \
    .assign(share= lambda x: x['n'] / x['n'].sum()) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        x='share', 
        y='district',
        color=sns_c[1],
        edgecolor=sns_c[1],
        ax=ax
    )
ax.xaxis.set_major_formatter(mtick.FuncFormatter(lambda y, _: '{:0.0%}'.format(y)))
ax.set(
    title='Share of Kitas per District in Berlin', 
    xlabel='share of kitas', 
    ylabel='district'
);

Pankow is a rather big district compared to Mitte and Friedrichshain-Kreuzberg area-wise. In order to run a fair comparison across districts, we need to control by size or population (which we will do bellow).

Kitas per PLZ

Now we dig deeper into plz level. A natural question is: How many Kitas each plz has? To answer it let us look into the distribution (again, in order to make a fair comparison we need to control by plz population or size).

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('name', 'count')) \
    .pipe((sns.histplot, 'data'),
        x='n', 
        bins=25, 
        kde=True,
        ax=ax
    )
ax.set(
    title='Number of Kitas per PLZ', 
     xlabel='number of kitas per plz',
     xlim=(0, None)
);

We see there is a long tail. Let us see the plz with more Kitas:

fig, ax = plt.subplots(figsize=(12, 6))
kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .assign(plz = lambda x: 'plz_' + x['plz']) \
    .sort_values('n', ascending=False) \
    .head(40) \
    .pipe((sns.barplot, 'data'),  
        x='plz',
        y='n',
        hue='district',
        dodge=False,
        ax=ax
    )
ax.tick_params(axis='x', labelrotation=90)
ax.set(
    title='Number of Kitas per PlZ in Berlin (Top 40)', 
     xlabel='plz', 
     ylabel='number of kitas', 
);

Let us locate plz = 10437 using Google maps.

This is a very trendy area in Berlin, specially for young families.

Next, let us plot the distribution of the number of Kitas per plz split by district.

g = kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('name', 'count')) \
    .pipe((sns.displot, 'data'), 
        x='n',
        hue='district',
        kind='kde',
        rug=True, 
        height=5,
        aspect=1.3,
        palette='Paired'
    )
ax = g.axes.flatten()[0]
ax.set(
    title='Number of Kitas per PlZ', 
     xlabel='number of kitas per plz',
     xlim=(0, None)
);

fig, ax = plt.subplots(figsize=(12, 6))
kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('district') \
    .pipe((sns.boxplot, 'data'), 
        x='district', 
        y='n',
        color=sns_c[0],
        saturation=1.0,
        ax=ax
    )
ax.tick_params(axis='x', labelrotation=90)
ax.set(
    title='Number of Kitas per PlZ in Berlin within each District', 
     ylabel='number of kitas per plz', 
);

It seems that, on average, Friedrichshain-Kreuzberg have more Kitas per plz and Pankow has the highest variance (again, we need to control by plz size to be more precise).

Spots per Kita

Let us now analyze the number of spots data.

# Some descriptive stats.
kitas_df['spots'].describe()

count 2736.000000 mean 67.617690 std 55.624663 min 0.000000 25% 25.000000 50% 45.000000 75% 95.000000 max 320.000000 Name: spots, dtype: float64

Note that there is a Kita wih spots=0. I am not sure about the reason behind this.

kitas_df.query('spots == 0')

	district	id	name	address	plz	telephone	spots	type	carrier_number	carrier_name	carrier_type	num_kitas_plz	spots_plz
1122	Charlottenburg-Wilmersdorf	4073100	Dickenskita	Dickensweg 017-19	14055	35109330	0.0	Kindertagesstätte	8435	Berlin British School gGmbH	Sonstiger freier Träger	11	500.0

We plot the spots distribution:

spots_mean = kitas_df['spots'].describe()['mean']
spots_median = kitas_df['spots'].describe()['50%']

fig, ax = plt.subplots()
sns.histplot(x='spots', data=kitas_df, kde=True, color=sns_c[3], ax=ax)
ax.axvline(x=spots_mean, color=sns_c[1], linestyle='--', label=f'mean = {spots_mean: 0.0f}')
ax.axvline(x=spots_median, color=sns_c[4], linestyle='--', label=f'median = {spots_median: 0.0f}')
ax.legend(loc='upper right')
ax.set(
    title='Number of Spots per Kita', 
     xlabel='number of spots per kita',
     xlim=(0, None)
);

Here is a list of the top 20 kitas with more spots:

kitas_df \
    [['name', 'district', 'plz', 'spots']] \
    .sort_values('spots', ascending=False) \
    .head(20)

	name	district	plz	spots
54	BCS Kindergarten Preschool	Mitte	10115	320.0
2441	Kita Lichtenzwerge	Lichtenberg	10315	300.0
2428	Kita Bärenkinder	Lichtenberg	10319	300.0
2543	Kita Neustrelitzer Str. 32-34/Kigä NordOst	Lichtenberg	13055	285.0
2113	FRÖBEL Kindergarten-im-Grünen	Treptow-Köpenick	12487	282.0
647	Klax Krippen Regentropfenhaus und Wolkenzwerge…	Pankow	10439	280.0
1885	Kita Mini-Mix-International	Neukölln	12053	280.0
351	FRÖBEL Kindergarten Am Ring	Friedrichshain-Kreuzberg	10247	272.0
14	FRÖBEL Kindergarten “Traumzauberbaum”	Mitte	10178	260.0
2590	Kita Treuenbrietzener Straße	Reinickendorf	13439	252.0
2541	Kita Nido Piccolo	Lichtenberg	13059	250.0
2566	Kita Tausendfüßler	Lichtenberg	13055	250.0
329	Kita Spiel- und Erlebniswelt	Friedrichshain-Kreuzberg	10243	248.0
346	FRÖBEL Kindergarten FRÖBELSPATZEN	Friedrichshain-Kreuzberg	10243	246.0
864	Kita Am Brennerberg	Pankow	13187	245.0
4	FRÖBEL Kindergarten “Schatzinsel”	Mitte	10179	241.0
2115	FRÖBEL Kindergarten Wirbelwind	Treptow-Köpenick	12435	240.0
335	Kita Gryphiusstr. 34-35	Friedrichshain-Kreuzberg	10245	240.0
2460	Kita Bunte Plonzstifte	Lichtenberg	10365	240.0
901	Klax Krippe Sonnenhaus,Klax Kindergarten Wolke…	Pankow	13189	240.0

Let us see the spots distributoin per district:

g = kitas_df \
    .pipe((sns.displot, 'data'), 
        x='spots',
        hue='district',
        kind='kde',
        rug=True, 
        height=5,
        aspect=1.3,
        palette='Paired'
    )
ax = g.axes.flatten()[0]
ax.set(
    title='Number of Spots per Kita', 
    xlabel='number of spots', 
    xlim=(0, None)
);

fig, ax = plt.subplots(figsize=(12, 6))
kitas_df \
    .sort_values('district') \
    .pipe((sns.boxplot, 'data'), 
        x='district', 
        y='spots', 
        color=sns_c[3],
        saturation=1.0,
        ax=ax
    )
ax.axhline(y=kitas_df['spots'].describe()['mean'], color=sns_c[1], linestyle='--', label='mean')
ax.axhline(y=kitas_df['spots'].describe()['50%'], color=sns_c[4], linestyle='--', label='median')
ax.tick_params(axis='x', labelrotation=90)
ax.legend(loc='upper left')
ax.set(
    title='Number of Spots per Kita', 
     ylabel='number of spots', 
);

Note that Pankow, Mitte and Friedrichshain-Kreuzberg have a lowe number of spots per Kita (these are the top 3 districts with mote Kitas). We want to see if there is a linear relation between the number of Kitas per plz and the number (median) of spots per Kita.

g = kitas_df \
    .groupby(['district', 'plz']) \
    .agg({'num_kitas_plz': np.mean, 'spots': np.median}) \
    .pipe((sns.jointplot, 'data'), 
        x='num_kitas_plz',  
        y='spots',
        kind='reg', 
        order=1,
        height=8,
        ratio=3, 
        marginal_kws=dict(color=sns_c[1])
    )
g.plot_joint(sns.kdeplot, zorder=0, levels=6, color=sns_c[3], alpha=0.7)
g.set_axis_labels(xlabel='number of kitas per plz', ylabel='median spots per plz')
g.fig.suptitle('Number of Kitas vs Spots per PLZ', y=1.03);

It seems there is a negative relation between them, i.e. plz with more Kitas seem to have less spots per Kita. Let us plot this per district:

g = kitas_df \
    .groupby(['district', 'plz'], as_index=False) \
    .agg({'num_kitas_plz': np.mean, 'spots': np.median}) \
    .pipe((sns.FacetGrid, 'data'),
        col='district', 
        col_wrap=4, 
        height=3,
    )
g.map(sns.regplot, 'num_kitas_plz', 'spots')
g.fig.suptitle('Number of Kitas vs Spots per PLZ', y=1.03);

Carriers

Now we want to study the carrier feature. First, let us count the number of kitas per carrier_type.

fig, ax = plt.subplots()
kitas_df \
    .groupby(['carrier_type']) \
    .agg(n=('id', 'count')) \
    .reset_index() \
    .sort_values('n', ascending=False) \
    .assign(share= lambda x: x['n'] / x['n'].sum()) \
    .pipe((sns.barplot, 'data'), 
        x='share', 
        y='carrier_type',
        color=sns_c[2],
        edgecolor=sns_c[2],
        ax=ax
    )
fmt = lambda y, _: f'{y :0.0%}'
ax.xaxis.set_major_formatter(mtick.FuncFormatter(fmt))
ax.set(
    title='Number of Kitas per Carrier Type', 
    xlabel='number of kitas', 
    ylabel='carrier type'
);

Here are some explanations on the carrier_type:

Arbeiterwohlfahrt: Workers welfare
Betriebskita: Company daycare
Deutscher Caritasverband: German Caritas Association
Eigenbetrieb: Own operation
EKT (Eltern-Initiativ-Kindertagesstätte): Parents initiative.
Diakonisches Werk: Diaconal work
private Kita ohne Abschluss der Rahmenvereinbarung: private daycare without signing the framework agreement
Sonstiger freier Träger: Other private carrier

# Note that there is. a 1-1 correspondence between `carrier_number` and `carrier_name`.
kitas_df \
    .groupby('carrier_number')\
    .agg({'carrier_name': 'nunique'}) \
    .query('carrier_name > 1') \
    .shape

(0, 1)

Now we plot the carrier_type distriibution per district.

fig, ax = plt.subplots(figsize=(15, 6))
kitas_df \
    .groupby(['district', 'carrier_type']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        y='n', 
        x='district',
        hue='carrier_type',
        edgecolor='black',
        dodge=True,
        ax=ax
    )
ax.tick_params(axis='x', labelrotation=90)
ax.set(
    title='Number of Kitas per District in Berlin (per Carrier Type)', 
    xlabel='district', 
    ylabel='number of kitas'
);

We can also compute carrier_type share per district:

fig, ax = plt.subplots()
cmap = sns.light_palette(sns_c[0])
fmt = lambda y, _: f'{y :0.0%}'
kitas_df \
    .groupby(['district', 'carrier_type']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .assign(
        num_kitas_district = lambda x: x.groupby(['district'])['n'].transform(np.sum),
        share = lambda x: x['n'].div(x['num_kitas_district'])
    ) \
    .pivot(index='district', columns='carrier_type', values='share') \
    .fillna(0.0) \
    .pipe((sns.heatmap, 'data'), 
        vmin=0.0,
        vmax=1.0,
        cmap=cmap,
        linewidths=0.2, 
        linecolor='black',
        annot=True, 
        fmt='0.2%',
        cbar_kws={'format': mtick.FuncFormatter(fmt)},
        ax=ax
    )
ax.set(title='Carrier Type Share per District');

It is very interesting to see how high the share of EKT in Charlottenburg-Wilmersdorf and Friedrichshain-Kreuzberg.

Now we want to investigate the carriers coverage on the number of kitas. The questions we want to answer are:

Do carriers with more than one Kita dominate the Kita distribution?
Do we see a difference across districts?

To answer both question we need to compute how many Kitas each carrier has. In addition,we sort them out by the number of Kitas calculate the (cumulative) share on the number of Kitas. This will allow us to answer the first question.

carrier_df = kitas_df \
    .groupby(['carrier_name']) \
    .agg(num_kitas=('name', pd.Series.nunique)) \
    .sort_values('num_kitas', ascending=False) \
    .assign(
        single_kita = lambda x: x['num_kitas'] < 2,
        share = lambda x: x['num_kitas'] / x['num_kitas'].sum(),
        cumsum_share = lambda x: x['share'].cumsum(),
        idx= lambda x: range(x.shape[0])
    )

carrier_df.head()

	num_kitas	single_kita	share	cumsum_share	idx
carrier_name
Kindergärten NordOst	77	False	0.029079	0.029079	0
Kindertagesstätten Nordwest Eigenbetrieb von Berlin	63	False	0.023792	0.052870	1
Kindergärten City	56	False	0.021148	0.074018	2
Kindertagesstätten SüdOst Eigenbetrieb von Berlin	44	False	0.016616	0.090634	3
Kindertagesstätten Berlin Süd-West	37	False	0.013973	0.104607	4

Let us compute the distribution of carriers with a single Kita vs carriers with more than one Kita.

carrier_df \
    .groupby('single_kita')\
    .agg(n=('idx', 'count')) \
    .assign(share=lambda x: x['n'].div(x['n'].sum()))

	n	share
single_kita
False	365	0.298203
True	859	0.701797

Next we plot the cumulative share data on the number of Kitas:

k = 200 
k_cumsum_share = carrier_df.query(f'idx == {k - 1}')['cumsum_share'].values[0]
idx_single_kita = carrier_df.query('single_kita').iloc[0]['idx']
cumsum_share_single_kita = carrier_df.query('single_kita').iloc[0]['cumsum_share']

fig, ax = plt.subplots()
sns.lineplot(x='idx', y='cumsum_share', data=carrier_df, color=sns_c[0], label='cumulative sum share', ax=ax)
ax.axvline(x=idx_single_kita, color=sns_c[4], linestyle='--', label=f'{idx_single_kita} carriers have more than one Kita ({cumsum_share_single_kita: 0.2%})')
ax.axhline(y=cumsum_share_single_kita, color=sns_c[4], linestyle='--')
ax.axvline(x=k, color=sns_c[3], linestyle='--', label=f'{k} carriers account for {k_cumsum_share: 0.2%} of the Kitas')
ax.axhline(y=k_cumsum_share , color=sns_c[3], linestyle='--')
ax.legend(loc='lower right')
ax.set(
    title='Cumulative Share of Top Kita Carriers', 
    xlabel='carrier rank (based on number of Kitas)',
    ylabel='cumsum share'
);

We have two main takeaways from this plot:

356 carriers, which are ~ 30% of the total number of carriers, cover ~ 67% of the number of Kitas.
The top 200 carriers, which are ~ 16% of the total number of carriers, cover more than half (~ 55%) of the number of Kitas.

Hence, the carriers. with more than one Kita have a very high coverage.

# We merge the carriers data to. the Kitas data.
kitas_df = pd.merge(
    left=kitas_df, 
    right=carrier_df['single_kita'].reset_index(), 
    on='carrier_name', 
    how='left'
)

Now let us answer the second question by looking into the districts.

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district', 'single_kita']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        x='n', 
        y='district',
        hue='single_kita',
        edgecolor='black',
        palette=[sns_c[3], sns_c[0]],
        ax=ax
    )
ax.set(
    title='Number of Kitas per District in Berlin (Carrier with/without Single Kita)', 
    xlabel='number of kitas', 
    ylabel='district'
);

We now plot the relative share over districts of number of Kitas ans spots (which are of course correlated):

carrier_kitas_df = kitas_df \
    .groupby(['district', 'single_kita']) \
    .agg(n=('id', 'count'), spots=('spots', np.sum)) \
    .reset_index(drop=False) \
    .sort_values('district') \
    .assign(
        total_n = lambda x: x.groupby('district')['n'].transform(np.sum),
        total_spots = lambda x: x.groupby('district')['spots'].transform(np.sum),
        share_n = lambda x: x['n'].div(x['total_n']),
        share_spots = lambda x: x['spots'].div(x['total_spots']),
    )

fig, ax = plt.subplots(1, 2, constrained_layout=True)
cmap0 = sns.light_palette(sns_c[1])
cmap1 = sns.light_palette(sns_c[4])
fmt = lambda y, _: f'{y :0.0%}'
carrier_kitas_df \
    .pivot(index='district', columns='single_kita', values='share_n') \
    .pipe((sns.heatmap, 'data'), 
        vmin=0.0,
        vmax=1.0,
        cmap=cmap0,
        linewidths=0.2, 
        linecolor='black',
        annot=True, 
        fmt='0.2%',
        cbar_kws={'format': mtick.FuncFormatter(fmt)},
        ax=ax[0]
    )

carrier_kitas_df \
    .pivot(index='district', columns='single_kita', values='share_spots') \
    .pipe((sns.heatmap, 'data'), 
        vmin=0.0,
        vmax=1.0,
        cmap=cmap1,
        linewidths=0.2, 
        linecolor='black',
        annot=True, 
        fmt='0.2%',
        cbar_kws={'format': mtick.FuncFormatter(fmt)},
        ax=ax[1]
    )

ax[0].set(title='Number of Kitas Share', xlabel='carrier single kita')
ax[1].set(title='Number of Spots Share', xlabel='carrier single kita', ylabel='');

In Charlottenburg-Wilmersdorf the share of carriers with/without single Kitas is quite balanced whereas in Marzahn-Hellersdorf and Treptow-Köpenick is quite unbalanced.

This ends the an initial exploratory data analysis of the Kitas data. There are still many interesting things to look, but we will leave them for a second iteration.

Maps & Population Data

Next we enrich out data set by including information about population per plz. In addition, we use maps to visualize some metrics. Please refer to the blog post Open Data: Germany Maps Viz for the description of the data sources.

Geo-location Data: geo-pandas dataframe with plz geometries.

plz_shape_df = gpd.read_file('../Data/plz-gebiete.shp', dtype={'plz': str})

plz_shape_df.head()

	plz	note	geometry
0	52538	52538 Gangelt, Selfkant	POLYGON ((5.86632 51.05110, 5.86692 51.05124, …
1	47559	47559 Kranenburg	POLYGON ((5.94504 51.82354, 5.94580 51.82409, …
2	52525	52525 Waldfeucht, Heinsberg	POLYGON ((5.96811 51.05556, 5.96951 51.05660, …
3	52074	52074 Aachen	POLYGON ((5.97486 50.79804, 5.97495 50.79809, …
4	52531	52531 Ãbach-Palenberg	POLYGON ((6.01507 50.94788, 6.03854 50.93561, …

Population Data

plz_einwohner_df = pd.read_csv(
    '../Data/plz_einwohner.csv',
    sep=',',
    dtype={'plz': str, 'einwohner': int}
)

plz_einwohner_df.head()

	plz	einwohner
0	01067	11957
1	01069	25491
2	01097	14811
3	01099	28021
4	01108	5876

Merge Data

We merge these data sets on plz.

plz_df = pd.merge(
    left=plz_shape_df[['plz', 'geometry']],
    right=plz_einwohner_df,
    on='plz',
    how='left'
)
# Merge with Kitas data.
plz_df = pd.merge(
    left=plz_df,
    right=kitas_df[['district', 'plz', 'num_kitas_plz', 'spots_plz']].drop_duplicates(),
    on='plz',
    how='inner'
)
# Add features: number of kitas per plz per capita (and its log-transform).
plz_df = plz_df.assign(
    num_kitas_plz_pc = lambda x: x['num_kitas_plz'].div(x['einwohner']), 
    num_kitas_plz_pc_log = lambda x: np.log(x['num_kitas_plz_pc'])
)

plz_df.head()

	plz	geometry	einwohner	district	num_kitas_plz	spots_plz	num_kitas_plz_pc	num_kitas_plz_pc_log
0	14109	POLYGON ((13.08835 52.41963, 13.09584 52.42198…	10049	Steglitz-Zehlendorf	8	357.0	0.000796	-7.135787
1	14089	POLYGON ((13.10929 52.45063, 13.10956 52.45108…	17734	Spandau	15	1112.0	0.000846	-7.075189
2	13591	POLYGON ((13.11738 52.51706, 13.11811 52.52010…	26762	Spandau	16	1595.0	0.000598	-7.422150
3	13587	POLYGON ((13.12796 52.58313, 13.12934 52.58593…	20108	Spandau	14	1113.0	0.000696	-7.269816
4	13593	POLYGON ((13.14288 52.52181, 13.14306 52.52179…	20238	Spandau	9	997.0	0.000445	-7.718093

First, let us visualize the plz per district in Berlin (we plot both static and dynamic maps).

fig, ax = plt.subplots(figsize=(12, 9))
plz_df.plot(
    ax=ax,
    column='district',
    categorical=True,
    cmap='Paired',
    edgecolor='black',
    linewidth=0.3,
    legend=True,
    legend_kwds={'title':'Districts', 'loc': 'center left', 'bbox_to_anchor': (1, 0.5)},
)
ax.set(
    title='Berlin PLZ',
    aspect=1.3
);

fig, ax = plt.subplots(figsize=(12, 9))
plz_df.plot(
    ax=ax,
    column='district',
    categorical=True,
    cmap='Paired',
    edgecolor='black',
    linewidth=0.3,
)
mplleaflet.display(fig=fig)

We clearly see that the districts size in area are very different. Note for example how small Friedrichshain-Kreuzberg and Mitte (both in top 3 districs with more Kitas) are as compared with Treptow-Köpenick.

Next we plot the number if inhabitants per plz.

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='einwohner',  
    categorical=False, 
    cmap='plasma_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - Population per PLZ',
    aspect=1.3
);

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='einwohner',  
    categorical=False, 
    cmap='plasma_r',
    edgecolor='black',
    linewidth=0.3
)
mplleaflet.display(fig=fig)

Now we plot the number of Kitas per plz.

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='num_kitas_plz',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - Number of Kitas per PLZ',
    aspect=1.3
);

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='num_kitas_plz',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
mplleaflet.display(fig=fig)

As we see, comparing plz data without controlling for population can lead to misleading conclusions. Let us not compute the number of Kitas per capita per district.

fig, ax = plt.subplots()
plz_df \
    .groupby(['district'], as_index=False) \
    .agg({'einwohner': np.sum, 'num_kitas_plz': np.sum}) \
    .assign(num_kitas_district_pc = lambda x: x['num_kitas_plz'].div(x['einwohner'])) \
    .sort_values('num_kitas_district_pc', ascending=False) \
    .pipe((sns.barplot, 'data'),
        x='num_kitas_district_pc',
        y='district',
        color=sns_c[4],
        edgecolor=sns_c[4],
        ax=ax,
    )
ax.set(
    title='Number of Kitas per District per Capita in Berlin',
    xlabel='number of kitas per capita',
    ylabel='district',
);

We see that the top 3 districts remain, but now Friedrichshain-Kreuzberg and Mitte are at the top.

Next, we plot the number of Kitas per capita per plz (log-transform).

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='num_kitas_plz_pc_log',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - (log) Number of Kitas per PLZ per Capita',
    aspect=1.3
);

We immediately see that there is a rather small plz with a high (log) number of Kitas. Let us find which one it is:

plz_df.sort_values('num_kitas_plz_pc', ascending=False).drop('geometry', axis=1).head()

	plz	einwohner	district	num_kitas_plz	spots_plz	num_kitas_plz_pc	num_kitas_plz_pc_log
18	14053	139	Charlottenburg-Wilmersdorf	3	70.0	0.021583	-3.835862
44	10709	8771	Charlottenburg-Wilmersdorf	18	824.0	0.002052	-6.188834
55	12165	4803	Steglitz-Zehlendorf	9	508.0	0.001874	-6.279771
146	10997	23800	Friedrichshain-Kreuzberg	42	1707.0	0.001765	-6.339771
68	10715	15464	Charlottenburg-Wilmersdorf	27	1027.0	0.001746	-6.350433

We not locate the plz = 14053.

This plz does not have many inhabitants, as it mainly encloses the Olympiastadion. We simply remove this data point from the map inorder to see a more faithful picture of the number of Kitas per capita per plz.

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.query('plz != "14053"').plot(
    ax=ax,
    column='num_kitas_plz_pc',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - Number of Kitas per PLZ per Capita',
    aspect=1.3
);

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.query('plz != "14053"').plot(
    ax=ax,
    column='num_kitas_plz_pc',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
mplleaflet.display(fig=fig)

Number of Kitas, Population, Spots and Districts Relations

Finally, let us study the relations between number of kitas, spots and population per plz. These variables are naturally correlated:

corr_mat = plz_df[['einwohner', 'num_kitas_plz', 'spots_plz']].corr()
fig, ax = plt.subplots(figsize=(6, 6))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.heatmap(
    data=corr_mat, 
    vmin=-1.0, 
    vmax=1.0, 
    center=0, 
    cmap=cmap, 
    square=True,
    linewidths=0.5, 
    linecolor='k',
    annot=True, 
    fmt='.3f',
    ax=ax
)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, horizontalalignment='right')
ax.set_xticklabels(ax.get_yticklabels(), horizontalalignment='center')
ax.set(title='Correlarion Matrix');

We can visualize them in a scatter plot.

fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(
    x='einwohner',
    y='num_kitas_plz',
    data=plz_df,
    hue='spots_plz',
    palette='viridis_r',
    edgecolor='black',
    linewidth=0.5,
    ax=ax
)
ax.legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax.set(
    title='Number of Kitas vs Inhabitants per PLZ',
    xlabel='population',
    ylabel='number of kitas'
);

Let us encode the district as color and spots as size.

fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(
    x='einwohner',
    y='num_kitas_plz',
    data=plz_df,
    hue='district',
    palette='Paired',
    edgecolor='black',
    linewidth=0.5,
    size='spots_plz',
    ax=ax
)
ax.legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax.set(
    title='Number of Kitas vs Inhabitants per PLZ',
    xlabel='population',
    ylabel='number of kitas'
);

Another way to represent their relation is using a 3D-plot.

from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import to_hex

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(
    xs=plz_df['einwohner'], 
    ys=plz_df['num_kitas_plz'], 
    zs=plz_df['spots_plz'], 
    c=to_hex(sns_c[0])
)
ax.set_title('Population, Number of Kitas ans Spots per PLZ iin Berlin', y=1.1);
ax.set(xlabel='population', ylabel='number of kitas', zlabel='spots');

Linear Regression Model

A natural question is: How do the relations above depend on the district? We of course expect different demographic distributions over different areas of the city. In this first iteration we run a simple linear regression model to estimate the variance explained of num_kitas_plz as a linear function of einwohner and district.

import statsmodels.formula.api as smf

model = smf.ols(formula='num_kitas_plz ~ einwohner + C(district)', data=plz_df)
result = model.fit()
print(result.summary())

    OLS Regression Results                            
    ==============================================================================
    Dep. Variable:          num_kitas_plz   R-squared:                       0.455
    Model:                            OLS   Adj. R-squared:                  0.423
    Method:                 Least Squares   F-statistic:                     14.00
    Date:                Sat, 19 Sep 2020   Prob (F-statistic):           6.33e-21
    Time:                        10:44:02   Log-Likelihood:                -696.16
    No. Observations:                 214   AIC:                             1418.
    Df Residuals:                     201   BIC:                             1462.
    Df Model:                          12                                         
    Covariance Type:            nonrobust                                         
    ===========================================================================================================
                                                  coef    std err          t      P>|t|      [0.025      0.975]
    -----------------------------------------------------------------------------------------------------------
    Intercept                                   0.8849      1.529      0.579      0.563      -2.130       3.900
    C(district)[T.Friedrichshain-Kreuzberg]     6.0192      2.399      2.509      0.013       1.289      10.750
    C(district)[T.Lichtenberg]                 -3.8870      2.365     -1.644      0.102      -8.550       0.776
    C(district)[T.Marzahn-Hellersdorf]         -6.0663      2.304     -2.633      0.009     -10.610      -1.523
    C(district)[T.Mitte]                        2.2961      1.903      1.206      0.229      -1.457       6.049
    C(district)[T.Neukölln]                    -1.1441      2.067     -0.554      0.581      -5.219       2.931
    C(district)[T.Pankow]                       1.1004      2.062      0.534      0.594      -2.966       5.167
    C(district)[T.Reinickendorf]               -3.6003      2.117     -1.700      0.091      -7.776       0.575
    C(district)[T.Spandau]                     -3.0675      2.293     -1.338      0.183      -7.589       1.454
    C(district)[T.Steglitz-Zehlendorf]         -3.5661      1.932     -1.846      0.066      -7.376       0.244
    C(district)[T.Tempelhof-Schöneberg]        -2.4554      1.751     -1.402      0.162      -5.908       0.997
    C(district)[T.Treptow-Köpenick]            -1.0303      2.120     -0.486      0.627      -5.210       3.149
    einwohner                                   0.0008   8.15e-05      9.586      0.000       0.001       0.001
    ==============================================================================
    Omnibus:                       13.969   Durbin-Watson:                   1.951
    Prob(Omnibus):                  0.001   Jarque-Bera (JB):               30.795
    Skew:                           0.237   Prob(JB):                     2.06e-07
    Kurtosis:                       4.797   Cond. No.                     2.02e+05
    ==============================================================================
    
    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 2.02e+05. This might indicate that there are
    strong multicollinearity or other numerical problems.

We now look into the model predictions.

# Collect predictions.
plz_model_df = pd.concat([plz_df, result.get_prediction(plz_df).summary_frame()], axis=1)
# Compute errors.
plz_model_df = plz_model_df.assign(error = lambda x: x['mean'] - x['num_kitas_plz'])

fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(
    x='num_kitas_plz',
    y='mean',
    data=plz_model_df,
    hue='district',
    palette='Paired',
    edgecolor='black',
    linewidth=0.5,
    size='spots_plz',
    ax=ax
)
ax.legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax.set(
    title='Number of Kitas Linear Model',
    xlabel='number of kitas',
    ylabel='prediction'
);

Finally, we plot the errors distribution:

error_mean = plz_model_df['error'].mean()
error_std = plz_model_df['error'].std()

fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 7), constrained_layout=True)
sns.histplot(x='error', data=plz_model_df, color=sns_c[9], kde=True, ax=ax[0])
ax[0].axvline(x=error_mean, color='black', linestyle='--', label=f'$\mu$')
ax[0].axvline(x=error_mean + 2*error_std, color='gray', linestyle='--', label=f'$\mu\pm2\sigma$')
ax[0].axvline(x=error_mean - 2*error_std, color='gray', linestyle='--')
sns.scatterplot(x='einwohner', y='error', data=plz_model_df, hue='district', palette='Paired', edgecolor='black', ax=ax[1])
ax[1].axhline(y=error_mean, color='black', linestyle='--', label=f'$\mu$')
ax[1].axhline(y=error_mean + 2*error_std, color='gray', linestyle='--', label=f'$\mu\pm2\sigma$')
ax[1].axhline(y=error_mean - 2*error_std, color='gray', linestyle='--')
ax[0].legend()
ax[1].legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax[0].set(title='Model Erros Distribution');

The model is not perfect, but is a good place to start for a deeper analysis: ideally more data about the district demographics will be added and Bayesian methods like the ones presented in Statistical Rethinking by Richard McElreath.