21 min read

Open Data: Berlin Kitas

In this notebook I want to explore some data I found on the Berlin Open Data portal daten.berlin.de. The data source contains information of Kitas (Kindertagesstätte, i.e. kindergartens) in Berlin. This is a big topic as finding a spot in a Kita in Berlin is extremely difficult. We first provide an initial exploratory data analysis of the data set, then we merge it with population data to create some geo-location maps.

Disclaimer: In our way we do not expect to answer on how to find a Kita, but rather simpler questions about their location, distribution and sizes.

Prepare Notebook

import numpy as np
import pandas as pd
import geopandas as gpd
import mplleaflet
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
sns.set_style(
    style='darkgrid', 
    rc={'axes.facecolor': '.9', 'grid.color': '.8'}
)
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
%matplotlib inline
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100

Read Data

Let us load the data from the orignial Excel file (which I store locally).

kitas_raw_df = pd.read_excel(
    io='../Data/kitaliste_aktuell.xlsx', 
    skiprows=4, 
    dtype={'PLZ': str}
)
kitas_raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2739 entries, 0 to 2738
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Betreuungsbezirk Nr  2739 non-null   int64  
 1   Betreuungsbezirk     2739 non-null   object 
 2   Einrichtungsnummer   2739 non-null   int64  
 3   Einrichtungsname     2739 non-null   object 
 4   Einrichtungsadresse  2739 non-null   object 
 5   PLZ                  2739 non-null   object 
 6   Ort                  2739 non-null   object 
 7   Telefon              2708 non-null   object 
 8   Anzahl Plätze        2736 non-null   float64
 9   Einrichtungsart      2739 non-null   object 
 10  Trägernummer         2739 non-null   int64  
 11  Trägername           2739 non-null   object 
 12  Trägerart            2739 non-null   object 
dtypes: float64(1), int64(3), object(9)
memory usage: 278.3+ KB

Format Data

Next, we format the data and prepare it for the analysis.

# We verify there is just one city.
kitas_raw_df['Ort'].unique()
array(['Berlin'], dtype=object)
# There is a 1-1 correspondece between `Betreuungsbezirk Nr` and `Betreuungsbezirk`.
kitas_raw_df[['Betreuungsbezirk Nr', 'Betreuungsbezirk']].drop_duplicates()
Betreuungsbezirk Nr Betreuungsbezirk
0 1 Mitte
329 2 Friedrichshain-Kreuzberg
620 3 Pankow
994 4 Charlottenburg-Wilmersdorf
1261 5 Spandau
1403 6 Steglitz-Zehlendorf
1602 7 Tempelhof-Schöneberg
1863 8 Neukölln
2096 9 Treptow-Köpenick
2287 10 Marzahn-Hellersdorf
2420 11 Lichtenberg
2581 12 Reinickendorf
# Rename columns from German to English.
rename_cols = {
    'Betreuungsbezirk': 'district',
    'Einrichtungsnummer': 'id',
    'Einrichtungsname': 'name',
    'Einrichtungsadresse': 'address',
    'PLZ': 'plz',                  
    'Telefon': 'telephone',       
    'Anzahl Plätze': 'spots'  ,
    'Einrichtungsart': 'type',
    'Trägernummer': 'carrier_number',
    'Trägername': 'carrier_name',  
    'Trägerart': 'carrier_type',
}

columns_to_drop = [
    'Betreuungsbezirk Nr', 
    'Ort',
]

# Format data: remove redundant columns, rename columns and add new features.
kitas_df = kitas_raw_df \
    .copy() \
    .drop(columns_to_drop, axis=1) \
    .rename(columns=rename_cols) \
    .assign(
        num_kitas_plz=lambda x: x.groupby(['district', 'plz'])['id'].transform('count'),
        spots_plz=lambda x: x.groupby(['district', 'plz'])['spots'].transform(np.sum)
    )
    
kitas_df.head()
district id name address plz telephone spots type carrier_number carrier_name carrier_type num_kitas_plz spots_plz
0 Mitte 1010010 Kita F.A.I.R.play Albrechtstr. 020 10117 281 64 73 69.0 Kindertagesstätte 1224 GFJ - gemeinnützige Gesellschaft für Familien-… Sonstiger freier Träger 21 1550.0
1 Mitte 1010020 Kita Kinderwelt An der Kolonnade 003-5 10117 2291378 155.0 Kindertagesstätte 1334 Forum Soziale Dienste GmbH Sonstiger freier Träger 21 1550.0
2 Mitte 1010030 FRÖBEL Kindergarten Casa Fantasia Anklamer Str. 038 10115 4498171 69.0 Kindertagesstätte 1227 Fröbel Bildung und Erziehung gGmbH Sonstiger freier Träger 24 1807.0
3 Mitte 1010080 Kita Regenbogen Fehrbelliner Str. 080 10119 449 32 38 91.0 Kindertagesstätte 1202 Pfefferwerk Stadtkultur gGmbH Sonstiger freier Träger 11 1150.0
4 Mitte 1010100 FRÖBEL Kindergarten “Schatzinsel” Fischerinsel 008 10179 201 37 88 241.0 Kindertagesstätte 1227 Fröbel Bildung und Erziehung gGmbH Sonstiger freier Träger 10 939.0

Exploratory Data Analysis

One of the main objectives of this notebook is to do an exploratory data analysis to understand which questions this data set can answer. In addition, also determine its limitations. To begin, let us get the number of unique values per feature.

kitas_df.apply(lambda x: x.unique().size, axis=0)
    district            12
    id                2739
    name              2597
    address           2707
    plz                190
    telephone         2601
    spots              200
    type                 4
    carrier_number    1224
    carrier_name      1224
    carrier_type         8
    num_kitas_plz       37
    spots_plz          194
    dtype: int64
# Dimensions data set.
kitas_df.shape

(2739, 13)

We have 2739 Kitas in our data set and the id column is indeed an unique identifier of each Kita. The name, on the other hand, is not.

kitas_df \
    .groupby(['name']) \
    .agg(num_kitas=('id', 'count')) \
    .query('num_kitas > 1') \
    .sort_values('num_kitas', ascending=False) \
    .head()
num_kitas
name
Kleiner Fratz 18
IB-Kita, Internationaler Bund 8
Kita/Kinder in Bewegung (KiB) 6
Kita Sonnenschein 5
Kita Kinder in Bewegung (KiB) 5

Kitas per District

We start by looking into the first layer of geo-location granularity: district (there are 12 districts in Berlin).

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        x='n', 
        y='district',
        color=sns_c[0],
        edgecolor=sns_c[0],
        ax=ax
    )
ax.set(
    title='Number of Kitas per District in Berlin', 
    xlabel='number of kitas', 
    ylabel='district'
);
svg

We generate the same plot but showing the shares instead.

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district']) \
    .agg(n=('id', 'count')) \
    .assign(share= lambda x: x['n'] / x['n'].sum()) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        x='share', 
        y='district',
        color=sns_c[1],
        edgecolor=sns_c[1],
        ax=ax
    )
ax.xaxis.set_major_formatter(mtick.FuncFormatter(lambda y, _: '{:0.0%}'.format(y)))
ax.set(
    title='Share of Kitas per District in Berlin', 
    xlabel='share of kitas', 
    ylabel='district'
);
svg

Pankow is a rather big district compared to Mitte and Friedrichshain-Kreuzberg area-wise. In order to run a fair comparison across districts, we need to control by size or population (which we will do bellow).

Kitas per PLZ

Now we dig deeper into plz level. A natural question is: How many Kitas each plz has? To answer it let us look into the distribution (again, in order to make a fair comparison we need to control by plz population or size).

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('name', 'count')) \
    .pipe((sns.histplot, 'data'),
        x='n', 
        bins=25, 
        kde=True,
        ax=ax
    )
ax.set(
    title='Number of Kitas per PLZ', 
     xlabel='number of kitas per plz',
     xlim=(0, None)
);
svg

We see there is a long tail. Let us see the plz with more Kitas:

fig, ax = plt.subplots(figsize=(12, 6))
kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .assign(plz = lambda x: 'plz_' + x['plz']) \
    .sort_values('n', ascending=False) \
    .head(40) \
    .pipe((sns.barplot, 'data'),  
        x='plz',
        y='n',
        hue='district',
        dodge=False,
        ax=ax
    )
ax.tick_params(axis='x', labelrotation=90)
ax.set(
    title='Number of Kitas per PlZ in Berlin (Top 40)', 
     xlabel='plz', 
     ylabel='number of kitas', 
);
svg

Let us locate plz = 10437 using Google maps.


This is a very trendy area in Berlin, specially for young families.

Next, let us plot the distribution of the number of Kitas per plz split by district.

g = kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('name', 'count')) \
    .pipe((sns.displot, 'data'), 
        x='n',
        hue='district',
        kind='kde',
        rug=True, 
        height=5,
        aspect=1.3,
        palette='Paired'
    )
ax = g.axes.flatten()[0]
ax.set(
    title='Number of Kitas per PlZ', 
     xlabel='number of kitas per plz',
     xlim=(0, None)
);
svg
fig, ax = plt.subplots(figsize=(12, 6))
kitas_df \
    .groupby(['district', 'plz']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('district') \
    .pipe((sns.boxplot, 'data'), 
        x='district', 
        y='n',
        color=sns_c[0],
        saturation=1.0,
        ax=ax
    )
ax.tick_params(axis='x', labelrotation=90)
ax.set(
    title='Number of Kitas per PlZ in Berlin within each District', 
     ylabel='number of kitas per plz', 
);
svg

It seems that, on average, Friedrichshain-Kreuzberg have more Kitas per plz and Pankow has the highest variance (again, we need to control by plz size to be more precise).

Spots per Kita

Let us now analyze the number of spots data.

# Some descriptive stats.
kitas_df['spots'].describe()

count 2736.000000 mean 67.617690 std 55.624663 min 0.000000 25% 25.000000 50% 45.000000 75% 95.000000 max 320.000000 Name: spots, dtype: float64

Note that there is a Kita wih spots=0. I am not sure about the reason behind this.

kitas_df.query('spots == 0')
district id name address plz telephone spots type carrier_number carrier_name carrier_type num_kitas_plz spots_plz
1122 Charlottenburg-Wilmersdorf 4073100 Dickenskita Dickensweg 017-19 14055 35109330 0.0 Kindertagesstätte 8435 Berlin British School gGmbH Sonstiger freier Träger 11 500.0

We plot the spots distribution:

spots_mean = kitas_df['spots'].describe()['mean']
spots_median = kitas_df['spots'].describe()['50%']

fig, ax = plt.subplots()
sns.histplot(x='spots', data=kitas_df, kde=True, color=sns_c[3], ax=ax)
ax.axvline(x=spots_mean, color=sns_c[1], linestyle='--', label=f'mean = {spots_mean: 0.0f}')
ax.axvline(x=spots_median, color=sns_c[4], linestyle='--', label=f'median = {spots_median: 0.0f}')
ax.legend(loc='upper right')
ax.set(
    title='Number of Spots per Kita', 
     xlabel='number of spots per kita',
     xlim=(0, None)
);
svg

Here is a list of the top 20 kitas with more spots:

kitas_df \
    [['name', 'district', 'plz', 'spots']] \
    .sort_values('spots', ascending=False) \
    .head(20) 
name district plz spots
54 BCS Kindergarten Preschool Mitte 10115 320.0
2441 Kita Lichtenzwerge Lichtenberg 10315 300.0
2428 Kita Bärenkinder Lichtenberg 10319 300.0
2543 Kita Neustrelitzer Str. 32-34/Kigä NordOst Lichtenberg 13055 285.0
2113 FRÖBEL Kindergarten-im-Grünen Treptow-Köpenick 12487 282.0
647 Klax Krippen Regentropfenhaus und Wolkenzwerge… Pankow 10439 280.0
1885 Kita Mini-Mix-International Neukölln 12053 280.0
351 FRÖBEL Kindergarten Am Ring Friedrichshain-Kreuzberg 10247 272.0
14 FRÖBEL Kindergarten “Traumzauberbaum” Mitte 10178 260.0
2590 Kita Treuenbrietzener Straße Reinickendorf 13439 252.0
2541 Kita Nido Piccolo Lichtenberg 13059 250.0
2566 Kita Tausendfüßler Lichtenberg 13055 250.0
329 Kita Spiel- und Erlebniswelt Friedrichshain-Kreuzberg 10243 248.0
346 FRÖBEL Kindergarten FRÖBELSPATZEN Friedrichshain-Kreuzberg 10243 246.0
864 Kita Am Brennerberg Pankow 13187 245.0
4 FRÖBEL Kindergarten “Schatzinsel” Mitte 10179 241.0
2115 FRÖBEL Kindergarten Wirbelwind Treptow-Köpenick 12435 240.0
335 Kita Gryphiusstr. 34-35 Friedrichshain-Kreuzberg 10245 240.0
2460 Kita Bunte Plonzstifte Lichtenberg 10365 240.0
901 Klax Krippe Sonnenhaus,Klax Kindergarten Wolke… Pankow 13189 240.0

Let us see the spots distributoin per district:

g = kitas_df \
    .pipe((sns.displot, 'data'), 
        x='spots',
        hue='district',
        kind='kde',
        rug=True, 
        height=5,
        aspect=1.3,
        palette='Paired'
    )
ax = g.axes.flatten()[0]
ax.set(
    title='Number of Spots per Kita', 
    xlabel='number of spots', 
    xlim=(0, None)
);
svg
fig, ax = plt.subplots(figsize=(12, 6))
kitas_df \
    .sort_values('district') \
    .pipe((sns.boxplot, 'data'), 
        x='district', 
        y='spots', 
        color=sns_c[3],
        saturation=1.0,
        ax=ax
    )
ax.axhline(y=kitas_df['spots'].describe()['mean'], color=sns_c[1], linestyle='--', label='mean')
ax.axhline(y=kitas_df['spots'].describe()['50%'], color=sns_c[4], linestyle='--', label='median')
ax.tick_params(axis='x', labelrotation=90)
ax.legend(loc='upper left')
ax.set(
    title='Number of Spots per Kita', 
     ylabel='number of spots', 
);
svg

Note that Pankow, Mitte and Friedrichshain-Kreuzberg have a lowe number of spots per Kita (these are the top 3 districts with mote Kitas). We want to see if there is a linear relation between the number of Kitas per plz and the number (median) of spots per Kita.

g = kitas_df \
    .groupby(['district', 'plz']) \
    .agg({'num_kitas_plz': np.mean, 'spots': np.median}) \
    .pipe((sns.jointplot, 'data'), 
        x='num_kitas_plz',  
        y='spots',
        kind='reg', 
        order=1,
        height=8,
        ratio=3, 
        marginal_kws=dict(color=sns_c[1])
    )
g.plot_joint(sns.kdeplot, zorder=0, levels=6, color=sns_c[3], alpha=0.7)
g.set_axis_labels(xlabel='number of kitas per plz', ylabel='median spots per plz')
g.fig.suptitle('Number of Kitas vs Spots per PLZ', y=1.03);
png

It seems there is a negative relation between them, i.e. plz with more Kitas seem to have less spots per Kita. Let us plot this per district:

g = kitas_df \
    .groupby(['district', 'plz'], as_index=False) \
    .agg({'num_kitas_plz': np.mean, 'spots': np.median}) \
    .pipe((sns.FacetGrid, 'data'),
        col='district', 
        col_wrap=4, 
        height=3,
    )
g.map(sns.regplot, 'num_kitas_plz', 'spots')
g.fig.suptitle('Number of Kitas vs Spots per PLZ', y=1.03);
MyImage

Carriers

Now we want to study the carrier feature. First, let us count the number of kitas per carrier_type.

fig, ax = plt.subplots()
kitas_df \
    .groupby(['carrier_type']) \
    .agg(n=('id', 'count')) \
    .reset_index() \
    .sort_values('n', ascending=False) \
    .assign(share= lambda x: x['n'] / x['n'].sum()) \
    .pipe((sns.barplot, 'data'), 
        x='share', 
        y='carrier_type',
        color=sns_c[2],
        edgecolor=sns_c[2],
        ax=ax
    )
fmt = lambda y, _: f'{y :0.0%}'
ax.xaxis.set_major_formatter(mtick.FuncFormatter(fmt))
ax.set(
    title='Number of Kitas per Carrier Type', 
    xlabel='number of kitas', 
    ylabel='carrier type'
);
MyImage

Here are some explanations on the carrier_type:

  • Arbeiterwohlfahrt: Workers welfare
  • Betriebskita: Company daycare
  • Deutscher Caritasverband: German Caritas Association
  • Eigenbetrieb: Own operation
  • EKT (Eltern-Initiativ-Kindertagesstätte): Parents initiative.
  • Diakonisches Werk: Diaconal work
  • private Kita ohne Abschluss der Rahmenvereinbarung: private daycare without signing the framework agreement
  • Sonstiger freier Träger: Other private carrier
# Note that there is. a 1-1 correspondence between `carrier_number` and `carrier_name`.
kitas_df \
    .groupby('carrier_number')\
    .agg({'carrier_name': 'nunique'}) \
    .query('carrier_name > 1') \
    .shape

(0, 1)

Now we plot the carrier_type distriibution per district.

fig, ax = plt.subplots(figsize=(15, 6))
kitas_df \
    .groupby(['district', 'carrier_type']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        y='n', 
        x='district',
        hue='carrier_type',
        edgecolor='black',
        dodge=True,
        ax=ax
    )
ax.tick_params(axis='x', labelrotation=90)
ax.set(
    title='Number of Kitas per District in Berlin (per Carrier Type)', 
    xlabel='district', 
    ylabel='number of kitas'
);
MyImage

We can also compute carrier_type share per district:

fig, ax = plt.subplots()
cmap = sns.light_palette(sns_c[0])
fmt = lambda y, _: f'{y :0.0%}'
kitas_df \
    .groupby(['district', 'carrier_type']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .assign(
        num_kitas_district = lambda x: x.groupby(['district'])['n'].transform(np.sum),
        share = lambda x: x['n'].div(x['num_kitas_district'])
    ) \
    .pivot(index='district', columns='carrier_type', values='share') \
    .fillna(0.0) \
    .pipe((sns.heatmap, 'data'), 
        vmin=0.0,
        vmax=1.0,
        cmap=cmap,
        linewidths=0.2, 
        linecolor='black',
        annot=True, 
        fmt='0.2%',
        cbar_kws={'format': mtick.FuncFormatter(fmt)},
        ax=ax
    )
ax.set(title='Carrier Type Share per District');
svg

It is very interesting to see how high the share of EKT in Charlottenburg-Wilmersdorf and Friedrichshain-Kreuzberg.

Now we want to investigate the carriers coverage on the number of kitas. The questions we want to answer are:

  1. Do carriers with more than one Kita dominate the Kita distribution?

  2. Do we see a difference across districts?

To answer both question we need to compute how many Kitas each carrier has. In addition,we sort them out by the number of Kitas calculate the (cumulative) share on the number of Kitas. This will allow us to answer the first question.

carrier_df = kitas_df \
    .groupby(['carrier_name']) \
    .agg(num_kitas=('name', pd.Series.nunique)) \
    .sort_values('num_kitas', ascending=False) \
    .assign(
        single_kita = lambda x: x['num_kitas'] < 2,
        share = lambda x: x['num_kitas'] / x['num_kitas'].sum(),
        cumsum_share = lambda x: x['share'].cumsum(),
        idx= lambda x: range(x.shape[0])
    )

carrier_df.head()
num_kitas single_kita share cumsum_share idx
carrier_name
Kindergärten NordOst 77 False 0.029079 0.029079 0
Kindertagesstätten Nordwest Eigenbetrieb von Berlin 63 False 0.023792 0.052870 1
Kindergärten City 56 False 0.021148 0.074018 2
Kindertagesstätten SüdOst Eigenbetrieb von Berlin 44 False 0.016616 0.090634 3
Kindertagesstätten Berlin Süd-West 37 False 0.013973 0.104607 4

Let us compute the distribution of carriers with a single Kita vs carriers with more than one Kita.

carrier_df \
    .groupby('single_kita')\
    .agg(n=('idx', 'count')) \
    .assign(share=lambda x: x['n'].div(x['n'].sum()))
n share
single_kita
False 365 0.298203
True 859 0.701797

Next we plot the cumulative share data on the number of Kitas:

k = 200 
k_cumsum_share = carrier_df.query(f'idx == {k - 1}')['cumsum_share'].values[0]
idx_single_kita = carrier_df.query('single_kita').iloc[0]['idx']
cumsum_share_single_kita = carrier_df.query('single_kita').iloc[0]['cumsum_share']

fig, ax = plt.subplots()
sns.lineplot(x='idx', y='cumsum_share', data=carrier_df, color=sns_c[0], label='cumulative sum share', ax=ax)
ax.axvline(x=idx_single_kita, color=sns_c[4], linestyle='--', label=f'{idx_single_kita} carriers have more than one Kita ({cumsum_share_single_kita: 0.2%})')
ax.axhline(y=cumsum_share_single_kita, color=sns_c[4], linestyle='--')
ax.axvline(x=k, color=sns_c[3], linestyle='--', label=f'{k} carriers account for {k_cumsum_share: 0.2%} of the Kitas')
ax.axhline(y=k_cumsum_share , color=sns_c[3], linestyle='--')
ax.legend(loc='lower right')
ax.set(
    title='Cumulative Share of Top Kita Carriers', 
    xlabel='carrier rank (based on number of Kitas)',
    ylabel='cumsum share'
);
svg

We have two main takeaways from this plot:

  1. 356 carriers, which are ~ 30% of the total number of carriers, cover ~ 67% of the number of Kitas.

  2. The top 200 carriers, which are ~ 16% of the total number of carriers, cover more than half (~ 55%) of the number of Kitas.

Hence, the carriers. with more than one Kita have a very high coverage.

# We merge the carriers data to. the Kitas data.
kitas_df = pd.merge(
    left=kitas_df, 
    right=carrier_df['single_kita'].reset_index(), 
    on='carrier_name', 
    how='left'
)

Now let us answer the second question by looking into the districts.

fig, ax = plt.subplots()
kitas_df \
    .groupby(['district', 'single_kita']) \
    .agg(n=('id', 'count')) \
    .reset_index(drop=False) \
    .sort_values('n', ascending=False) \
    .pipe((sns.barplot, 'data'), 
        x='n', 
        y='district',
        hue='single_kita',
        edgecolor='black',
        palette=[sns_c[3], sns_c[0]],
        ax=ax
    )
ax.set(
    title='Number of Kitas per District in Berlin (Carrier with/without Single Kita)', 
    xlabel='number of kitas', 
    ylabel='district'
);
svg

We now plot the relative share over districts of number of Kitas ans spots (which are of course correlated):

carrier_kitas_df = kitas_df \
    .groupby(['district', 'single_kita']) \
    .agg(n=('id', 'count'), spots=('spots', np.sum)) \
    .reset_index(drop=False) \
    .sort_values('district') \
    .assign(
        total_n = lambda x: x.groupby('district')['n'].transform(np.sum),
        total_spots = lambda x: x.groupby('district')['spots'].transform(np.sum),
        share_n = lambda x: x['n'].div(x['total_n']),
        share_spots = lambda x: x['spots'].div(x['total_spots']),
    )
fig, ax = plt.subplots(1, 2, constrained_layout=True)
cmap0 = sns.light_palette(sns_c[1])
cmap1 = sns.light_palette(sns_c[4])
fmt = lambda y, _: f'{y :0.0%}'
carrier_kitas_df \
    .pivot(index='district', columns='single_kita', values='share_n') \
    .pipe((sns.heatmap, 'data'), 
        vmin=0.0,
        vmax=1.0,
        cmap=cmap0,
        linewidths=0.2, 
        linecolor='black',
        annot=True, 
        fmt='0.2%',
        cbar_kws={'format': mtick.FuncFormatter(fmt)},
        ax=ax[0]
    )

carrier_kitas_df \
    .pivot(index='district', columns='single_kita', values='share_spots') \
    .pipe((sns.heatmap, 'data'), 
        vmin=0.0,
        vmax=1.0,
        cmap=cmap1,
        linewidths=0.2, 
        linecolor='black',
        annot=True, 
        fmt='0.2%',
        cbar_kws={'format': mtick.FuncFormatter(fmt)},
        ax=ax[1]
    )

ax[0].set(title='Number of Kitas Share', xlabel='carrier single kita')
ax[1].set(title='Number of Spots Share', xlabel='carrier single kita', ylabel='');
svg
  • In Charlottenburg-Wilmersdorf the share of carriers with/without single Kitas is quite balanced whereas in Marzahn-Hellersdorf and Treptow-Köpenick is quite unbalanced.

This ends the an initial exploratory data analysis of the Kitas data. There are still many interesting things to look, but we will leave them for a second iteration.

Maps & Population Data

Next we enrich out data set by including information about population per plz. In addition, we use maps to visualize some metrics. Please refer to the blog post Open Data: Germany Maps Viz for the description of the data sources.

  • Geo-location Data: geo-pandas dataframe with plz geometries.
plz_shape_df = gpd.read_file('../Data/plz-gebiete.shp', dtype={'plz': str})

plz_shape_df.head()
plz note geometry
0 52538 52538 Gangelt, Selfkant POLYGON ((5.86632 51.05110, 5.86692 51.05124, …
1 47559 47559 Kranenburg POLYGON ((5.94504 51.82354, 5.94580 51.82409, …
2 52525 52525 Waldfeucht, Heinsberg POLYGON ((5.96811 51.05556, 5.96951 51.05660, …
3 52074 52074 Aachen POLYGON ((5.97486 50.79804, 5.97495 50.79809, …
4 52531 52531 Übach-Palenberg POLYGON ((6.01507 50.94788, 6.03854 50.93561, …
  • Population Data
plz_einwohner_df = pd.read_csv(
    '../Data/plz_einwohner.csv',
    sep=',',
    dtype={'plz': str, 'einwohner': int}
)

plz_einwohner_df.head()
plz einwohner
0 01067 11957
1 01069 25491
2 01097 14811
3 01099 28021
4 01108 5876
  • Merge Data

We merge these data sets on plz.

plz_df = pd.merge(
    left=plz_shape_df[['plz', 'geometry']],
    right=plz_einwohner_df,
    on='plz',
    how='left'
)
# Merge with Kitas data.
plz_df = pd.merge(
    left=plz_df,
    right=kitas_df[['district', 'plz', 'num_kitas_plz', 'spots_plz']].drop_duplicates(),
    on='plz',
    how='inner'
)
# Add features: number of kitas per plz per capita (and its log-transform).
plz_df = plz_df.assign(
    num_kitas_plz_pc = lambda x: x['num_kitas_plz'].div(x['einwohner']), 
    num_kitas_plz_pc_log = lambda x: np.log(x['num_kitas_plz_pc'])
)

plz_df.head()
plz geometry einwohner district num_kitas_plz spots_plz num_kitas_plz_pc num_kitas_plz_pc_log
0 14109 POLYGON ((13.08835 52.41963, 13.09584 52.42198… 10049 Steglitz-Zehlendorf 8 357.0 0.000796 -7.135787
1 14089 POLYGON ((13.10929 52.45063, 13.10956 52.45108… 17734 Spandau 15 1112.0 0.000846 -7.075189
2 13591 POLYGON ((13.11738 52.51706, 13.11811 52.52010… 26762 Spandau 16 1595.0 0.000598 -7.422150
3 13587 POLYGON ((13.12796 52.58313, 13.12934 52.58593… 20108 Spandau 14 1113.0 0.000696 -7.269816
4 13593 POLYGON ((13.14288 52.52181, 13.14306 52.52179… 20238 Spandau 9 997.0 0.000445 -7.718093

First, let us visualize the plz per district in Berlin (we plot both static and dynamic maps).

fig, ax = plt.subplots(figsize=(12, 9))
plz_df.plot(
    ax=ax,
    column='district',
    categorical=True,
    cmap='Paired',
    edgecolor='black',
    linewidth=0.3,
    legend=True,
    legend_kwds={'title':'Districts', 'loc': 'center left', 'bbox_to_anchor': (1, 0.5)},
)
ax.set(
    title='Berlin PLZ',
    aspect=1.3
);
MyImage
fig, ax = plt.subplots(figsize=(12, 9))
plz_df.plot(
    ax=ax,
    column='district',
    categorical=True,
    cmap='Paired',
    edgecolor='black',
    linewidth=0.3,
)
mplleaflet.display(fig=fig)

We clearly see that the districts size in area are very different. Note for example how small Friedrichshain-Kreuzberg and Mitte (both in top 3 districs with more Kitas) are as compared with Treptow-Köpenick.

Next we plot the number if inhabitants per plz.

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='einwohner',  
    categorical=False, 
    cmap='plasma_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - Population per PLZ',
    aspect=1.3
);
MyImage
fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='einwohner',  
    categorical=False, 
    cmap='plasma_r',
    edgecolor='black',
    linewidth=0.3
)
mplleaflet.display(fig=fig)

Now we plot the number of Kitas per plz.

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='num_kitas_plz',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - Number of Kitas per PLZ',
    aspect=1.3
);
MyImage
fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='num_kitas_plz',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
mplleaflet.display(fig=fig)

As we see, comparing plz data without controlling for population can lead to misleading conclusions. Let us not compute the number of Kitas per capita per district.

fig, ax = plt.subplots()
plz_df \
    .groupby(['district'], as_index=False) \
    .agg({'einwohner': np.sum, 'num_kitas_plz': np.sum}) \
    .assign(num_kitas_district_pc = lambda x: x['num_kitas_plz'].div(x['einwohner'])) \
    .sort_values('num_kitas_district_pc', ascending=False) \
    .pipe((sns.barplot, 'data'),
        x='num_kitas_district_pc',
        y='district',
        color=sns_c[4],
        edgecolor=sns_c[4],
        ax=ax,
    )
ax.set(
    title='Number of Kitas per District per Capita in Berlin',
    xlabel='number of kitas per capita',
    ylabel='district',
);
svg

We see that the top 3 districts remain, but now Friedrichshain-Kreuzberg and Mitte are at the top.

Next, we plot the number of Kitas per capita per plz (log-transform).

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.plot(
    ax=ax,
    column='num_kitas_plz_pc_log',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - (log) Number of Kitas per PLZ per Capita',
    aspect=1.3
);
MyImage

We immediately see that there is a rather small plz with a high (log) number of Kitas. Let us find which one it is:

plz_df.sort_values('num_kitas_plz_pc', ascending=False).drop('geometry', axis=1).head()
plz einwohner district num_kitas_plz spots_plz num_kitas_plz_pc num_kitas_plz_pc_log
18 14053 139 Charlottenburg-Wilmersdorf 3 70.0 0.021583 -3.835862
44 10709 8771 Charlottenburg-Wilmersdorf 18 824.0 0.002052 -6.188834
55 12165 4803 Steglitz-Zehlendorf 9 508.0 0.001874 -6.279771
146 10997 23800 Friedrichshain-Kreuzberg 42 1707.0 0.001765 -6.339771
68 10715 15464 Charlottenburg-Wilmersdorf 27 1027.0 0.001746 -6.350433

We not locate the plz = 14053.


This plz does not have many inhabitants, as it mainly encloses the Olympiastadion. We simply remove this data point from the map inorder to see a more faithful picture of the number of Kitas per capita per plz.

fig, ax = plt.subplots(figsize=(15, 10))
plz_df.query('plz != "14053"').plot(
    ax=ax,
    column='num_kitas_plz_pc',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
ax.set(
    title='Berlin - Number of Kitas per PLZ per Capita',
    aspect=1.3
);
MyImage
fig, ax = plt.subplots(figsize=(15, 10))
plz_df.query('plz != "14053"').plot(
    ax=ax,
    column='num_kitas_plz_pc',
    categorical=False,
    cmap='viridis_r',
    edgecolor='black',
    linewidth=0.3
)
mplleaflet.display(fig=fig)

Number of Kitas, Population, Spots and Districts Relations

Finally, let us study the relations between number of kitas, spots and population per plz. These variables are naturally correlated:

corr_mat = plz_df[['einwohner', 'num_kitas_plz', 'spots_plz']].corr()
fig, ax = plt.subplots(figsize=(6, 6))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.heatmap(
    data=corr_mat, 
    vmin=-1.0, 
    vmax=1.0, 
    center=0, 
    cmap=cmap, 
    square=True,
    linewidths=0.5, 
    linecolor='k',
    annot=True, 
    fmt='.3f',
    ax=ax
)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, horizontalalignment='right')
ax.set_xticklabels(ax.get_yticklabels(), horizontalalignment='center')
ax.set(title='Correlarion Matrix');
svg

We can visualize them in a scatter plot.

fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(
    x='einwohner',
    y='num_kitas_plz',
    data=plz_df,
    hue='spots_plz',
    palette='viridis_r',
    edgecolor='black',
    linewidth=0.5,
    ax=ax
)
ax.legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax.set(
    title='Number of Kitas vs Inhabitants per PLZ',
    xlabel='population',
    ylabel='number of kitas'
);
svg

Let us encode the district as color and spots as size.

fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(
    x='einwohner',
    y='num_kitas_plz',
    data=plz_df,
    hue='district',
    palette='Paired',
    edgecolor='black',
    linewidth=0.5,
    size='spots_plz',
    ax=ax
)
ax.legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax.set(
    title='Number of Kitas vs Inhabitants per PLZ',
    xlabel='population',
    ylabel='number of kitas'
);
svg

Another way to represent their relation is using a 3D-plot.

from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import to_hex

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(
    xs=plz_df['einwohner'], 
    ys=plz_df['num_kitas_plz'], 
    zs=plz_df['spots_plz'], 
    c=to_hex(sns_c[0])
)
ax.set_title('Population, Number of Kitas ans Spots per PLZ iin Berlin', y=1.1);
ax.set(xlabel='population', ylabel='number of kitas', zlabel='spots');
svg

Linear Regression Model

A natural question is: How do the relations above depend on the district? We of course expect different demographic distributions over different areas of the city. In this first iteration we run a simple linear regression model to estimate the variance explained of num_kitas_plz as a linear function of einwohner and district.

import statsmodels.formula.api as smf

model = smf.ols(formula='num_kitas_plz ~ einwohner + C(district)', data=plz_df)
result = model.fit()
print(result.summary())
    OLS Regression Results                            
    ==============================================================================
    Dep. Variable:          num_kitas_plz   R-squared:                       0.455
    Model:                            OLS   Adj. R-squared:                  0.423
    Method:                 Least Squares   F-statistic:                     14.00
    Date:                Sat, 19 Sep 2020   Prob (F-statistic):           6.33e-21
    Time:                        10:44:02   Log-Likelihood:                -696.16
    No. Observations:                 214   AIC:                             1418.
    Df Residuals:                     201   BIC:                             1462.
    Df Model:                          12                                         
    Covariance Type:            nonrobust                                         
    ===========================================================================================================
                                                  coef    std err          t      P>|t|      [0.025      0.975]
    -----------------------------------------------------------------------------------------------------------
    Intercept                                   0.8849      1.529      0.579      0.563      -2.130       3.900
    C(district)[T.Friedrichshain-Kreuzberg]     6.0192      2.399      2.509      0.013       1.289      10.750
    C(district)[T.Lichtenberg]                 -3.8870      2.365     -1.644      0.102      -8.550       0.776
    C(district)[T.Marzahn-Hellersdorf]         -6.0663      2.304     -2.633      0.009     -10.610      -1.523
    C(district)[T.Mitte]                        2.2961      1.903      1.206      0.229      -1.457       6.049
    C(district)[T.Neukölln]                    -1.1441      2.067     -0.554      0.581      -5.219       2.931
    C(district)[T.Pankow]                       1.1004      2.062      0.534      0.594      -2.966       5.167
    C(district)[T.Reinickendorf]               -3.6003      2.117     -1.700      0.091      -7.776       0.575
    C(district)[T.Spandau]                     -3.0675      2.293     -1.338      0.183      -7.589       1.454
    C(district)[T.Steglitz-Zehlendorf]         -3.5661      1.932     -1.846      0.066      -7.376       0.244
    C(district)[T.Tempelhof-Schöneberg]        -2.4554      1.751     -1.402      0.162      -5.908       0.997
    C(district)[T.Treptow-Köpenick]            -1.0303      2.120     -0.486      0.627      -5.210       3.149
    einwohner                                   0.0008   8.15e-05      9.586      0.000       0.001       0.001
    ==============================================================================
    Omnibus:                       13.969   Durbin-Watson:                   1.951
    Prob(Omnibus):                  0.001   Jarque-Bera (JB):               30.795
    Skew:                           0.237   Prob(JB):                     2.06e-07
    Kurtosis:                       4.797   Cond. No.                     2.02e+05
    ==============================================================================
    
    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 2.02e+05. This might indicate that there are
    strong multicollinearity or other numerical problems.

We now look into the model predictions.

# Collect predictions.
plz_model_df = pd.concat([plz_df, result.get_prediction(plz_df).summary_frame()], axis=1)
# Compute errors.
plz_model_df = plz_model_df.assign(error = lambda x: x['mean'] - x['num_kitas_plz'])
fig, ax = plt.subplots(figsize=(10, 8))
sns.scatterplot(
    x='num_kitas_plz',
    y='mean',
    data=plz_model_df,
    hue='district',
    palette='Paired',
    edgecolor='black',
    linewidth=0.5,
    size='spots_plz',
    ax=ax
)
ax.legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax.set(
    title='Number of Kitas Linear Model',
    xlabel='number of kitas',
    ylabel='prediction'
);
svg

Finally, we plot the errors distribution:

error_mean = plz_model_df['error'].mean()
error_std = plz_model_df['error'].std()

fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 7), constrained_layout=True)
sns.histplot(x='error', data=plz_model_df, color=sns_c[9], kde=True, ax=ax[0])
ax[0].axvline(x=error_mean, color='black', linestyle='--', label=f'$\mu$')
ax[0].axvline(x=error_mean + 2*error_std, color='gray', linestyle='--', label=f'$\mu\pm2\sigma$')
ax[0].axvline(x=error_mean - 2*error_std, color='gray', linestyle='--')
sns.scatterplot(x='einwohner', y='error', data=plz_model_df, hue='district', palette='Paired', edgecolor='black', ax=ax[1])
ax[1].axhline(y=error_mean, color='black', linestyle='--', label=f'$\mu$')
ax[1].axhline(y=error_mean + 2*error_std, color='gray', linestyle='--', label=f'$\mu\pm2\sigma$')
ax[1].axhline(y=error_mean - 2*error_std, color='gray', linestyle='--')
ax[0].legend()
ax[1].legend(**{'loc': 'center left', 'bbox_to_anchor': (1, 0.5)})
ax[0].set(title='Model Erros Distribution');
MyImage

The model is not perfect, but is a good place to start for a deeper analysis: ideally more data about the district demographics will be added and Bayesian methods like the ones presented in Statistical Rethinking by Richard McElreath.