Dr. Juan Camilo Orduz
/
Recent content on Dr. Juan Camilo Orduz
Experimentation, Non-Compliance and Instrumental Variables with PyMC
/iv_pymc/
Mon, 20 Feb 2023 00:00:00 +0000
In this notebook we present an example of how to use PyMC to estimate the effect of a treatment in an experiment where there is non-compliance through the use of instrumental variables.
By non-compliance we mean that the treatment assignment does not guarantee that the treatment is actually received by the treated. The main challenge is that we cannot simply estimate the treatment effect as a difference in means, since the non-compliance mechanism is usually not random and may introduce confounders.
Cohort Revenue & Retention Analysis: A Bayesian Approach
/revenue_retention/
Mon, 23 Jan 2023 00:00:00 +0000
In this notebook we extend the cohort retention model presented in the post Cohort Retention Analysis with BART so that we model retention and revenue per cohort simultaneously (we recommend reading the referenced post before this one). The idea is to keep modeling the retention using a Bayesian Additive Regression Tree (BART) model (see pymc-bart) and linearly model the revenue per cohort using a Gamma distribution. We couple the retention and revenue components in a similar way as presented in the notebook Introduction to Bayesian A/B Testing.
Cohort Retention Analysis with BART
/retention_bart/
Mon, 02 Jan 2023 00:00:00 +0000
In this notebook we study an alternative approach for the cohort analysis problem presented in A Simple Cohort Retention Analysis in PyMC. Instead of using a linear model to estimate the retention rate, we use a Bayesian Additive Regression Tree (BART) model (see pymc-bart). The BART model is a flexible non-parametric model that can be used to model complex relationships between the response and the predictors.
Prepare Notebook
import arviz as az
import matplotlib.
A Simple Cohort Retention Analysis in PyMC
/retention/
Tue, 20 Dec 2022 00:00:00 +0000
In this notebook we present a simple approach to study cohort retention analysis through a simulated data set. The aim is to understand how retention rates change over time and provide a simple model to predict them (with uncertainty estimates!). We do not expect this technique to be a silver bullet for all retention problems, but rather a simple approach to get started with the problem.
Remark: A motivation for this notebook was the great post Bayesian Age/Period/Cohort Models in Python with PyMC by Austin Rochford.
Geo-Experimentation via Time Based Regression in PyMC
/time_based_regression_pymc/
Thu, 01 Dec 2022 00:00:00 +0000
Introduction
In this notebook I describe and present an implementation of the time based regression (TBR) approach to marketing campaign analysis in the context of geo experimentation presented in the paper Estimating Ad Effectiveness using Geo Experiments in a Time-Based Regression Framework by Jouni Kerman, Peng Wang and Jon Vaver (Google, Inc. 2017). I strongly recommend reading the paper as it is quite clear in the exposition of the approach and presents some simulation results.
Offline Campaign Analysis Measurement: A journey through causal impact, geo-experimentation and synthetic control
/wolt_ds_meetup/
Tue, 25 Oct 2022 00:00:00 +0000
In October 2022 I had the opportunity to give a talk at the Helsinki Data Science Meetup hosted by Wolt. Here I want to share the recording of my talk.
Abstract: The talk will show how to measure offline campaigns using causal inference techniques. In particular it’ll focus on tapping into the potential of synthetic control, geo-experiments via time-based regression, and Google’s Causal-Impact Method.
Code to generate data
You can find the raw data here and the code here.
Scikit-Learn Example in PyMC: Gaussian Process Classifier
/sklearn_pymc_classifier/
Sat, 24 Sep 2022 00:00:00 +0000
In this notebook we want to describe how to port a couple of classification examples from scikit-learn’s documentation (classifier comparison) to PyMC. We focus on the classic moons synthetic dataset.
Prepare Notebook
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc.sampling_jax
import seaborn as sns
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
plt.
Synthetic Control in PyMC
/synthetic_control_pymc/
Tue, 09 Aug 2022 00:00:00 +0000
Synthetic control can be considered “the most important innovation in the policy evaluation literature in the last few years” (see The State of Applied Econometrics: Causality and Policy Evaluation by Susan Athey and Guido W. Imbens).
In this notebook we provide an example of how to implement a synthetic control problem in PyMC to answer a “what if this had happened?” type of question in the context of causal inference.
Modeling Short Time Series with Prior Knowledge in PyMC
/short_time_series_pymc/
Tue, 19 Jul 2022 00:00:00 +0000
In this notebook I want to reproduce in PyMC the methodology described in the amazing blog post Modeling Short Time Series with Prior Knowledge by Tim Radtke to forecast short time series using Bayesian transfer learning 🚀. The main idea is to transfer information (e.g. long term seasonality) from a long time series to a short time series via prior distributions. Tim’s blog post works through a concrete example that makes all the concepts tangible.
Time-Varying Regression Coefficients via Gaussian Random Walk in PyMC
/bikes_pymc/
Sun, 03 Jul 2022 00:00:00 +0000
In this notebook we want to illustrate how to use PyMC to fit a time-varying coefficient regression model. The motivation comes from the post Exploring Tools for Interpretable Machine Learning, where we studied a time series problem, the prediction of the number of bike rentals, from a machine learning perspective. Concretely, we fitted and compared two machine learning models: a linear regression with interactions and a gradient boosting model (XGBoost).
Data Talks Club: Machine Learning in Marketing
/machine_learning_marketing/
Tue, 17 May 2022 00:00:00 +0000
On Friday 13th of May 2022 I was invited to join Alexey Grigorev in an event organised by DataTalks.Club to talk about Machine Learning in Marketing. It was a really insightful discussion and I would like to thank the organizers who made it possible. Here is the recording:
Here are some useful links and resources about the subject:
Relevant blog posts I have written about the subject:
PyConDE & PyData Berlin 2022: Introduction to Uplift Modeling
/uplift/
Mon, 11 Apr 2022 00:00:00 +0000
In this notebook we present a simple example of uplift modeling estimation via meta-models using causalml and scikit-uplift. For a more detailed introduction to uplift modeling, see:
Diemert, Eustache, et al. (2020) “A Large Scale Benchmark for Uplift Modeling”
Gutierrez, P., & Gérardy, J. Y. (2017). “Causal Inference and Uplift Modelling: A Review of the Literature”
Karlsson, H. (2019) “Uplift Modeling: Identifying Optimal Treatment Group Allocation and Whom to Contact to Maximize Return on Investment”
Gamma-Gamma Model of Monetary Value in PyMC
/gamma_gamma_pymc/
Tue, 29 Mar 2022 00:00:00 +0000
In this notebook we describe how to fit Fader and Hardie’s gamma-gamma model presented in the paper “RFM and CLV: Using Iso-value Curves for Customer Base Analysis” and the note “The Gamma-Gamma Model of Monetary Value”. The approach is very similar to the one presented in the previous post BG/NBD Model in PyMC, where we simply ported the log-likelihood of the lifetimes package from numpy to theano.
Prepare Notebook
import arviz as az
import matplotlib.
BG/NBD Model in PyMC
/bg_nbd_pymc/
Thu, 03 Mar 2022 00:00:00 +0000
In this notebook we show how to port the BG/NBD model from the lifetimes package (developed mainly by Cameron Davidson-Pilon) to pymc. The BG/NBD model, introduced in the seminal paper “Counting Your Customers” the Easy Way: An Alternative to the Pareto/NBD Model by Peter S. Fader, Bruce G. S. Hardie and Ka Lok Lee in 2005, is used to
predict future purchasing patterns, which can then serve as an input into “lifetime value” calculations, in the “non-contractual” setting (i.
Media Effect Estimation with PyMC: Adstock, Saturation & Diminishing Returns
/pymc_mmm/
Fri, 11 Feb 2022 00:00:00 +0000
In this notebook we present a concrete example of estimating the media effects via Bayesian methods, following the strategy outlined in Google’s paper Jin, Yuxue, et al. “Bayesian methods for media mix modeling with carryover and shape effects.” (2017). This example can be considered the continuation of the post Media Effect Estimation with Orbit’s KTR Model. However, it is not strictly necessary to read that post first, as this notebook is self-contained.
Media Effect Estimation with Orbit's KTR Model
/orbit_mmm/
Fri, 04 Feb 2022 00:00:00 +0000
In this notebook we want to experiment with the new KTR model included in the new orbit release (1.1). In particular, we are interested in its applications to media effects estimation in the context of media mix modeling. This is one of the applications of the KTR model by Uber’s team, see the corresponding paper Ng, Edwin, et al. “Bayesian Time Varying Coefficient Model with Applications to Marketing Mix Modeling”.
Unobserved Components Model as a Bayesian Model with PyMC
/uc_pymc/
Fri, 10 Dec 2021 00:00:00 +0000
In this notebook I want to deep-dive into the idea of wrapping a statsmodels UnobservedComponents model as a Bayesian model with PyMC described in the (great!) post Fast Bayesian estimation of SARIMAX models. This is a nice excuse to get into some internals of how PyMC works. I hope this can serve as a complement to the original post mentioned above. This post has two parts: In the first one we fit an UnobservedComponents model to a simulated time series.
ISLR2 - Survival Analysis Lab (lifelines)
/islr2_survival_analysis/
Wed, 01 Sep 2021 00:00:00 +0000
In this notebook we provide a Python implementation of the lab from Chapter 11 (Survival Analysis) of the book An Introduction to Statistical Learning (Second Edition). You can find a free pdf version of the book here. We will use the lifelines Python package, which you can find in this repository. There is a nice introduction to survival analysis in the documentation. There are also many concrete examples and guidelines to use the package.
Exploring Tools for Interpretable Machine Learning
/interpretable_ml/
Thu, 01 Jul 2021 00:00:00 +0000
In this notebook we want to test various ways of getting a better understanding of how non-trivial machine learning models generate predictions and how features interact with each other. This is in general not straightforward, and key components are (1) understanding of the input data and (2) domain knowledge of the problem. Two great references on the subject are:
Interpretable Machine Learning, A Guide for Making Black Box Models Explainable by Christoph Molnar
Interpretable Machine Learning with Python by Serg Masís
Note that the methods discussed in this notebook are not related to causality.
Feature Engineering: patsy as FormulaTransformer
/formula_transformer/
Sat, 01 May 2021 00:00:00 +0000
In this notebook I want to describe how to create features inside scikit-learn pipelines using patsy-like formulas. I have used this approach to generate features in a previous post: GLM in PyMC3: Out-Of-Sample Predictions, so I will consider the same data set here for the sake of comparison.
Remark: Very recently (2021-09-01) I discovered there is an implementation of this transformer in scikit-lego, see PatsyTransformer. In addition, please refer to the great tutorial on patsy in calmcode.
GLM in PyMC3: Out-Of-Sample Predictions
/glm_pymc3/
Mon, 04 Jan 2021 00:00:00 +0000
In this notebook I explore the glm module of PyMC3. I am particularly interested in the model definition using patsy formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we need to make predictions on out-of-sample data. This experiment was motivated by the discussion in the thread “Out of sample” predictions with the GLM sub-module on the (great!
Gaussian Processes for Time Series Forecasting with PyMC3
/gp_ts_pymc3/
Sat, 02 Jan 2021 00:00:00 +0000
In this notebook we translate the forecasting models developed for the post on Gaussian Processes for Time Series Forecasting with Scikit-Learn to the probabilistic Bayesian framework PyMC3. I strongly recommend looking into the following references for more details and examples:
References:
An Introduction to Gaussian Process Regression
PyMC3 Docs: Gaussian Processes
PyMC3 Docs Example: CO2 at Mauna Loa
Bayesian Analysis with Python (Second edition) - Chapter 7
Statistical Rethinking - Chapter 14
Prepare Notebook
import numpy as np
import pandas as pd
import matplotlib.
Simple Bayesian Linear Regression with TensorFlow Probability
/tfp_lm/
Tue, 06 Oct 2020 00:00:00 +0000
In this post we show how to fit a simple linear regression model using TensorFlow Probability by replicating the first example on the getting started guide for PyMC3. We are going to use Auto-Batched Joint Distributions as they simplify the model specification considerably. Moreover, there is a great resource to get deeper into this type of distribution: Auto-Batched Joint Distributions: A Gentle Tutorial, which I strongly recommend (see this post to get a brief introduction on TensorFlow probability distributions).
Open Data: Berlin Kitas
/kitas_berlin/
Sat, 19 Sep 2020 00:00:00 +0000
In this notebook I want to explore some data I found on the Berlin Open Data portal daten.berlin.de. The data source contains information about Kitas (Kindertagesstätte, i.e. kindergartens) in Berlin. This is a big topic as finding a spot in a Kita in Berlin is extremely difficult. We first provide an initial exploratory data analysis of the data set, then we merge it with population data to create some geo-location maps.
A Simple Hamiltonian Monte Carlo Example with TensorFlow Probability
/tfp_hcm/
Fri, 24 Jul 2020 00:00:00 +0000
In this post we want to revisit a simple Bayesian inference example worked out in this blog post. This time we want to use TensorFlow Probability (TFP) instead of PyMC3.
References:
Statistical Rethinking is an amazing reference for Bayesian analysis. It also has a sequence of online lectures freely available on YouTube.
An introduction to probabilistic programming, now available in TensorFlow Probability
There are many examples in TensorFlow’s GitHub repository.
Regression Analysis & Visualization
/lm_viz/
Fri, 26 Jun 2020 00:00:00 +0000
In this notebook I want to collect some useful visualizations which can help model development and model evaluation in the context of regression analysis. I use many visualization resources not only to share results but as a key component of my workflow: data QA, EDA, feature engineering, model development, model evaluation and communicating results. In this notebook I focus on a simple regression model (time series) with statsmodels and visualization with matplotlib and seaborn.
A Glimpse into TensorFlow Probability Distributions
/intro_tfd/
Tue, 16 Jun 2020 00:00:00 +0000
In this notebook we want to take a look at the distributions module of TensorFlow Probability. The aim is to understand the fundamentals and then explore this probabilistic programming framework further. Here you can find an overview of TensorFlow Probability. We will concentrate on the first part of Layer 1: Statistical Building Blocks. As you can see from the distributions module documentation, there are many classes of distributions. We will explore a small sample of them in order to get a general overview.
Disease Spread Simulation (Animation)
/infection_sim/
Tue, 28 Apr 2020 00:00:00 +0000
We describe how to generate a basic disease spread simulation. We explore how to do animations in Matplotlib.
Getting Started with Spectral Clustering
/spectral_clustering/
Sat, 04 Apr 2020 00:00:00 +0000
In this post I want to explore the ideas behind spectral clustering. I do not intend to develop the theory. Instead, I will unravel a practical example to illustrate and motivate the intuition behind each step of the spectral clustering algorithm. I particularly recommend two references:
For an introduction/overview on the theory, see the lecture notes A Tutorial on Spectral Clustering by Prof. Dr. Ulrike von Luxburg.
For a concrete application of this clustering method you can see the PyData talk: Extracting relevant Metrics with Spectral Clustering by Dr.
The Volume of the d-Ball via Monte Carlo Simulation
/vol_d_ball/
Mon, 24 Feb 2020 00:00:00 +0000
In this notebook we run Monte Carlo simulations to estimate the volume of the \(d\)-ball \[ B^{d}:=\{x \in \mathbb{R}^d : ||x|| \leq 1\}. \] There are many ways to obtain a closed formula for this volume; see for example this Wikipedia article. Here we do it via sampling just for fun!
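The sampling idea can be sketched in a few lines of NumPy. This is a minimal illustration and not the notebook's own code: the function names `ball_volume_mc` and `ball_volume_exact`, the sample size and the seed are assumptions made for the example. Points are drawn uniformly from the cube \([-1, 1]^d\), whose volume is \(2^d\), and the fraction of points with norm at most one is rescaled by that volume. The closed formula \(\pi^{d/2}/\Gamma(d/2+1)\) is used only as a sanity check.

```python
import math

import numpy as np


def ball_volume_mc(d, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the volume of the unit d-ball (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Uniform samples from the cube [-1, 1]^d, which contains the unit ball.
    points = rng.uniform(-1.0, 1.0, size=(n_samples, d))
    # Fraction of samples whose Euclidean norm is at most one.
    inside = np.linalg.norm(points, axis=1) <= 1.0
    # Rescale by the volume of the cube, 2^d.
    return 2.0**d * inside.mean()


def ball_volume_exact(d):
    """Closed formula pi^(d/2) / Gamma(d/2 + 1), used as a reference value."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


for d in (2, 3, 5):
    print(d, ball_volume_mc(d), ball_volume_exact(d))
```

Note that the fraction of the cube covered by the ball shrinks quickly as \(d\) grows, so a fixed number of samples gives a noisier estimate in higher dimensions.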
Main Idea
Consider a hypercube \(A^{d}\subset \mathbb{R}^d\) centered at the origin with side length \(2\). We estimate the volume of the \(d\)-ball \(B^{d}:=\{x \in \mathbb{R}^d : ||x|| \leq 1\}\subset A^{d}\) by sampling uniformly from \(A^{d}\) and computing the proportion of vectors having length less than or equal to one.
Forecasting Weekly Data with Prophet
/fb_prophet/
Fri, 21 Feb 2020 00:00:00 +0000
In this notebook we present an initial exploration of the Prophet package by Facebook. From the documentation:
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Exploring TensorFlow Probability STS Forecasting
/intro_sts_tfp/
Tue, 11 Feb 2020 00:00:00 +0000
In this notebook we explore the Structural Time Series (STS) Module of TensorFlow Probability. We follow closely the use cases presented in their Medium blog. As described there: An STS model expresses an observed time series as the sum of simpler components:
\[ f(t) = \sum_{k=1}^{N}f_{k}(t) + \varepsilon, \quad \text{where}\quad \varepsilon \sim N(0, \sigma^2). \]
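To make the decomposition concrete, here is a small NumPy simulation of a series built as a sum of components plus Gaussian noise. It is only a sketch of the formula above and does not use the TensorFlow Probability STS API; the helper name `simulate_sts`, the choice of a linear trend and a weekly sine component, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def simulate_sts(n_steps=365, sigma=0.5, seed=42):
    """Simulate f(t) = sum_k f_k(t) + eps with eps ~ N(0, sigma^2) (illustrative)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    # Hypothetical components f_k(t): a linear trend and a weekly seasonality.
    components = {
        "trend": 0.05 * t,
        "weekly": 2.0 * np.sin(2 * np.pi * t / 7),
    }
    # Observation noise epsilon.
    noise = rng.normal(loc=0.0, scale=sigma, size=n_steps)
    # The observed series is the sum of all components plus the noise.
    y = sum(components.values()) + noise
    return t, components, noise, y


t, components, noise, y = simulate_sts()
print(y[:5])
```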
Each summand \(f_{k}(t)\) has a particular structure, e.g. specific seasonality, trend, autoregressive terms, etc.
Intro ML in Production: Flask, Docker and GitHub Actions
/ml_prod_intro/
Tue, 28 Jan 2020 00:00:00 +0000
We describe how to set up a toy-model repository to train and dockerize a machine learning model with data stored on AWS S3.
Drawing Manifolds in LaTeX with TikZ
/manifold_fig_latex/
Fri, 10 Jan 2020 00:00:00 +0000
We give some LaTeX code to create figures of manifolds with boundaries.
Open Data: Germany Maps Viz
/germany_plots/
Tue, 07 Jan 2020 00:00:00 +0000
In this post I want to show how to use publicly available (open) data to create geo visualizations in Python. Maps are a great way to communicate and compare information when working with geolocation data. There are many frameworks to plot maps; here I focus on matplotlib and geopandas (and give a glimpse of mplleaflet).
Reference: A very good introduction to matplotlib is the chapter on Visualization with Matplotlib from the Python Data Science Handbook by Jake VanderPlas.
The Graph Laplacian & Semi-Supervised Clustering
/semi_supervised_clustering/
Thu, 05 Dec 2019 00:00:00 +0000
In this post we want to explore the semi-supervised algorithm presented by Eldad Haber at the BMS Summer School 2019: Mathematics of Deep Learning, held 19 - 30 August 2019 at the Zuse Institute Berlin. He developed an implementation in Matlab which you can find in this GitHub repository. In addition, please find the corresponding slides here.
Prepare Notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.
The Laplacian on the 2-Torus
/laplacian_2torus/
Sun, 13 Oct 2019 00:00:00 +0000
In this blog post I want to describe the explicit computation of the Laplacian on differential forms on the \(2\)-Torus \(T^2\subset \mathbb{R}^3\). This surface can be obtained by rotating the circle \((x-a)^2+z^2=r^2\) in the \(xz\)-plane around the \(z\)-axis (\(0<r<a\)). Locally, this surface can be parametrized by the equations \[ x = (a+r\cos u)\cos v,\\ y = (a+r\cos u)\sin v,\\ z = r\sin u, \]
where \(0<u,v<2\pi\).
PyData Berlin 2019: Gaussian Processes for Time Series Forecasting (scikit-learn)
/gaussian_process_time_series/
Thu, 10 Oct 2019 00:00:00 +0000
In this notebook we run some experiments to demonstrate how we can use Gaussian Processes in the context of time series forecasting with scikit-learn. This material is part of a talk on Gaussian Process for Time Series Analysis presented at the PyCon DE & PyData 2019 Conference in Berlin.
Update: Additional material and plots were included for the Second Symposium on Machine Learning and Dynamical Systems at The Fields Institute (virtual event).
satRday Berlin 2019: Remedies for Severe Class Imbalance
/class_imbalance/
Sat, 15 Jun 2019 00:00:00 +0000
In this post I present a concrete case study illustrating some techniques to improve model performance in class-imbalanced classification problems. The methodologies described here are based on Chapter 16: Remedies for Severe Class Imbalance of the (great!) book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. I absolutely recommend this reference to anyone interested in predictive modeling.
This notebook should serve as an extension of my talk given at satRday Berlin 2019: A conference for R users in Berlin.
Seasonal Bump Functions
/bump_func/
Thu, 11 Apr 2019 00:00:00 +0000
Motivated by the nice talk on Winning with Simple, even Linear, Models by Vincent D. Warmerdam, I briefly describe how to construct a certain class of bump functions to encode seasonal variables in R.
Prepare Notebook
library(glue)
library(lubridate)
library(magrittr)
library(tidyverse)
Generate Data
Let us generate a time sequence variable stored in a tibble.
# Define time sequence.
t <- seq.Date(from = as.Date("2017-07-01"), to = as.Date("2019-04-01"), by = "day")
# Store it in a tibble.
An Introduction to Gaussian Process Regression
/gaussian_process_reg/
Mon, 08 Apr 2019 00:00:00 +0000
Updated Version: 2019/09/21 (Extension + Minor Corrections)
After a sequence of preliminary posts (Sampling from a Multivariate Normal Distribution and Regularized Bayesian Regression as a Gaussian Process), I want to explore a concrete example of a Gaussian process regression. We continue following Gaussian Processes for Machine Learning, Ch 2.
Other recommended references are:
Gaussian Processes for Timeseries Modeling by S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson & S.
Bayesian Regression as a Gaussian Process
/reg_bayesian_regression/
Mon, 01 Apr 2019 00:00:00 +0000
In this post we study the Bayesian Regression model to explore and compare the weight-space and function-space views of Gaussian Process Regression as described in the book Gaussian Processes for Machine Learning, Ch 2. We follow this reference very closely (and encourage you to read it!). Our main objective is to illustrate the concepts and results through a concrete example. We use PyMC3 to run Bayesian sampling.
References:
Sampling from a Multivariate Normal Distribution
/multivariate_normal/
Sat, 23 Mar 2019 00:00:00 +0000
In this post I want to describe how to sample from a multivariate normal distribution following section A.2 Gaussian Identities of the book Gaussian Processes for Machine Learning. This is a first step towards exploring and understanding Gaussian process methods in machine learning.
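One common way to do this, sketched below under stated assumptions, is to factor the covariance matrix as \(K = LL^{T}\) with a Cholesky decomposition and transform standard normal draws \(u\) via \(x = m + Lu\). The helper name `sample_mvn`, the example mean and covariance, and the seed are hypothetical choices for illustration and are not taken from the post.

```python
import numpy as np


def sample_mvn(mean, cov, n_samples, seed=0):
    """Draw samples from N(mean, cov) via a Cholesky factor (illustrative helper)."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Lower-triangular factor L with cov = L @ L.T (cov must be positive definite).
    chol = np.linalg.cholesky(cov)
    # Independent standard normal draws, one row per sample.
    u = rng.standard_normal(size=(n_samples, mean.size))
    # Each row x = mean + L u has covariance L L^T = cov.
    return mean + u @ chol.T


mean = [1.0, -2.0]
cov = [[2.0, 0.8], [0.8, 1.0]]
samples = sample_mvn(mean, cov, n_samples=100_000)
print(samples.mean(axis=0))
print(np.cov(samples, rowvar=False))
```

The empirical mean and covariance of the draws should be close to the inputs, which is a quick way to check the construction.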
Multivariate Normal Distribution
Recall that a random vector \(X = (X_1, \ldots, X_d)\) has a multivariate normal (or Gaussian) distribution if every linear combination
\[ \sum_{i=1}^{d} a_iX_i, \quad a_i\in\mathbb{R} \] is normally distributed.
Dockerize a ShinyApp
/dockerize-a-shinyapp/
Sat, 02 Mar 2019 00:00:00 +0000
In this post I want to describe how to dockerize a simple Shiny App. Docker is a great way of sharing and deploying projects. You can download it here.
Resources:
R Docker tutorial, recommended for Docker beginners.
Running a shiny app in a docker container by Mark Sellors (which is an updated and more complete version of this post).
Assume you have a project folder structure as follows:
The Spectral Theorem for Matrices
/the-spectral-theorem-for-matrices/
Sat, 02 Feb 2019 00:00:00 +0000
When working in data analysis it is almost impossible to avoid using linear algebra, even if it is in the background, e.g. simple linear regression. In this post I want to discuss one of the most important theorems of finite-dimensional vector spaces: the spectral theorem. The objective is not to give a complete and rigorous treatment of the subject, but rather to show the main ingredients, some examples and applications.
Movie Plots Text Generation with Keras
/movie_plot_text_gen/
Sun, 13 Jan 2019 00:00:00 +0000
In this post I show some text generation experiments I ran using LSTM with Keras. For the preprocessing and tokenization I used SpaCy. The aim is not to present a completed project, but rather a first step which should then be iterated on.
Resources
There are many great resources and blog posts about the subject (and similar experiments). Here I mention the ones I found particularly useful for the general theory:
Exploring the Curse of Dimensionality - Part II.
/exploring-the-curse-of-dimensionality-part-ii./
Tue, 01 Jan 2019 00:00:00 +0000
I continue exploring the curse of dimensionality. Following the analysis from Part I., I want to discuss another consequence of sparse sampling in high dimensions: sample points are close to an edge of the sample. This post is based on The Elements of Statistical Learning, Section 2.5, which I encourage you to read!
Uniform Sampling
Consider \(N\) data points uniformly distributed in a \(p\)-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin.
Text Mining, Networks and Visualization: Plebiscito Tweets
/text-mining-networks-and-visualization-plebiscito-tweets/
Thu, 20 Dec 2018 00:00:00 +0000
Nowadays social media generates a vast amount of raw data (text, images, videos, etc). It is a very interesting challenge to discover techniques to get insights into the content and development of social media data. In addition, as a fundamental component of the analysis, it is important to find ways of communicating the results, i.e. data visualization. In this post I want to present a small case study where I analyze Twitter text data.
Exploring the Curse of Dimensionality - Part I.
/exploring-the-curse-of-dimensionality-part-i./
Sun, 09 Dec 2018 00:00:00 +0000
In this post I want to present the notion of curse of dimensionality following a suggested exercise (Chapter 4 - Ex. 4) of the book An Introduction to Statistical Learning, written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
When the number of features \(p\) is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made.
From Pelican to Blogdown
/pelican_to_blogdown/
Sun, 02 Dec 2018 00:00:00 +0000
Here I want to discuss my transition from Pelican to Blogdown and present some personal learnings. In June 2017 I decided to build a personal website/portfolio. I chose Pelican, because:
It is written in Python, which was the programming language I was mainly working with.
I wanted to include some Jupyter notebooks I had already written.
A great post, Building a data science portfolio: Making a data science blog, explains the procedure and uses GitHub Pages to publish it.
\(S^1\)-Equivariant Dirac operators on the Hopf Fibration
/hopf_fibration/
Sun, 11 Nov 2018 00:00:00 +0000
In this expository article I discuss the definition and basic properties of the Hopf fibration, with particular emphasis on Dirac-type operators induced, in the sense of Brüning and Heintze, by the Hodge-de Rham and spin-Dirac operators. In addition, we compute the Dirac-Schrödinger type operator introduced in my PhD thesis.
Introduction to R Plumber: Expose a Caret model to a web API
/intro_plumber/
Fri, 12 Oct 2018 00:00:00 +0000
In this post we present a simple example of how to expose a prediction model to a web API using the Plumber package.
Circle Radius Fit for a Cloud of Points
/circle-radius-fit-for-a-cloud-of-points/
Sun, 09 Sep 2018 00:00:00 +0000
We explore how to include an R notebook into a Pelican post. As an example, we describe how to fit a circle onto a cloud of points.
From Bachelor to PhD: Geometric and Topological Methods for Quantum Field Theory
/vdl_experience/
Thu, 02 Aug 2018 00:00:00 +0000
We give an introduction to PyMC3, a probabilistic programming framework written in Python. We revise the basic mathematical theory and present two concrete examples.
PyData Berlin 2018: On Laplacian Eigenmaps for Dimensionality Reduction
/laplacian_eigenmaps_dim_red/
Sun, 08 Jul 2018 00:00:00 +0000
This post contains the slides and material from a talk I gave at PyData Berlin 2018. I presented the paper Laplacian Eigenmaps for Dimensionality Reduction and Data Representation by Mikhail Belkin (http://web.cse.ohio-state.edu/~belkin.8/) and Partha Niyogi (http://people.cs.uchicago.edu/~niyogi/).
Probability that a given observation is part of a bootstrap sample?
/bootstrap/
Wed, 29 Nov 2017 00:00:00 +0000
We study the problem of computing the probability that a given observation is part of a bootstrap sample. We include some numerical simulations.
Induced Dirac-Schrödinger operators on semi-free circle quotients
/phd/
Sat, 11 Nov 2017 00:00:00 +0000
I present the content of my PhD Thesis in mathematics, which has now been published in The Journal of Geometric Analysis.
Introduction to Bayesian Modeling with PyMC3
/intro_pymc3/
Sun, 13 Aug 2017 00:00:00 +0000
We give an introduction to PyMC3, a probabilistic programming framework written in Python. We revise the basic mathematical theory and present two concrete examples.
Web scraping with Beautiful Soup: Plebiscito Colombia (October 2nd)
/plebiscito/
Sun, 09 Jul 2017 00:00:00 +0000
We describe how to use Beautiful Soup to scrape the official government website in order to get the results of the peace referendum in Colombia.
The Dirac operator on the 2-sphere
/the-dirac-operator-on-the-2-sphere/
Thu, 29 Jun 2017 00:00:00 +0000
The objective of this post is to explore MathJax, a JavaScript display engine for LaTeX. Being my first post written with this tool, I want to present a short but fun example: I will give a description of the explicit computation of the spin-Dirac operator (of the unique complex spinor bundle!) on the 2-sphere \(S^2\) equipped with the standard round metric. A more detailed treatment can be found in my expository paper.
Python Exercise: Distance to Rectangle
/rectangle/
Wed, 28 Jun 2017 00:00:00 +0000
In this first post we get started with a small Python script to explore the basic capabilities of Pelican.
About
/about/
Mon, 01 Jan 0001 00:00:00 +0000
Dr. Juan Camilo Orduz
Mathematician & Data Scientist
I am a mathematician (PhD from Humboldt Universität zu Berlin) and Data Scientist (Wolt). On this website you can find more information about me and some of my projects.
You can find the code associated with the blog posts on this GitHub repository.
I am part of the team running the Berlin Time Series Analysis Meetup.
Curriculum Vitae
/cv/
Mon, 01 Jan 0001 00:00:00 +0000
Work Experience
01/10/2021-Present (Senior) Data Scientist, Wolt, Berlin, Germany.
Member of the marketing tech team, a cross functional product team. I am leading the data science projects from conceptualisation and modelling to deployment. Developing data science products in the following domains: marketing attribution, customer lifetime value, churn prediction and prevention, cohort revenue-retention matrix modeling, A/B testing, marketing efficiency measurement and optimization through geo-experimentation, causal inference methods and media mix models (via modern Bayesian techniques in PyMC).