Dr. Juan Camilo Orduz
/
Recent content on Dr. Juan Camilo Orduz
Hugo -- gohugo.io
en-us
Fri, 19 Jul 2024 00:00:00 +0000
Multilevel Elasticities for a Single SKU - Part III.
/multilevel_elasticities_single_sku_3/
Fri, 19 Jul 2024 00:00:00 +0000
In this notebook we continue our simulation study for elasticities, see:
Multilevel Elasticities for a Single SKU - Part I
Multilevel Elasticities for a Single SKU - Part II
for an introduction to the problem and some models. In this notebook we extend the covariance model to allow two covariance components on both the intercepts and slopes (coefficient of the median_income variable). Also, to abstract from a specific framework, we do the implementation in NumPyro.
Hierarchical Exponential Smoothing Model
/hierarchical_exponential_smoothing/
Fri, 07 Jun 2024 00:00:00 +0000
In this blog post, we experiment with a hierarchical exponential smoothing forecasting model, extending the ideas from the univariate case presented in the blog post “Notes on Exponential Smoothing with NumPyro”. We use NumPyro and compare the NUTS and SVI results. To this end, we use the Continuous Ranked Probability Score (CRPS). We also compare these forecasts with univariate statistical models like Holt-Winters, AutoETS and Seasonal Naive from the great Statsforecast package.
Demand Forecasting with Censored Likelihood
/demand/
Sun, 21 Apr 2024 00:00:00 +0000
In this notebook we will explore the use of censored likelihoods for demand forecasting.
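As a rough illustration of what a censored likelihood looks like, here is a minimal sketch in plain Python. It assumes (for illustration only, these are not details taken from the notebook) that daily demand is normally distributed with mean `mu` and standard deviation `sigma`, and that sales are right-censored at a known daily cap such as the available stock. The function names `normal_logpdf`, `normal_logsf` and `censored_loglik` are hypothetical. Uncensored days contribute the usual density, while censored days contribute the probability that demand was at least the cap.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_logsf(x, mu, sigma):
    """Log of P(X >= x) for X ~ Normal(mu, sigma), via the complementary error function."""
    z = (x - mu) / sigma
    return math.log(0.5 * math.erfc(z / math.sqrt(2.0)))


def censored_loglik(sales, stock, mu, sigma):
    """Right-censored log-likelihood (assumption: sales == stock means a stockout)."""
    total = 0.0
    for y, cap in zip(sales, stock):
        if y >= cap:
            # Censored day: we only know that the true demand was at least `cap`.
            total += normal_logsf(cap, mu, sigma)
        else:
            # Uncensored day: the observed sale equals the true demand.
            total += normal_logpdf(y, mu, sigma)
    return total


# Toy data (made up): two stockout days and two regular days.
sales = [10.0, 10.0, 8.0, 7.0]
stock = [10.0, 10.0, 10.0, 10.0]

# Compare two candidate demand means under the censored likelihood.
ll_low = censored_loglik(sales, stock, mu=8.0, sigma=2.0)
ll_high = censored_loglik(sales, stock, mu=10.0, sigma=2.0)
print(ll_low, ll_high)
```

In a Bayesian model the same log-likelihood would be combined with priors on `mu` and `sigma`; the sketch only shows how censored observations enter the likelihood differently from uncensored ones.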
Business Problem: Let us assume we have a store with a single product. We want to forecast the (true!) demand for this product for the next 30 days using historical data. The historical data consists of the daily sales of the product for the last year (approximately). An important challenge is that our historical sales data is censored.
A Conceptual and Practical Introduction to Hilbert Space GPs Approximation Methods
/hsgp_intro/
Thu, 18 Apr 2024 00:00:00 +0000
In this notebook, we explore the (conceptual) ideas and (practical) implementation details of the Hilbert Space approximation for Gaussian processes introduced in the article “Hilbert space methods for reduced-rank Gaussian process regression” by Arno Solin and Simo Särkkä. We do not go deep into the mathematical details (proofs) but focus on the core ideas to help us understand the main concepts guiding the technical implementation. We provide examples in both NumPyro and PyMC, so that users can learn from the cross-framework comparison and abstract the core ideas.
Bayesian Censoring Data Modeling
/censoring/
Mon, 26 Feb 2024 00:00:00 +0000
In this notebook, we explore how we can use Bayesian modeling to estimate the parameters of a censored data set. These datasets are common in many fields, including survival analysis and supply chain management. I was motivated to write this notebook after reading the excellent blog post “Modeling Anything With First Principles: Demand under extreme stockouts” by Kyle Caron where he uses these techniques to model and balance demand under extreme stockouts and other constraints.
Zero-Inflated TSB Model
/zi_tsb_numpyro/
Tue, 20 Feb 2024 00:00:00 +0000
After going through the fundamentals of the TSB Method for Intermittent Time Series Forecasting in NumPyro in the previous notebook, we explore a variation of it that might be useful for certain applications. In a nutshell, we keep the same model structure of the TSB model, but we modify the likelihood function to account for the sparsity of the time series. Concretely, we replace the classic Gaussian likelihood function with a zero-inflated Negative Binomial likelihood function.
TSB Method for Intermittent Time Series Forecasting in NumPyro
/tsb_numpyro/
Sat, 17 Feb 2024 00:00:00 +0000
In this notebook we provide a NumPyro implementation of the TSB (Teunter, Syntetos and Babai) method for forecasting intermittent time series. The TSB method is similar to Croston’s method in the sense that it constructs two different time series out of the original one and then forecasts each of them separately, so that the final forecast is generated by combining the forecasts of the two time series. The main difference between the two methods is that the TSB method uses the demand probability instead of the demand periods.
Croston's Method for Intermittent Time Series Forecasting in NumPyro
/croston_numpyro/
Thu, 15 Feb 2024 00:00:00 +0000
In this notebook, we will implement Croston’s method for intermittent demand forecasting using NumPyro. Croston’s method is a popular forecasting method for intermittent demand data, which is characterized by a large number of zero values. The method is based on the idea of separating the demand size and the demand interval, and then forecasting them separately using simple exponential smoothing. We can therefore build on the previous post Notes on Exponential Smoothing with NumPyro.
Notes on an ARMA(1, 1) Model with NumPyro
/arma_numpyro/
Tue, 13 Feb 2024 00:00:00 +0000
These are some notes on how to implement an ARMA(1, 1) model using NumPyro for time series forecasting. The ARMA(1, 1) model is given by
\[y_t = \mu + \phi y_{t-1} + \theta \varepsilon_{t-1} + \varepsilon_t\]
where \(y_t\) is the time series, \(\mu\) is the mean, \(\phi\) is the autoregressive parameter, \(\theta\) is the moving average parameter, and \(\varepsilon_t\) is a white noise process with mean zero and variance \(\sigma^2\).
Notes on Exponential Smoothing with NumPyro
/exponential_smoothing_numpyro/
Sun, 11 Feb 2024 00:00:00 +0000
This notebook serves as personal notes on NumPyro’s implementation of the classic exponential smoothing forecasting method. I use Example: Holt-Winters Exponential Smoothing. The strategy is to go into the nitty-gritty details of the code presented in the example from the documentation: “Example: Holt-Winters Exponential Smoothing”. In particular, I want to understand the auto-regressive components using the scan function, which always confuses me 😅. After reproducing the example from the documentation, we go a step further and extend the algorithm to include a damped trend.
Media Mix Model and Experimental Calibration: A Simulation Study
/mmm_roas/
Sun, 04 Feb 2024 00:00:00 +0000
In this notebook, we present a complete simulation study of the media mix model (MMM) and experimental calibration method presented in the paper “Media Mix Model Calibration With Bayesian Priors”, by Zhang, et al., where the authors propose a convenient parametrization of the regression model in terms of the ROAS (return on advertising spend) instead of the classical regression (beta) coefficients. The benefit of this parametrization is that it allows for using Bayesian priors on the ROAS, which typically come from previous experiments or domain knowledge.
Cohort Revenue Retention Analysis with Flax and NumPyro
/revenue_retention_numpyro/
Mon, 08 Jan 2024 00:00:00 +0000
In this notebook we present an alternative implementation of the cohort-revenue-retention model presented in the blog post Cohort Revenue & Retention Analysis: A Bayesian Approach where we show how to replace the BART retention component with a general neural network implemented with Flax. This allows faster inference, as we can use NumPyro’s NUTS sampler or any of the stochastic variational inference (SVI) algorithms available. We could even use a wider family of samplers using the newly released package Bayeux or the great BlackJax (see for example, the MLP Classifier Example).
Flax and NumPyro Toy Example
/flax_numpyro/
Fri, 05 Jan 2024 00:00:00 +0000
In this notebook I want to experiment with the numpyro/contrib/module.py module, which allows us to integrate Flax models with NumPyro models. I am interested in this because I want to experiment with complex bayesian models with larger datasets.
Most of the main components can be found in the great blog post Bayesian Neural Networks with Flax and Numpyro. The author takes a different path working directly with potentials, but he also points out the recent addition of the numpyro/contrib/module.
Time Series Modeling with HSGP: Baby Births Example
/birthdays/
Tue, 02 Jan 2024 00:00:00 +0000
In this notebook we want to reproduce a classical example of using Gaussian processes to model time series data: The birthdays data set. I first encountered this example in Chapter 21 of the seminal book Bayesian Data Analysis (Third edition) when learning about the subject. One thing I rapidly realized was that fitting these types of models in practice is very computationally expensive and sometimes almost infeasible for real industry applications where the data size is larger than all of these academic examples.
Non-Parametric Product Life Cycle Modeling
/iphone_trends/
Sun, 10 Dec 2023 00:00:00 +0000
In this notebook we present an example of how to use a combination of Bayesian hierarchical models and non-parametric methods, namely Bayesian additive regression trees (BART), to model product life cycles. This approach is motivated by previous work in cohort analysis, see here.
As a case study we use the Google search index (trends) data for iPhones worldwide. We use the data of four different models to predict the development of the latest iPhone.
NumPyro with Pathfinder
/numpyro_pathfinder/
Mon, 04 Dec 2023 00:00:00 +0000
In this notebook we describe how to use blackjax’s pathfinder implementation to do inference with a numpyro model.
I am simply putting some pieces together from the following resources (strongly recommended to read):
References:
Blackjax docs: Use with Numpyro models
Blackjax Sampling Book: Pathfinder
Numpyro Issue #1485
PyMC Experimental - Pathfinder
Pathfinder: Parallel quasi-Newton variational inference
What and Why Pathfinder?
From the paper’s abstract:
What? We propose Pathfinder, a variational method for approximately sampling from differentiable log densities.
Causal Bandits: Causality, Marketing & Simulations
/causal_bandits/
Sun, 19 Nov 2023 00:00:00 +0000
I had the great opportunity to have a conversation with Aleksander Molak about causality, marketing and simulations as well as my career from academia to industry. Check it out!
Multilevel Elasticities for a Single SKU - Part II.
/multilevel_elasticities_single_sku_2/
Mon, 13 Nov 2023 00:00:00 +0000
In this notebook we go deeper into the last covariance model presented in the previous blog post Multilevel Elasticities for a Single SKU. In particular we describe how to generate posterior predictive samples for a region unseen by the model. This can be useful for scenario planning: one can simulate outcome quantities from price ranges through the elasticity estimates (with uncertainty!)
We strongly recommend reading the previous blog post before reading this one as we will skip the EDA and baseline model comparison parts.
Multilevel Elasticities for a Single SKU - Part I.
/multilevel_elasticities_single_sku/
Thu, 10 Aug 2023 00:00:00 +0000
In this notebook I want to experiment with some basic models for price elasticity estimation in the simple context of a single SKU across multiple stores and regions. The motivation is to have a concrete example of the elasticity models presented in Chapter 11: Big Data Pricing Models of the book Pricing Analytics by Walter R. Paczkowski.
Elasticity Definition
Here I provide a very succinct definition of elasticity (there is a vast literature on this topic, see the reference above).
Time-Varying Regression Coefficients via Hilbert Space Gaussian Process Approximation
/bikes_gp/
Wed, 05 Jul 2023 00:00:00 +0000
In this notebook we present an example of a regression model with time varying coefficients using Gaussian processes. In particular, we use a Hilbert space Gaussian process approximation in pymc to speed up the computations (see HSGP). We continue using the bikes dataset from the previous posts (Exploring Tools for Interpretable Machine Learning and Time-Varying Regression Coefficients via Gaussian Random Walk in PyMC). Please refer to those posts for more details on the dataset, EDA and base models.
Using Data Science for Bad Decision-Making: A Case Study
/causal_inference_example/
Fri, 16 Jun 2023 00:00:00 +0000
You will probably be intrigued by the title of this post. In this notebook I do not want to present a fancy data science trick or to test a novel technique. I would simply like to tell a story. A story about how data science can be used to make bad decisions. “How can this be?” you might ask. Everyone has been saying that data is the way to unlock insights to gain a competitive advantage.
Regression Discontinuity with GLMs and Kernel Weighting
/regression_glmdiscontinuity_glm/
Sat, 10 Jun 2023 00:00:00 +0000
In this notebook we explore regression discontinuity design using generalized linear models (GLMs) and kernel weighting from a bayesian perspective. The motivation comes from applications when:
The data does not fit the usual linear regression OLS normal likelihood (e.g. modeling count data).
The data size is limited.
In addition, we experiment with kernel weighting to weight the data points near the cutoff more heavily. This is a common technique in RD analysis, but it is not always clear how to do this with GLMs in the bayesian framework.
ATE Estimation for Count Data
/causal_inference_negative_binomial/
Wed, 07 Jun 2023 00:00:00 +0000
This notebook is a continuation of the previous notebook on ATE estimation for binary data with logistic regression based on the sequence of (great!) posts by Solomon Kurz. In this notebook, we will focus on count data. We reproduce in python an example presented in the post Causal inference with count regression by Solomon Kurz. Our intention is simply to show how to port these types of models to bambi.
ATE Estimation with Logistic Regression
/causal_inference_logistic/
Sun, 04 Jun 2023 00:00:00 +0000
In this notebook, I want to reproduce some components of the extensive blog post Causal inference with Bayesian models by Solomon Kurz. Specifically, I want to deep dive into the logistic regression model used to estimate the average treatment effect (ATE) of the study Internet-accessed sexually transmitted infection (e-STI) testing and results service: A randomised, single-blind, controlled trial by Wilson et al. I can only recommend reading the original sequence of posts Solomon has written on causal inference.
Bayesian Methods in Modern Marketing Analytics Webinar with PyMC Labs
/marketing_bayes_webinar/
Wed, 31 May 2023 00:00:00 +0000
Here I want to share the recording and slides of the webinar Bayesian Methods in Modern Marketing Analytics in collaboration with PyMC Labs.
Abstract: During the webinar, we will discuss some of the most crucial topics in marketing analytics: media spend optimization through media mix models and experimentation, and customer lifetime value estimation. We will approach these topics from a Bayesian perspective, as it gives us great tools to have better models and more actionable insights.
How to vectorize an scikit-learn transformer over a numpy array?
/vectorize_sklearn_transformer/
Mon, 01 May 2023 00:00:00 +0000
In this short post, I show how to vectorize a scikit-learn transformer over a numpy array. That is, how to apply a transformer along a specific axis of a numpy array. I have found this to be particularly useful when working with output sample posterior distributions from a bayesian model where I want to apply a transformer to each sample. This is not particularly difficult, but I always forget how to do it, so I thought I would write it down once and for all 😄.
Counting the Number of Kitas per PLZ in Berlin using a Hierarchical Bayesian Model
/kitas-hierarchical/
Fri, 28 Apr 2023 00:00:00 +0000
This notebook is the continuation of the data gathering and data analysis post Open Data: Berlin Kitas. In this second part we use the data gathered to model the number of Kitas per PLZ in Berlin using a hierarchical bayesian model. The hierarchy is defined by the Berlin districts. The objective is to develop a sound basic model which can be enhanced in the future with a richer data set.
Simple Hierarchical Model with NumPyro: Cookie Chips Example
/cookies_example_numpyro/
Mon, 24 Apr 2023 00:00:00 +0000
This notebook presents a simple example of a hierarchical model using NumPyro. The example is based on the cookie chips example presented in the post Introduction to Bayesian Modeling with PyMC3. There are many great resources on bayesian hierarchical models and probabilistic programming with NumPyro. This notebook aims to provide a succinct simple example to get started.
Remark: Well, the real reason is that I want to get acquainted with other probabilistic programming libraries in order to abstract the core principles of probabilistic programming.
Experimentation, Non-Compliance and Instrumental Variables with PyMC
/iv_pymc/
Mon, 20 Feb 2023 00:00:00 +0000
In this notebook we present an example of how to use PyMC to estimate the effect of a treatment in an experiment where there is non-compliance through the use of instrumental variables.
By non-compliance we mean that the treatment assignment does not guarantee that the treatment is actually received by the treated. The main challenge is that we cannot simply estimate the treatment effect as a difference in means since the non-compliance mechanism is most of the time not at random and may introduce confounders.
Cohort Revenue & Retention Analysis: A Bayesian Approach
/revenue_retention/
Mon, 23 Jan 2023 00:00:00 +0000
In this notebook we extend the cohort retention model presented in the post Cohort Retention Analysis with BART so that we model retention and revenue per cohort simultaneously (we recommend reading the referenced post before this one). The idea is to keep modeling the retention using a Bayesian Additive Regression Tree (BART) model (see pymc-bart) and linearly model the revenue per cohort using a Gamma distribution. We couple the retention and revenue components in a similar way as presented in the notebook Introduction to Bayesian A/B Testing.
Cohort Retention Analysis with BART
/retention_bart/
Mon, 02 Jan 2023 00:00:00 +0000
In this notebook we study an alternative approach for the cohort analysis problem presented in A Simple Cohort Retention Analysis in PyMC. Instead of using a linear model to estimate the retention rate, we use a Bayesian Additive Regression Tree (BART) model (see pymc-bart). The BART model is a flexible non-parametric model that can be used to model complex relationships between the response and the predictors.
Prepare Notebook
import arviz as az
import matplotlib.
A Simple Cohort Retention Analysis in PyMC
/retention/
Tue, 20 Dec 2022 00:00:00 +0000
In this notebook we present a simple approach to study cohort retention analysis through a simulated data set. The aim is to understand how retention rates change over time and provide a simple model to predict them (with uncertainty estimates!). We do not expect this technique to be a silver bullet for all retention problems, but rather a simple approach to get started with the problem.
Remark: A motivation for this notebook was the great post Bayesian Age/Period/Cohort Models in Python with PyMC by Austin Rochford.
Geo-Experimentation via Time Based Regression in PyMC
/time_based_regression_pymc/
Thu, 01 Dec 2022 00:00:00 +0000
Introduction
In this notebook I describe and present an implementation of the time based regression (TBR) approach to marketing campaign analysis in the context of geo experimentation presented in the paper Estimating Ad Effectiveness using Geo Experiments in a Time-Based Regression Framework by Jouni Kerman, Peng Wang and Jon Vaver (Google, Inc. 2017). I strongly recommend reading the paper as it is quite clear in the exposition of the approach and presents some simulation results.
Offline Campaign Analysis Measurement: A journey through causal impact, geo-experimentation and synthetic control
/wolt_ds_meetup/
Tue, 25 Oct 2022 00:00:00 +0000
In October 2022 I had the opportunity to give a talk at the Helsinki Data Science Meetup hosted by Wolt. Here I want to share the recording of my talk.
Abstract: The talk will show how to measure offline campaigns using causal inference techniques. In particular it’ll focus on tapping into the potential of synthetic control, geo-experiments via time-based regression, and Google’s Causal-Impact Method.
Code to generate data
You can find the raw data here and the code here.
Scikit-Learn Example in PyMC: Gaussian Process Classifier
/sklearn_pymc_classifier/
Sat, 24 Sep 2022 00:00:00 +0000
In this notebook we want to describe how to port a couple of classification examples from scikit-learn’s documentation (classifier comparison) to PyMC. We focus on the classical moons synthetic dataset.
Prepare Notebook
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc.sampling_jax
import seaborn as sns
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
plt.
Synthetic Control in PyMC
/synthetic_control_pymc/
Tue, 09 Aug 2022 00:00:00 +0000
Synthetic control can be considered “the most important innovation in the policy evaluation literature in the last few years” (see The State of Applied Econometrics: Causality and Policy Evaluation by Susan Athey and Guido W. Imbens).
In this notebook we provide an example of how to implement a synthetic control problem in PyMC to answer a “what if this had happened?” type of question in the context of causal inference.
Modeling Short Time Series with Prior Knowledge in PyMC
/short_time_series_pymc/
Tue, 19 Jul 2022 00:00:00 +0000
In this notebook I want to reproduce in PyMC the methodology described in the amazing blog post Modeling Short Time Series with Prior Knowledge by Tim Radtke to forecast short time series using bayesian transfer learning 🚀. The main idea is to transfer information (e.g. long term seasonality) from a long time series to a short time series via prior distributions. Tim’s blog post treats a very concrete example where all the concepts become very concrete.
Time-Varying Regression Coefficients via Gaussian Random Walk in PyMC
/bikes_pymc/
Sun, 03 Jul 2022 00:00:00 +0000
In this notebook we want to illustrate how to use PyMC to fit a time-varying coefficient regression model. The motivation comes from the post Exploring Tools for Interpretable Machine Learning where we studied a time series problem, regarding the prediction of the number of bike rentals, from a machine learning perspective. Concretely, we fitted and compared two machine learning models: a linear regression with interactions and a gradient boost model (XGBoost).
Data Talks Club: Machine Learning in Marketing
/machine_learning_marketing/
Tue, 17 May 2022 00:00:00 +0000
On Friday 13th of May 2022 I was invited to join Alexey Grigorev in an event organised by DataTalks.Club to talk about Machine Learning in Marketing. It was a really insightful discussion and I would like to thank the organizers who made it possible. Here is the recording:
Here are some useful links and resources about the subject:
Relevant blog posts I have written about the subject:
PyConDE & PyData Berlin 2022: Introduction to Uplift Modeling
/uplift/
Mon, 11 Apr 2022 00:00:00 +0000
In this notebook we present a simple example of uplift modeling estimation via meta-models using causalml and scikit-uplift. For a more detailed introduction to uplift modeling, see:
Diemert, Eustache, et al. (2020) “A Large Scale Benchmark for Uplift Modeling”
Gutierrez, P., & Gérardy, J. Y. (2017). “Causal Inference and Uplift Modelling: A Review of the Literature”
Karlsson, H. (2019) “Uplift Modeling: Identifying Optimal Treatment Group Allocation and Whom to Contact to Maximize Return on Investment”
Gamma-Gamma Model of Monetary Value in PyMC
/gamma_gamma_pymc/
Tue, 29 Mar 2022 00:00:00 +0000
In this notebook we describe how to fit Fader’s and Hardie’s gamma-gamma model presented in the paper “RFM and CLV: Using Iso-value Curves for Customer Base Analysis” and the note “The Gamma-Gamma Model of Monetary Value”. The approach is very similar to the one presented in the previous post BG/NBD Model in PyMC where we simply ported the log-likelihood of the lifetimes package from numpy to pytensor.
Prepare Notebook
import arviz as az
import matplotlib.
BG/NBD Model in PyMC
/bg_nbd_pymc/
Thu, 03 Mar 2022 00:00:00 +0000
In this notebook we show how to port the BG/NBD model from the lifetimes package (developed mainly by Cameron Davidson-Pilon) to pymc. The BG/NBD model, introduced in the seminal paper “Counting Your Customers” the Easy Way: An Alternative to the Pareto/NBD Model by Peter S. Fader, Bruce G. S. Hardie and Ka Lok Lee in 2005, is used to
predict future purchasing patterns, which can then serve as an input into “lifetime value” calculations, in the “non-contractual” setting (i.
Media Effect Estimation with PyMC: Adstock, Saturation & Diminishing Returns
/pymc_mmm/
Fri, 11 Feb 2022 00:00:00 +0000
In this notebook we present a concrete example of estimating the media effects via bayesian methods, following the strategy outlined in Google’s paper Jin, Yuxue, et al. “Bayesian methods for media mix modeling with carryover and shape effects.” (2017). This example can be considered the continuation of the post Media Effect Estimation with Orbit’s KTR Model. However, it is not strictly necessary to read it beforehand as we make this notebook self-contained.
Media Effect Estimation with Orbit's KTR Model
/orbit_mmm/
Fri, 04 Feb 2022 00:00:00 +0000
In this notebook we want to experiment with the new KTR model included in the new orbit release (1.1). In particular, we are interested in its applications to media effects estimation in the context of media mix modeling. This is one of the applications for the KTR model by Uber’s team, see the corresponding paper Edwin, Ng, et al. “Bayesian Time Varying Coefficient Model with Applications to Marketing Mix Modeling”.
Unobserved Components Model as a Bayesian Model with PyMC
/uc_pymc/
Fri, 10 Dec 2021 00:00:00 +0000
In this notebook I want to deep-dive into the idea of wrapping a statsmodels UnobservedComponents model as a bayesian model with PyMC described in the (great!) post Fast Bayesian estimation of SARIMAX models. This is a nice excuse to get into some internals of how PyMC works. I hope this can serve as a complement to the original post mentioned above. This post has two parts: In the first one we fit an UnobservedComponents model to a simulated time series.
ISLR2 - Survival Analysis Lab (lifelines)
/islr2_survival_analysis/
Wed, 01 Sep 2021 00:00:00 +0000
In this notebook we provide a python implementation of the lab from the Survival Analysis - Chapter 11 of the second edition of the book An Introduction to Statistical Learning (Second Edition). You can find a free pdf version of the book here. We will use the lifelines python package, which you can find in this repository. There is a nice introduction to survival analysis in the documentation. There are also many concrete examples and guidelines to use the package.
Exploring Tools for Interpretable Machine Learning
/interpretable_ml/
Thu, 01 Jul 2021 00:00:00 +0000
In this notebook we want to test various ways of getting a better understanding of how non-trivial machine learning models generate predictions and how features interact with each other. This is in general not straightforward and key components are (1) understanding of the input data and (2) domain knowledge of the problem. Two great references on the subject are:
Interpretable Machine Learning, A Guide for Making Black Box Models Explainable by Christoph Molnar
Interpretable Machine Learning with Python by Serg Masís
Note that the methods discussed in this notebook are not related to causality.
Feature Engineering: patsy as FormulaTransformer
/formula_transformer/
Sat, 01 May 2021 00:00:00 +0000
In this notebook I want to describe how to create features inside scikit-learn pipelines using patsy-like formulas. I have used this approach to generate features in a previous post: GLM in PyMC3: Out-Of-Sample Predictions, so I will consider the same data set here for the sake of comparison.
Remark: Very recently (2021-09-01) I discovered there is an implementation of this transformer in scikit-lego, see PatsyTransformer. In addition, please refer to the great tutorial on patsy in calmcode.
GLM in PyMC3: Out-Of-Sample Predictions
/glm_pymc3/
Mon, 04 Jan 2021 00:00:00 +0000
In this notebook I explore the glm module of PyMC3. I am particularly interested in the model definition using patsy formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread “Out of sample” predictions with the GLM sub-module on the (great!
Gaussian Processes for Time Series Forecasting with PyMC3
/gp_ts_pymc3/
Sat, 02 Jan 2021 00:00:00 +0000
In this notebook we translate the forecasting models developed for the post on Gaussian Processes for Time Series Forecasting with Scikit-Learn to the probabilistic Bayesian framework PyMC3. I strongly recommend looking into the following references for more details and examples:
References:
An Introduction to Gaussian Process Regression
PyMC3 Docs: Gaussian Processes
PyMC3 Docs Example: CO2 at Mauna Loa
Bayesian Analysis with Python (Second edition) - Chapter 7
Statistical Rethinking - Chapter 14
Prepare Notebook
import numpy as np
import pandas as pd
import matplotlib.
Simple Bayesian Linear Regression with TensorFlow Probability
/tfp_lm/
Tue, 06 Oct 2020 00:00:00 +0000
In this post we show how to fit a simple linear regression model using TensorFlow Probability by replicating the first example on the getting started guide for PyMC3. We are going to use Auto-Batched Joint Distributions as they simplify the model specification considerably. Moreover, there is a great resource to get deeper into this type of distribution: Auto-Batched Joint Distributions: A Gentle Tutorial, which I strongly recommend (see this post to get a brief introduction on TensorFlow probability distributions).
Open Data: Berlin Kitas
/kitas_berlin/
Sat, 19 Sep 2020 00:00:00 +0000
In this notebook I want to explore some data I found on the Berlin Open Data portal daten.berlin.de. The data source contains information on Kitas (Kindertagesstätte, i.e. kindergartens) in Berlin. This is a big topic as finding a spot in a Kita in Berlin is extremely difficult. We first provide an initial exploratory data analysis of the data set, then we merge it with population data to create some geo-location maps.
A Simple Hamiltonian Monte Carlo Example with TensorFlow Probability
/tfp_hcm/
Fri, 24 Jul 2020 00:00:00 +0000/tfp_hcm/In this post we want to revisit a simple bayesian inference example worked out in this blog post. This time we want to use TensorFlow Probability (TFP) instead of PyMC3.
References:
Statistical Rethinking is an amazing reference for Bayesian analysis. It also has a sequence of online lectures freely available on YouTube.
An introduction to probabilistic programming, now available in TensorFlow Probability
There are many examples in the TensorFlow GitHub repository.Regression Analysis & Visualization
/lm_viz/
Fri, 26 Jun 2020 00:00:00 +0000/lm_viz/In this notebook I want to collect some useful visualizations which can help model development and model evaluation in the context of regression analysis. I use many visualization resources not only to share results but also as a key component of my workflow: data QA, EDA, feature engineering, model development, model evaluation and communicating results. In this notebook I focus on a simple regression model (time series) with statsmodels and visualization with matplotlib and seaborn.A Glimpse into TensorFlow Probability Distributions
/intro_tfd/
Tue, 16 Jun 2020 00:00:00 +0000/intro_tfd/In this notebook we take a look at the distributions module of TensorFlow Probability. The aim is to understand the fundamentals and then explore this probabilistic programming framework further. Here you can find an overview of TensorFlow Probability. We will concentrate on the first part of Layer 1: Statistical Building Blocks. As you can see from the distributions module documentation, there are many classes of distributions. We will explore a small sample of them in order to get a general overview.Disease Spread Simulation (Animation)
/infection_sim/
Tue, 28 Apr 2020 00:00:00 +0000/infection_sim/We describe how to generate a basic disease spread simulation. We explore how to do animations in Matplotlib.Getting Started with Spectral Clustering
/spectral_clustering/
Sat, 04 Apr 2020 00:00:00 +0000/spectral_clustering/In this post I want to explore the ideas behind spectral clustering. I do not intend to develop the theory. Instead, I will unravel a practical example to illustrate and motivate the intuition behind each step of the spectral clustering algorithm. I particularly recommend two references:
For an introduction/overview on the theory, see the lecture notes A Tutorial on Spectral Clustering by Prof. Dr. Ulrike von Luxburg.
For a concrete application of this clustering method you can see the PyData talk: Extracting relevant Metrics with Spectral Clustering by Dr.The Volume of the d-Ball via Monte Carlo Simulation
/vol_d_ball/
Mon, 24 Feb 2020 00:00:00 +0000/vol_d_ball/In this notebook we run Monte Carlo simulations to estimate the volume of the \(d\)-ball \[ B^{d}:=\{x \in \mathbb{R}^d : ||x|| \leq 1\}. \] There are many ways to obtain a closed formula for this volume, see for example this Wikipedia article. Here we do it via sampling just for fun!
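A minimal sketch of the sampling idea in plain Python: draw points uniformly from the cube \([-1, 1]^d\), which has volume \(2^d\), and scale that volume by the fraction of points with norm at most one. The function names, the sample size and the seed are illustrative assumptions and are not taken from the notebook; the closed formula \(\pi^{d/2}/\Gamma(d/2+1)\) is included only as a sanity check.

```python
import math
import random


def estimate_ball_volume(d, n_samples=100_000, seed=42):
    """Monte Carlo estimate of the volume of the unit d-ball."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        # Squared norm of a point drawn uniformly from the cube [-1, 1]^d.
        sq_norm = sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(d))
        if sq_norm <= 1.0:
            inside += 1
    # The cube has volume 2^d; scale it by the proportion of hits.
    return (2.0 ** d) * inside / n_samples


def exact_ball_volume(d):
    """Closed formula pi^(d/2) / Gamma(d/2 + 1), used only for comparison."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


for d in (1, 2, 3, 4):
    print(d, round(estimate_ball_volume(d), 3), round(exact_ball_volume(d), 3))
```

The proportion of hits shrinks quickly as \(d\) grows, so in higher dimensions many more samples are needed for a stable estimate.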
Main Idea Consider a hypercube \(A^{d}\subset \mathbb{R}^d\) centered at the origin with side length \(2\). We estimate the volume of the \(d\)-ball \(B^{d}:=\{x \in \mathbb{R}^d : ||x|| \leq 1\}\subset A^{d}\) by sampling uniformly from \(A^{d}\) and computing the proportion of vectors with length less than or equal to one, scaled by the volume \(2^d\) of \(A^{d}\).Forecasting Weekly Data with Prophet
/fb_prophet/
Fri, 21 Feb 2020 00:00:00 +0000/fb_prophet/In this notebook we present an initial exploration of the Prophet package by Facebook. From the documentation:
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.Exploring TensorFlow Probability STS Forecasting
/intro_sts_tfp/
Tue, 11 Feb 2020 00:00:00 +0000/intro_sts_tfp/In this notebook we explore the Structural Time Series (STS) Module of TensorFlow Probability. We follow closely the use cases presented in their Medium blog. As described there: An STS model expresses an observed time series as the sum of simpler components: \[ f(t) = \sum_{k=1}^{N}f_{k}(t) + \varepsilon, \quad \text{where}\quad \varepsilon \sim N(0, \sigma^2). \]
Each summand \(f_{k}(t)\) has a particular structure, e.g. specific seasonality, trend, autoregressive terms, etc.Intro ML in Production: Flask, Docker and GitHub Actions
/ml_prod_intro/
Tue, 28 Jan 2020 00:00:00 +0000/ml_prod_intro/We describe how to set up a toy-model repository to train and dockerize a machine learning model with data stored on AWS S3.Drawing Manifolds in LaTeX with TikZ
/manifold_fig_latex/
Fri, 10 Jan 2020 00:00:00 +0000/manifold_fig_latex/We give some LaTeX code to create figures of manifolds with boundaries.Open Data: Germany Maps Viz
/germany_plots/
Tue, 07 Jan 2020 00:00:00 +0000/germany_plots/In this post I want to show how to use publicly available (open) data to create geo visualizations in Python. Maps are a great way to communicate and compare information when working with geolocation data. There are many frameworks to plot maps; here I focus on matplotlib and geopandas (and give a glimpse of mplleaflet).
Reference: A very good introduction to matplotlib is the chapter on Visualization with Matplotlib from the Python Data Science Handbook by Jake VanderPlas.The Graph Laplacian & Semi-Supervised Clustering
/semi_supervised_clustering/
Thu, 05 Dec 2019 00:00:00 +0000/semi_supervised_clustering/In this post we want to explore the semi-supervised algorithm presented by Eldad Haber at the BMS Summer School 2019: Mathematics of Deep Learning, held 19 - 30 August 2019 at the Zuse Institute Berlin. He developed an implementation in Matlab which you can find in this GitHub repository. In addition, please find the corresponding slides here.
Prepare Notebook import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns; sns.The Laplacian on the 2-Torus
/laplacian_2torus/
Sun, 13 Oct 2019 00:00:00 +0000/laplacian_2torus/In this blog post I want to describe the explicit computation of the Laplacian on differential forms on the \(2\)-Torus \(T^2\subset \mathbb{R}^3\). This surface can be obtained by rotating the circle \((x-a)^2+z^2=r^2\) in the \(xz\)-plane around the \(z\)-axis (\(0<r<a\)). Locally, this surface can be parametrized by the equations \[ x = (a+r\cos u)\cos v,\\ y = (a+r\cos u)\sin v,\\ z = r\sin u, \]
where \(0<u,v<2\pi\).PyData Berlin 2019: Gaussian Processes for Time Series Forecasting (scikit-learn)
/gaussian_process_time_series/
Thu, 10 Oct 2019 00:00:00 +0000/gaussian_process_time_series/In this notebook we run some experiments to demonstrate how we can use Gaussian Processes in the context of time series forecasting with scikit-learn. This material is part of a talk on Gaussian Process for Time Series Analysis presented at the PyCon DE & PyData 2019 Conference in Berlin.
Update: Additional material and plots were included for the Second Symposium on Machine Learning and Dynamical Systems at The Fields Institute (virtual event).satRday Berlin 2019: Remedies for Severe Class Imbalance
/class_imbalance/
Sat, 15 Jun 2019 00:00:00 +0000/class_imbalance/In this post I present a concrete case study illustrating some techniques to improve model performance in class-imbalanced classification problems. The methodologies described here are based on Chapter 16: Remedies for Severe Class Imbalance of the (great!) book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. I absolutely recommend this reference to anyone interested in predictive modeling.
This notebook should serve as an extension of my talk given at satRday Berlin 2019: A conference for R users in Berlin.Seasonal Bump Functions
/bump_func/
Thu, 11 Apr 2019 00:00:00 +0000/bump_func/Motivated by the nice talk on Winning with Simple, even Linear, Models by Vincent D. Warmerdam, I briefly describe how to construct a certain class of bump functions to encode seasonal variables in R.
Prepare Notebook
library(glue)
library(lubridate)
library(magrittr)
library(tidyverse)
Generate Data
Let us generate a time sequence variable stored in a tibble.
# Define time sequence.
t <- seq.Date(from = as.Date("2017-07-01"), to = as.Date("2019-04-01"), by = "day")
# Store it in a tibble.An Introduction to Gaussian Process Regression
/gaussian_process_reg/
Mon, 08 Apr 2019 00:00:00 +0000/gaussian_process_reg/Updated Version: 2019/09/21 (Extension + Minor Corrections)
After a sequence of preliminary posts (Sampling from a Multivariate Normal Distribution and Regularized Bayesian Regression as a Gaussian Process), I want to explore a concrete example of Gaussian process regression. We continue following Gaussian Processes for Machine Learning, Ch 2.
Other recommended references are:
Gaussian Processes for Timeseries Modeling by S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson & S.Bayesian Regression as a Gaussian Process
/reg_bayesian_regression/
Mon, 01 Apr 2019 00:00:00 +0000/reg_bayesian_regression/In this post we study the Bayesian Regression model to explore and compare the weight-space and function-space views of Gaussian Process Regression as described in the book Gaussian Processes for Machine Learning, Ch 2. We follow this reference very closely (and encourage you to read it!). Our main objective is to illustrate the concepts and results through a concrete example. We use PyMC3 to run Bayesian sampling.
References:Sampling from a Multivariate Normal Distribution
/multivariate_normal/
Sat, 23 Mar 2019 00:00:00 +0000/multivariate_normal/In this post I want to describe how to sample from a multivariate normal distribution following section A.2 Gaussian Identities of the book Gaussian Processes for Machine Learning. This is a first step towards exploring and understanding Gaussian Processes methods in machine learning.
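As a rough illustration of the sampling recipe, here is a minimal pure-Python sketch assuming the standard Cholesky approach: factor the covariance as \(K = LL^T\), draw \(u \sim N(0, I)\), and return \(x = m + Lu\). The helper names `cholesky` and `sample_mvn`, the example mean and covariance, and the seed are illustrative assumptions and not the post's own code.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mvn(mean, low, rng):
    """Draw one sample x = mean + L u with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in mean]
    return [
        m + sum(low[i][k] * u[k] for k in range(i + 1))
        for i, m in enumerate(mean)
    ]


rng = random.Random(0)
mean = [1.0, -1.0]  # hypothetical example values
cov = [[2.0, 0.6], [0.6, 1.0]]
low = cholesky(cov)
samples = [sample_mvn(mean, low, rng) for _ in range(20_000)]

# Empirical means and cross-covariance, to compare with `mean` and `cov`.
m0 = sum(s[0] for s in samples) / len(samples)
m1 = sum(s[1] for s in samples) / len(samples)
c01 = sum((s[0] - m0) * (s[1] - m1) for s in samples) / len(samples)
print(round(m0, 2), round(m1, 2), round(c01, 2))
```

The sample has the right covariance because \(\text{Cov}(Lu) = L\,\text{Cov}(u)\,L^T = LL^T = K\).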
Multivariate Normal Distribution Recall that a random vector \(X = (X_1, \ldots, X_d)\) has a multivariate normal (or Gaussian) distribution if every linear combination
\[ \sum_{i=1}^{d} a_iX_i, \quad a_i\in\mathbb{R} \] is normally distributed.Dockerize a ShinyApp
/dockerize-a-shinyapp/
Sat, 02 Mar 2019 00:00:00 +0000/dockerize-a-shinyapp/In this post I want to describe how to dockerize a simple Shiny App. Docker is a great way of sharing and deploying projects. You can download it here.
Resources:
R Docker tutorial, recommended for Docker beginners.
Running a shiny app in a docker container by Mark Sellors (which is an updated and more complete version of this post).
Assume you have a project folder structure as follows:The Spectral Theorem for Matrices
/the-spectral-theorem-for-matrices/
Sat, 02 Feb 2019 00:00:00 +0000/the-spectral-theorem-for-matrices/When working in data analysis it is almost impossible to avoid using linear algebra, even if it is in the background, e.g. simple linear regression. In this post I want to discuss one of the most important theorems of finite-dimensional vector spaces: the spectral theorem. The objective is not to give a complete and rigorous treatment of the subject, but rather to show the main ingredients, some examples and applications.Movie Plots Text Generation with Keras
/movie_plot_text_gen/
Sun, 13 Jan 2019 00:00:00 +0000/movie_plot_text_gen/In this post I show some text generation experiments I ran using LSTM with Keras. For the preprocessing and tokenization I used SpaCy. The aim is not to present a completed project, but rather a first step which should be then iterated.
Resources There are many great resources and blog posts about the subject (and similar experiments). Here I mention the ones I found particularly useful for the general theory:Exploring the Curse of Dimensionality - Part II.
/exploring-the-curse-of-dimensionality-part-ii./
Tue, 01 Jan 2019 00:00:00 +0000/exploring-the-curse-of-dimensionality-part-ii./I continue exploring the curse of dimensionality. Following the analysis from Part I., I want to discuss another consequence of sparse sampling in high dimensions: sample points are close to an edge of the sample. This post is based on The Elements of Statistical Learning, Section 2.5, which I encourage you to read!
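To make the "close to an edge" claim concrete: for \(N\) points drawn uniformly from the \(p\)-dimensional unit ball, the probability that all of them lie farther than \(r\) from the origin is \((1 - r^p)^N\). Setting this equal to \(1/2\) gives the median distance to the closest point, \(d(p, N) = (1 - (1/2)^{1/N})^{1/p}\), which is the expression given in Section 2.5 of The Elements of Statistical Learning (approximately 0.52 for \(N = 500\), \(p = 10\)). The sketch below evaluates the formula and checks it against a small simulation; the function names, trial count and seed are illustrative assumptions.

```python
import random
import statistics


def median_nearest_distance(p, n):
    """Median distance from the origin to the nearest of n uniform points."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)


def simulated_median(p, n, n_trials=2000, seed=0):
    """Simulation check: only the radii matter, and a radius is U^(1/p)."""
    rng = random.Random(seed)
    mins = []
    for _ in range(n_trials):
        # The smallest radius is the smallest uniform draw raised to 1/p.
        mins.append(min(rng.random() for _ in range(n)) ** (1.0 / p))
    return statistics.median(mins)


for p in (1, 2, 10, 100):
    print(p, round(median_nearest_distance(p, n=500), 3))

print(round(simulated_median(p=10, n=500), 3))
```

For large \(p\) the value approaches one, so the nearest data point to the origin is typically closer to the boundary of the ball than to the origin itself.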
Uniform Sampling Consider \(N\) data points uniformly distributed in a \(p\)-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin.Text Mining, Networks and Visualization: Plebiscito Tweets
/text-mining-networks-and-visualization-plebiscito-tweets/
Thu, 20 Dec 2018 00:00:00 +0000/text-mining-networks-and-visualization-plebiscito-tweets/Nowadays social media generates a vast amount of raw data (text, images, videos, etc). It is a very interesting challenge to discover techniques to get insights on the content and development of social media data. In addition, as a fundamental component of the analysis, it is important to find ways of communicating the results, i.e. data visualization. In this post I want to present a small case study where I analyze Twitter text data.Exploring the Curse of Dimensionality - Part I.
/exploring-the-curse-of-dimensionality-part-i./
Sun, 09 Dec 2018 00:00:00 +0000/exploring-the-curse-of-dimensionality-part-i./In this post I want to present the notion of curse of dimensionality following a suggested exercise (Chapter 4 - Ex. 4) of the book An Introduction to Statistical Learning, written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
When the number of features \(p\) is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made.From Pelican to Blogdown
/pelican_to_blogdown/
Sun, 02 Dec 2018 00:00:00 +0000/pelican_to_blogdown/Here I want to discuss my transition from Pelican to Blogdown and present some personal learnings. In June 2017 I decided to build a personal website/portfolio. I chose Pelican, because:
It is written in Python, which was the programming language I was mainly working with.
I wanted to include some Jupyter notebooks I had already written.
There is a great post, Building a data science portfolio: Making a data science blog, explaining the procedure and how to use GitHub Pages to publish it.\(S^1\)-Equivariant Dirac operators on the Hopf Fibration
/hopf_fibration/
Sun, 11 Nov 2018 00:00:00 +0000/hopf_fibration/In this expository article I discuss the definition and basic properties of the Hopf fibration, with particular emphasis on Dirac-type operators induced, in the sense of Brüning and Heintze, by the Hodge-de Rham and spin-Dirac operators. In addition, we compute the Dirac-Schrödinger type operator introduced in my PhD thesis.Introduction to R Plumber: Expose a Caret model to a web API
/intro_plumber/
Fri, 12 Oct 2018 00:00:00 +0000/intro_plumber/In this post we present a simple example of how to expose a prediction model to a web API using the Plumber package.Circle Radius Fit for a Cloud of Points
/circle-radius-fit-for-a-cloud-of-points/
Sun, 09 Sep 2018 00:00:00 +0000/circle-radius-fit-for-a-cloud-of-points/We explore how to include an R notebook into a pelican post. As an example, we describe how to fit a circle onto a cloud of points.From Bachelor to PhD: Geometric and Topological Methods for Quantum Field Theory
/vdl_experience/
Thu, 02 Aug 2018 00:00:00 +0000/vdl_experience/We give an introduction to PyMC3, a probabilistic programming framework written in Python. We revise the basic mathematical theory and present two concrete examples.PyData Berlin 2018: On Laplacian Eigenmaps for Dimensionality Reduction
/laplacian_eigenmaps_dim_red/
Sun, 08 Jul 2018 00:00:00 +0000/laplacian_eigenmaps_dim_red/This post contains the slides and material from a talk I gave at PyData Berlin 2018. I presented the paper Laplacian Eigenmaps for Dimensionality Reduction and Data Representation by Mikhail Belkin and Partha Niyogi.Probability that a given observation is part of a bootstrap sample?
/bootstrap/
Wed, 29 Nov 2017 00:00:00 +0000/bootstrap/We study the problem of computing the probability that a given observation is part of a bootstrap sample. We include some numerical simulations.Induced Dirac-Schrödinger operators on semi-free circle quotients
/phd/
Sat, 11 Nov 2017 00:00:00 +0000/phd/I present the content of my PhD Thesis in mathematics, which has now been published in The Journal of Geometric Analysis.Introduction to Bayesian Modeling with PyMC3
/intro_pymc3/
Sun, 13 Aug 2017 00:00:00 +0000/intro_pymc3/We give an introduction to PyMC3, a probabilistic programming framework written in Python. We revise the basic mathematical theory and present two concrete examples.Web scraping with Beautiful Soup: Plebiscito Colombia (October 2nd)
/plebiscito/
Sun, 09 Jul 2017 00:00:00 +0000/plebiscito/We describe how to use Beautiful Soup to scrape the official government website in order to get the results of the peace referendum in Colombia.The Dirac operator on the 2-sphere
/the-dirac-operator-on-the-2-sphere/
Thu, 29 Jun 2017 00:00:00 +0000/the-dirac-operator-on-the-2-sphere/The objective of this post is to explore MathJax, a JavaScript display engine for LaTeX. This being my first post written with this tool, I want to present a short but fun example: I will give a description of the explicit computation of the spin-Dirac operator (of the unique complex spinor bundle!) on the 2-sphere \(S^2\) equipped with the standard round metric. A more detailed treatment can be found in my expository paper.Python Exercise: Distance to Rectangle
/rectangle/
Wed, 28 Jun 2017 00:00:00 +0000/rectangle/In this first post we get started with a small Python script to explore the basic capabilities of Pelican.About
/about/
Mon, 01 Jan 0001 00:00:00 +0000/about/Dr. Juan Camilo Orduz Mathematician & Data Scientist
Photo taken at PyConDE & PyData Berlin 2024. I am a mathematician (PhD from Humboldt Universität zu Berlin) and Data Scientist (Wolt and PyMC Labs). I am also involved in open source projects like PyMC and PyMC-Marketing (among others).
On this website you can find more information about me and some of my projects.
You can find the code associated with the blog posts on this GitHub repository.Curriculum Vitae
/cv/
Mon, 01 Jan 0001 00:00:00 +0000/cv/Work Experience 01/04/2024-Present Data Scientist / Open Source Developer, PyMC Labs, Remote.
01/08/2023-Present Senior Data Scientist (Pricing & Forecasting), Wolt, Berlin, Germany.
End-to-end deployed product-level (∼ 100K time series) demand forecasting models for Wolt Market. Scoping (product requirements and simulations) and planning pricing experimentation frameworks. 01/10/2021-31/07/2023 Senior Data Scientist (Marketing Tech), Wolt, Berlin, Germany.
Member of the marketing tech team, a cross-functional product team. I am leading the data science projects from conceptualization and modeling to deployment.