Dr. Juan Camilo Orduz
/
Recent content on Dr. Juan Camilo Orduz (built with Hugo, gohugo.io). Last updated: Sat, 15 Jun 2019.

satRday Berlin 2019: Remedies for Severe Class Imbalance
/class_imbalance/
Sat, 15 Jun 2019

In this post I present a concrete case study illustrating some techniques to improve model performance in class-imbalanced classification problems. The methodologies described here are based on Chapter 16: Remedies for Severe Class Imbalance of the (great!) book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. I absolutely recommend this reference to anyone interested in predictive modeling.
This notebook should serve as an extension of my talk given at satRday Berlin 2019, a conference for R users in Berlin.

Seasonal Bump Functions
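One of the remedies discussed in that chapter is cost-sensitive learning, where minority-class errors are penalized more heavily. As a stdlib-only Python sketch of the idea (illustrative only; the talk itself works in R), here is the "balanced" weighting scheme in which each class receives a weight inversely proportional to its frequency:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency, the scheme
    behind e.g. scikit-learn's class_weight='balanced'."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# A 9:1 imbalanced label vector (made-up data).
y = ["ok"] * 90 + ["event"] * 10
weights = balanced_class_weights(y)
print(weights)  # the minority class ('event') gets a 9x larger weight
```

These weights are then passed to the loss of the classifier, so that misclassifying a minority-class observation costs nine times more than a majority-class one.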
/bump_func/
Thu, 11 Apr 2019

Motivated by the nice talk on Winning with Simple, even Linear, Models by Vincent D. Warmerdam, I briefly describe how to construct a certain class of bump functions to encode seasonal variables in R.
Prepare Notebook

library(glue)
library(lubridate)
library(magrittr)
library(tidyverse)

Generate Data

Let us generate a time sequence variable stored in a tibble.

# Define time sequence.
t <- seq.Date(from = as.Date("2017-07-01"), to = as.Date("2019-04-01"), by = "day")
# Store it in a tibble.

An Introduction to Gaussian Process Regression
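The bump-function idea from the post above translates directly to other languages. A minimal Python sketch of one such function, a Gaussian kernel on circular day-of-year distance (the 365-day year and the width are illustrative choices of mine, not taken from the post):

```python
import math

def seasonal_bump(day_of_year, center, width=30.0):
    """Gaussian-style bump feature: close to 1 near `center`,
    decaying smoothly to 0 away from it (circular in the year)."""
    # Circular distance so late December counts as close to early January.
    delta = min(abs(day_of_year - center), 365 - abs(day_of_year - center))
    return math.exp(-0.5 * (delta / width) ** 2)

# Encode "summer-ness" with a bump centered at day 200 (mid July).
print(round(seasonal_bump(200, center=200), 3))  # 1.0 at the center
print(round(seasonal_bump(20, center=200), 3))   # essentially 0 far away
```

A linear model fed with a few such bumps can capture smooth seasonal patterns without any explicit nonlinearity.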
/gaussian_process_reg/
Mon, 08 Apr 2019

After a sequence of preliminary posts (Sampling from a Multivariate Normal Distribution and Regularized Bayesian Regression as a Gaussian Process), I want to explore a concrete example of a Gaussian process regression. We continue following Gaussian Processes for Machine Learning, Ch 2, by C. E. Rasmussen and C. K. I. Williams. Another recommended reference is the work on Gaussian Processes for Timeseries Modeling by S. Roberts, M. Osborne, M. Ebden, S.

Regularized Bayesian Regression as a Gaussian Process
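At the heart of such an example is the covariance function. A small stdlib Python sketch of the squared exponential kernel used throughout Rasmussen & Williams, Ch 2 (the parameter names here are my own):

```python
import math

def squared_exponential(x1, x2, length_scale=1.0, sigma_f=1.0):
    """Squared exponential (RBF) covariance k(x1, x2)."""
    return sigma_f**2 * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def cov_matrix(xs, **kwargs):
    """Gram matrix K with K[i][j] = k(xs[i], xs[j])."""
    return [[squared_exponential(a, b, **kwargs) for b in xs] for a in xs]

K = cov_matrix([0.0, 0.5, 1.0])
print(K[0][0])  # 1.0 on the diagonal: k(x, x) = sigma_f^2
```

Nearby inputs get covariance close to 1 and distant inputs close to 0, which is what makes the sampled functions smooth.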
/reg_bayesian_regression/
Mon, 01 Apr 2019

In this post we study a regularized Bayesian regression model to explore and compare the weight space and function space views of Gaussian process regression, as described in the book Gaussian Processes for Machine Learning, Ch 2. We follow this reference very closely (and encourage you to read it!). Our main objective is to illustrate the concepts and results through a concrete example. We use PyMC3 to run Bayesian sampling.

Sampling from a Multivariate Normal Distribution
/multivariate_normal/
Sat, 23 Mar 2019

In this post I want to describe how to sample from a multivariate normal distribution following section A.2 Gaussian Identities of the book Gaussian Processes for Machine Learning. This is a first step towards exploring and understanding Gaussian process methods in machine learning.
Multivariate Normal Distribution

Recall that a random vector \(X = (X_1, \cdots, X_d)\) has a multivariate normal (or Gaussian) distribution if every linear combination
$$ \sum_{i=1}^{d} a_iX_i, \quad a_i\in\mathbb{R} $$

is normally distributed.

Dockerize a ShinyApp
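The sampling recipe from section A.2 boils down to: factor the covariance matrix as \(\Sigma = LL^T\) (Cholesky), draw \(z \sim N(0, I)\), and return \(x = \mu + Lz\). A self-contained stdlib Python sketch (in practice this is a couple of numpy calls):

```python
import math, random

def cholesky(S):
    """Lower-triangular L with L @ L.T == S, for S symmetric positive definite."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(S[i][i] - s) if i == j else (S[i][j] - s) / L[j][j]
    return L

def sample_mvn(mu, Sigma, rng=random):
    """Draw x = mu + L z with z standard normal, following the A.2 recipe."""
    L = cholesky(Sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(len(mu))) for i, m in enumerate(mu)]

x = sample_mvn([0.0, 1.0], [[2.0, 0.6], [0.6, 1.0]])
```

The affine map preserves Gaussianity, and the covariance of \(Lz\) is exactly \(LL^T = \Sigma\), which is why the recipe works.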
/dockerize-a-shinyapp/
Sat, 02 Mar 2019

In this post I want to describe how to dockerize a simple Shiny App. Docker is a great way of sharing and deploying projects. You can download it here.
I highly recommend the R Docker tutorial.
Assume you have a project folder structure as follows:
.
+-- project.Rproj
+-- app.R
+-- R
|   +-- script_1.R
|   +-- script_2.R
+-- data
|   +-- data_df.rds
|   +-- raw_data.csv

The script app.

The Spectral Theorem for Matrices
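For a project laid out like this, the Dockerfile can stay very small. A minimal sketch assuming the rocker/shiny base image; the package list and paths are illustrative, not the post's actual Dockerfile:

```dockerfile
# Minimal sketch, assuming the folder structure above.
FROM rocker/shiny:latest

# Install the R package dependencies the app uses (illustrative list).
RUN R -e "install.packages(c('tidyverse'), repos = 'https://cloud.r-project.org')"

# Copy the whole project into the Shiny Server app directory.
COPY . /srv/shiny-server/myapp

# Shiny Server's default port.
EXPOSE 3838
```

Building with `docker build -t myapp .` and running with the port published then serves the app from the container.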
/the-spectral-theorem-for-matrices/
Sat, 02 Feb 2019

When working in data analysis it is almost impossible to avoid using linear algebra, even if it is in the background, e.g. simple linear regression. In this post I want to discuss one of the most important theorems of finite dimensional vector spaces: the spectral theorem. The objective is not to give a complete and rigorous treatment of the subject, but rather to show the main ingredients, some examples and applications.

Movie Plots Text Generation with Keras
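For the 2x2 symmetric case, the spectral theorem can be verified by hand. A small Python illustration of my own (not from the post) computing the real eigenvalues the theorem guarantees:

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]].
    By the spectral theorem they are always real."""
    mean = (a + c) / 2.0
    # The discriminant is a sum of squares, hence never negative:
    # this is exactly where symmetry guarantees real eigenvalues.
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b**2)
    return mean - radius, mean + radius

lo, hi = eig_sym_2x2(2.0, 1.0, 2.0)
print(lo, hi)  # 1.0 3.0
```

For a non-symmetric matrix the term under the square root can be negative, so the eigenvalues can be complex; symmetry rules this out.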
/movie_plot_text_gen/
Sun, 13 Jan 2019

In this post I show some text generation experiments I ran using LSTMs with Keras. For the preprocessing and tokenization I used SpaCy. The aim is not to present a completed project, but rather a first step which should then be iterated.
Resources

There are many great resources and blog posts about the subject (and similar experiments). Here I mention the ones I found particularly useful for the general theory:

Exploring the Curse of Dimensionality - Part II.
/exploring-the-curse-of-dimensionality-part-ii./
Tue, 01 Jan 2019

I continue exploring the curse of dimensionality. Following the analysis from Part I, I want to discuss another consequence of sparse sampling in high dimensions: sample points are close to an edge of the sample. This post is based on The Elements of Statistical Learning, Section 2.5, which I encourage you to read!
Uniform Sampling

Consider \(N\) data points uniformly distributed in a \(p\)-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin.

Text Mining, Networks and Visualization: Plebiscito Tweets
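The edge effect described above can be quantified: for \(N\) points uniform in the \(p\)-dimensional unit ball, ESL gives the median distance from the origin to its nearest neighbor as \(d(p, N) = (1 - (1/2)^{1/N})^{1/p}\). A quick Python check:

```python
def median_nn_distance(p, N):
    """Median distance from the origin to its nearest neighbor among
    N points uniform in the p-dimensional unit ball (ESL, Section 2.5)."""
    return (1 - 0.5 ** (1.0 / N)) ** (1.0 / p)

# With 500 points in 10 dimensions the nearest point is already
# more than halfway to the boundary of the ball.
print(round(median_nn_distance(10, 500), 3))  # roughly 0.52, as quoted in ESL
```

So even a "large" sample leaves the origin closer to the edge of the data than to any observation, which is what breaks local methods in high dimensions.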
/text-mining-networks-and-visualization-plebiscito-tweets/
Thu, 20 Dec 2018

Nowadays social media generates a vast amount of raw data (text, images, videos, etc.). It is a very interesting challenge to discover techniques to get insights into the content and development of social media data. In addition, as a fundamental component of the analysis, it is important to find ways of communicating the results, i.e. data visualization. In this post I want to present a small case study where I analyze Twitter text data.

Exploring the Curse of Dimensionality - Part I.
/exploring-the-curse-of-dimensionality-part-i./
Sun, 09 Dec 2018

In this post I want to present the notion of the curse of dimensionality following a suggested exercise (Chapter 4 - Ex. 4) of the book An Introduction to Statistical Learning, written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
When the number of features \(p\) is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made.

From Pelican to Blogdown
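The exercise makes this deterioration concrete: if each of \(p\) features is uniform on \([0, 1]\) and the local method uses the 10% of each feature's range closest to the test point, the expected fraction of available observations shrinks geometrically with \(p\). A quick Python check:

```python
def fraction_used(p, r=0.1):
    """Expected fraction of uniformly distributed observations available in a
    neighborhood covering a fraction r of the range of each of p features."""
    return r ** p

print(fraction_used(1))    # 0.1: 10% of the data in one dimension
print(fraction_used(100))  # about 1e-100: essentially no neighbors at all
```

Already at \(p = 100\) a "local" neighborhood contains essentially none of the training data, so KNN must either extrapolate or use neighborhoods that are no longer local.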
/pelican_to_blogdown/
Sun, 02 Dec 2018

Here I want to discuss my transition from Pelican to Blogdown and present some personal learnings. In June 2017 I decided to build a personal website/portfolio. I chose Pelican because:
It is written in Python, which was the programming language I was mainly working with.
I wanted to include some Jupyter notebooks I had already written.
There is a great post, Building a data science portfolio: Making a data science blog, explaining the procedure and using GitHub Pages to publish it.

About
/about/
Sat, 01 Dec 2018

Dr. Juan Camilo Orduz, Mathematician & Data Scientist
I hold a PhD and a Master's degree in Mathematics from Humboldt Universität zu Berlin, where I worked under the supervision of Prof. Jochen Brüning. My graduate studies were supported by the Berlin Mathematical School. Here I share my experience. Before coming to Berlin I completed two bachelor's degrees, in Mathematics and Physics, at Universidad de los Andes.
My research interests are differential geometry, topology and geometric analysis, in particular variants of the Atiyah-Singer Index Theorem for singular spaces.

\(S^1\)-Equivariant Dirac operators on the Hopf Fibration
/hopf_fibration/
Sun, 11 Nov 2018

In this post I want to present some notes on the fundamentals of the Hopf fibration \(\pi: S^3 \longrightarrow S^2\), which I started writing during my PhD period at Humboldt Universität zu Berlin. First, I describe its definition and show that it is a non-trivial map by computing its first Chern class. Then I give a detailed treatment of the construction of two important Dirac-type operators: the Hodge-de Rham and the spin-Dirac.

Introduction to R Plumber: Expose a Caret model to a web API
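For reference, viewing \(S^3 \subset \mathbb{C}^2\) and \(S^2 \subset \mathbb{C} \times \mathbb{R}\), one standard way to write the Hopf map \(\pi\) of the post above is

$$ \pi(z_1, z_2) = \left( 2 z_1 \bar{z}_2, \; |z_1|^2 - |z_2|^2 \right), \qquad |z_1|^2 + |z_2|^2 = 1. $$

A direct computation shows \(|2 z_1 \bar{z}_2|^2 + (|z_1|^2 - |z_2|^2)^2 = (|z_1|^2 + |z_2|^2)^2 = 1\), so the image indeed lies on \(S^2\).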
/intro_plumber/
Fri, 12 Oct 2018

In this post we explore the basics of the Plumber package. Our aim is to illustrate how to fit an \(L^2\)-regularized linear model and expose it as a web API so that we can request predictions.
Prepare Notebook

Let us load the necessary libraries.
library(caret)
library(httr)
library(magrittr)
library(plumber)
library(tidyverse)

Load Data

As a toy example we consider the mtcars data set.

df <- mtcars %>% as_tibble()

df %>% head
## # A tibble: 6 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    21     6   160   110    3.

Circle Radius Fit for a Cloud of Points
/circle-radius-fit-for-a-cloud-of-points/
Sun, 09 Sep 2018

In this post I explore how to render a .Rmd file directly with blogdown. To play around with it, I wrote a simple R notebook which fits a circle to a cloud of points.
Prepare the Notebook

library(tidyverse)

Generate Circle Data

# Dimension of the space.
d <- 2
# Number of sample points.
N <- 1000
# Radius.
R <- 4
# Generate random sample of points (x - axis).

From Bachelor to PhD: Geometric and Topological Methods for Quantum Field Theory
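The circle-fitting idea from the post above can be sketched in a few lines of stdlib Python (the post itself works in R with tidyverse; the noise level here is an illustrative choice): center the cloud at its centroid, then take the mean distance to the centroid as the radius estimate.

```python
import math, random

def fit_radius(points):
    """Estimate a circle's radius from noisy points: center at the
    centroid, radius as the mean distance to it."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / n

# Sample N = 1000 points on a circle of radius R = 4 with radial noise.
rng = random.Random(42)
R = 4.0
pts = []
for _ in range(1000):
    theta = rng.uniform(0.0, 2.0 * math.pi)
    r = R + rng.gauss(0.0, 0.1)
    pts.append((r * math.cos(theta), r * math.sin(theta)))
print(round(fit_radius(pts), 2))  # prints a value very close to 4
```

This simple estimator works because the radial noise has mean zero; for stronger noise or partial arcs a least-squares circle fit would be preferable.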
/vdl_experience/
Thu, 02 Aug 2018

In this post I want to write about my experience as a participant (on various occasions) in the Summer School on Geometric, Algebraic and Topological Methods in Quantum Field Theory in Villa de Leyva, Colombia.
The idea of writing this post came to me in my last visit to Villa de Leyva in May 2018.
This picture was taken in front of Villa de Leyva’s main church.

PyData Berlin 2018: On Laplacian Eigenmaps for Dimensionality Reduction
/laplacian_eigenmaps_dim_red/
Sun, 08 Jul 2018

This summer I had the great opportunity to attend and give a talk at PyData Berlin 2018. The topic of my talk was On Laplacian Eigenmaps for Dimensionality Reduction. During the Unsupervised Learning & Visualization session, Dr. Stefan Kühn presented a very interesting and visual talk on Manifold Learning and Dimensionality Reduction for Data Visualization and Feature Engineering, and Dr. Evelyn Trautmann gave a nice application of spectral clustering (here are some nice notes around this subject) in the context of Extracting relevant Metrics.

Probability that a given observation is part of a bootstrap sample?
/bootstrap/
Wed, 29 Nov 2017

“The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.” This is how Section 5.2 of the book An Introduction to Statistical Learning, written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, begins. The bootstrap can be used to estimate the standard errors of the coefficients from a linear regression and is the conceptual basis for some tree ensemble algorithms.

Induced Dirac-Schrödinger operators on semi-free circle quotients
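The answer to the title question above has a neat closed form: drawing \(n\) times with replacement, a fixed observation is missed with probability \((1 - 1/n)^n\), so it appears with probability \(1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632\). A quick check in Python:

```python
import math

def prob_in_bootstrap(n):
    """Probability that a fixed observation appears at least once
    in a bootstrap sample of size n drawn with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

print(round(prob_in_bootstrap(100), 4))  # 0.634, already close to the limit
print(round(1.0 - math.exp(-1.0), 4))    # 0.6321, the limit 1 - 1/e
```

This is why roughly a third of the observations are "out of bag" in each bootstrap resample, a fact that bagging and random forests exploit for validation.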
/phd/
Sat, 11 Nov 2017

In this post I present the results of my Mathematics PhD thesis, which I wrote under the supervision of Prof. Dr. Jochen Brüning at Humboldt Universität zu Berlin. I gratefully acknowledge the financial support of the Berlin Mathematical School and the research project SFB 647: Space - Time - Matter.
Abstract:
John Lott has computed an integer-valued signature for the orbit space of a compact orientable \((4k+1)\)-manifold with a semi-free \(S^1\)-action, which is a homotopy invariant of that space, but he did not construct a Dirac-type operator which has this signature as its index.

Introduction to Bayesian Modeling with PyMC3
/intro_pymc3/
Sun, 13 Aug 2017

This post is devoted to giving an introduction to Bayesian modeling using PyMC3, an open source probabilistic programming framework written in Python. Part of this material was presented at the Python Users Berlin (PUB) meetup.
Why PyMC3? As described in the documentation:
PyMC3’s user-facing features are written in pure Python, it leverages Theano to transparently transcode models to C and compile them to machine code, thereby boosting performance.

Web scraping with Beautiful Soup: Plebiscito Colombia (October 2nd)
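PyMC3 hides the sampling machinery behind that compiled model. To show what "Bayesian sampling" means underneath, here is a deliberately tiny stdlib-only Metropolis sampler for a coin-bias model with a uniform prior; this is an illustration of the idea only, not how PyMC3 works internally (PyMC3 defaults to the more sophisticated NUTS sampler).

```python
import math, random

def metropolis_coin(heads, flips, n_samples=5000, step=0.1, seed=0):
    """Tiny Metropolis sampler for the bias p of a coin with a uniform
    prior -- the kind of model one writes in a few lines of PyMC3."""
    rng = random.Random(seed)

    def log_post(p):
        # Log posterior up to a constant: binomial likelihood, flat prior.
        if not 0.0 < p < 1.0:
            return -math.inf
        return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)

    p, samples = 0.5, []
    for _ in range(n_samples):
        proposal = p + rng.gauss(0.0, step)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_post(proposal) - log_post(p))):
            p = proposal
        samples.append(p)
    return samples

draws = metropolis_coin(heads=7, flips=10)
# The sample mean should land near the exact posterior mean (7+1)/(10+2) = 0.67.
print(round(sum(draws) / len(draws), 2))
```

The probabilistic-programming promise is exactly that you state `log_post` declaratively and the framework supplies a far better sampler than this one.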
/plebiscito/
Sun, 09 Jul 2017

Web scraping: Getting referendum data using Beautiful Soup

In this post I am going to describe how to get the data of the peace referendum (which took place on October 2nd in Colombia) from the official government website using Beautiful Soup in Python (this task was suggested by Sebastian Martinez). The data is not directly available but is represented as follows:
The aim is to get the percentages of votes for each town in Colombia by scraping the website.

The Dirac operator on the 2-sphere
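To give a flavor of the scraping step without the real page, here is a stdlib-only sketch using html.parser on made-up markup; the class names and structure are invented for illustration, and the post itself uses Beautiful Soup on the official site.

```python
from html.parser import HTMLParser

# Illustrative markup only -- the real referendum page is more complex.
HTML = """
<div class="municipio"><span class="nombre">Bogota</span>
<span class="pct">56.22%</span></div>
<div class="municipio"><span class="nombre">Medellin</span>
<span class="pct">63.00%</span></div>
"""

class PctParser(HTMLParser):
    """Collect (town, percentage) pairs; with Beautiful Soup this
    would be a short find_all over the parsed tree."""
    def __init__(self):
        super().__init__()
        self.field, self.rows, self.current = None, [], None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("nombre", "pct"):
            self.field = cls

    def handle_data(self, data):
        if self.field == "nombre":
            self.current = data.strip()
        elif self.field == "pct":
            self.rows.append((self.current, data.strip()))
        self.field = None

parser = PctParser()
parser.feed(HTML)
print(parser.rows)  # [('Bogota', '56.22%'), ('Medellin', '63.00%')]
```

Beautiful Soup wraps exactly this kind of event-based parsing in a much friendlier tree interface, which is why it is the natural choice for the real page.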
/the-dirac-operator-on-the-2-sphere/
Thu, 29 Jun 2017

The objective of this post is to explore MathJax, a JavaScript display engine for LaTeX. As this is my first post written with this tool, I want to present a short but fun example: I will describe the explicit computation of the spin-Dirac operator (of the unique complex spinor bundle!) on the 2-sphere \(S^2\) equipped with the standard round metric. A more detailed treatment can be found in my expository paper.

Python Exercise: Distance to Rectangle
/rectangle/
Wed, 28 Jun 2017

In this first post I wanted to explore the basics of blogging with blogdown. I treat an example of a little Python challenge which I encountered in my first job hunt. I particularly like it because it is a geometric problem.
Problem

Write a function that tests if a point falls within a specified distance “dist” of any part of a solid, 2D rectangle. The rectangle is specified by the bottom left corner, a width, and a height.
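A compact solution sketch in Python (my own take, not necessarily the post's): clamp the point to the rectangle to find the nearest rectangle point, then compare that distance with dist.

```python
import math

def within_dist(px, py, left, bottom, width, height, dist):
    """True if point (px, py) is within `dist` of a solid rectangle
    given by its bottom-left corner, width and height."""
    # Gap along each axis between the point and the rectangle
    # (zero when the coordinate lies inside the rectangle's extent).
    dx = max(left - px, 0.0, px - (left + width))
    dy = max(bottom - py, 0.0, py - (bottom + height))
    return math.hypot(dx, dy) <= dist

print(within_dist(5.0, 5.0, 0.0, 0.0, 4.0, 4.0, 1.5))  # True: corner gap is sqrt(2)
print(within_dist(2.0, 2.0, 0.0, 0.0, 4.0, 4.0, 0.0))  # True: inside the rectangle
```

Because the rectangle is solid, interior points have zero gap along both axes, so they are within any non-negative distance; the clamping handles edges and corners uniformly.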