The role of a Data Scientist has evolved quickly, and by its very nature it will keep changing. Nevertheless, many people agree that the true value of data science lies in deploying models into production, where they have a concrete impact (I am not saying that static analyses/reports are not important!). As a matter of fact, in practical machine learning projects the time spent actually building the models is much less than the time spent on setting up data requirements, data processing and feature engineering. The final step, deploying the model into a production system, often comes as a collaboration between data scientists and data engineers. It is therefore very important for data scientists to develop skills which make such collaboration easy and productive. Such skills include, among many others, version control (git) and Docker.
There are many interesting resources out there on this topic. Last week I was lucky to find one which I particularly liked: http://mlinproduction.com/ by Luigi Patruno. He has a sequence of blog posts illustrating the path ML model -> Docker -> Kubernetes. This motivated me to write a GitHub repository (github.com/juanitorduz/ml_prod_tutorial) on this topic as a complement to his blog. I focused on his posts about dockerizing machine learning models:
- Docker for Machine Learning – Part I
- Docker for Machine Learning – Part II
- Docker for Machine Learning – Part III
- Using Docker to Generate Machine Learning Predictions in Real Time
The core of this repository is based on these posts, and I encourage everyone to read them before going into the code if you do not have much experience with this topic.
In addition to the main functionality, I also wanted to complement it with extra features which are not discussed in Luigi’s posts:
- Train the machine learning model with data stored in a private AWS S3 bucket. In particular, when building the Docker image, do not copy the training data into it. (Most Docker tutorials use either dummy data from a package or a file stored in the repository, which is not how it works in real applications.)
- Describe how to build a Docker image passing the credentials as ARG variables and not as ENV variables. This is particularly important for security reasons.
- Unit testing (needs more tests).
- Set up GitHub Actions, which get triggered on push and help ensure the reliability of the code:
  - Docker container GitHub Action. In particular, how to access credentials using GitHub Secrets.
  - Python (pytest) Action.
  - Codecov GitHub Action to get reports on test coverage.
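To make the first two points more concrete, here is a minimal sketch in Python. It is not the code from the repository: the function names, the image tag, the bucket and the object key are all hypothetical. It assumes the Dockerfile declares ARG AWS_ACCESS_KEY_ID and ARG AWS_SECRET_ACCESS_KEY, that the credentials are set in the calling environment, and that the training data is a CSV object. The first function assembles a docker build call that forwards the credentials as build arguments; the second reads the training data straight from S3 instead of copying it into the image.

```python
import csv
import io
import os

# Standard AWS credential variable names, also read by boto3.
CREDENTIAL_NAMES = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def docker_build_command(image_tag, env=None):
    """Assemble a `docker build` call that forwards AWS credentials as build args.

    Assumes the Dockerfile declares matching `ARG` instructions.
    """
    env = os.environ if env is None else env
    command = ["docker", "build", "-t", image_tag]
    for name in CREDENTIAL_NAMES:
        if name not in env:
            raise KeyError(f"missing credential: {name}")
        # `--build-arg NAME` without a value tells Docker to take the value
        # from the calling environment, so the secret is not typed into the
        # command itself.
        command += ["--build-arg", name]
    command.append(".")
    return command


def load_training_rows(s3_client, bucket, key):
    """Download a CSV object from S3 and parse it into a list of dicts."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical usage (requires boto3, Docker and real credentials):
#
#   import subprocess
#   import boto3
#
#   subprocess.run(docker_build_command("ml-model"), check=True)
#   rows = load_training_rows(boto3.client("s3"), "my-bucket", "train.csv")
```

Passing the S3 client as an argument keeps the loading function easy to unit test with a fake client, which fits the pytest setup mentioned above.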
I have also added a Resources section where I store useful references, interesting reading and similar approaches.
Remark: This repository structure should be seen as a minimal toy use case. You might want to package the files properly, using for example Cookiecutter Data Science.
Deployment: This is out of the scope of this repository. There are many ways of doing this! See, for example, Deploying Elastic Beanstalk applications from Docker containers.
Contributing
I will keep adding functionality to this toy model repository. If you have other relevant resources, suggestions or comments, or if you find bugs, please create a Pull Request or drop me a line.