You should always care about the reproducibility of your data analysis. Most related resources focus on how the data itself is processed; this article is about maintaining a unified data-science technology stack.
An original version of this text was published on the Kaggle blog. The main idea is to use Docker containers as a central place for storing all the necessary libraries and tools. Such a container can later be exported and imported across different environments, making sure that everything works exactly the same.
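As a minimal sketch of that export/import workflow (the archive file name is arbitrary), docker save writes an image to a tar archive and docker load restores it on another machine:

# On the source machine: write the image to an archive
docker save -o kaggle-python.tar kaggle/python

# On the target machine: restore the image from the archive
docker load -i kaggle-python.tar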
These are alternative commands that can be used to spin up the container. Mind the differences from the original post:
- adapted to work on *nix systems,
- no explicit Docker Machine is needed,
- ability to render graphics (plots, etc.); see the note after this list.
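A note on rendering graphics: mounting the X11 socket and passing DISPLAY (as in the commands below) is usually enough, but depending on your X server's access control you may also need to permit local connections first, for example:

xhost +local:

This loosens access control for local (non-network) clients only; you can tighten it again with xhost -local: when done.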
- Make sure that the Docker Engine is installed.
- Download the kaggle/python image (notice that it is nearly 8 GB):
docker pull kaggle/python
- At the end of ~/.bashrc add new aliases wrapping the commands below (a complete sketch follows this list):
docker run -v $PWD:/tmp/working -v /tmp/.X11-unix:/tmp/.X11-unix -w=/tmp/working -e DISPLAY=$DISPLAY --rm -it kaggle/python python "$@"
docker run -v $PWD:/tmp/working -v /tmp/.X11-unix:/tmp/.X11-unix -w=/tmp/working -e DISPLAY=$DISPLAY --rm -it kaggle/python ipython
(sleep 3 && sensible-browser "http://127.0.0.1:8888")&
docker run -v $PWD:/tmp/working -w=/tmp/working -p 8888:8888 --rm -it kaggle/python jupyter notebook --no-browser --ip=0.0.0.0 --notebook-dir=/tmp/working
- Reload the shell (e.g. with source ~/.bashrc).
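For reference, here is a sketch of how the three commands can be wrapped in ~/.bashrc. The names kpython, ikpython, and kjupyter are only a convention (they follow the shortcuts used in the original Kaggle post); kpython and kjupyter are shell functions rather than plain aliases, the former so that "$@" can forward script arguments:

# Run a Python script inside the container, forwarding any arguments
kpython() {
  docker run -v $PWD:/tmp/working -v /tmp/.X11-unix:/tmp/.X11-unix -w=/tmp/working -e DISPLAY=$DISPLAY --rm -it kaggle/python python "$@"
}

# Start an interactive IPython session inside the container
alias ikpython='docker run -v $PWD:/tmp/working -v /tmp/.X11-unix:/tmp/.X11-unix -w=/tmp/working -e DISPLAY=$DISPLAY --rm -it kaggle/python ipython'

# Jupyter Notebook: open the browser after a short delay, then start the server
kjupyter() {
  (sleep 3 && sensible-browser "http://127.0.0.1:8888") &
  docker run -v $PWD:/tmp/working -w=/tmp/working -p 8888:8888 --rm -it kaggle/python jupyter notebook --no-browser --ip=0.0.0.0 --notebook-dir=/tmp/working
}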
These three commands spin up disposable containers (note the --rm flag) for the desired Python tasks: a plain Python interpreter, an IPython shell, and a Jupyter Notebook server, respectively.
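With those definitions in place, typical usage might look like this (the script name is hypothetical):

kpython my_analysis.py   # run a script in the container
ikpython                 # interactive IPython session
kjupyter                 # Jupyter Notebook at http://127.0.0.1:8888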