Reproducible infrastructure for Data Scientist

You should always care about the reproducibility of your data analysis. Many related resources focus on how the data is processed. This article is about maintaining a unified data science technology stack.

The original version of this text was published on the Kaggle blog. The main idea is to use Docker containers as the central place for storing all the necessary libraries and tools. Such a container can later be exported and imported in different environments, ensuring that everything works exactly the same.

These are alternative commands that can be used to spin up the container.

Mind the differences from the original post:

  • adapted to work on *nix systems,
  • no explicit Docker Machine is needed,
  • graphics (plots, etc.) can be rendered.


  • Make sure that the Docker Engine is installed.
  • Download the kaggle/python image (note that it is nearly 8 GB)

  • At the end of ~/.bashrc add new aliases:

  • Reload the shell

These three commands spin up containers on the fly to execute the desired Python tasks.
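The setup steps above can be sketched roughly as follows. This is a minimal sketch, not the article's exact commands: the names kpython, ikpython and kjupyter, the /tmp/working mount point and port 8888 are assumptions modelled on the Kaggle blog post, and shell functions are used in place of literal aliases so that arguments are passed through explicitly. Any flags needed for rendering graphics are not covered here.

```shell
# One-time setup, run manually (the image is nearly 8 GB):
#   docker pull kaggle/python
# After adding the functions below to the end of ~/.bashrc, reload the shell:
#   source ~/.bashrc

# Run a Python script from the current directory inside the container.
# -v mounts the current directory, -w makes it the working directory,
# --rm deletes the container when the command finishes.
kpython() {
  docker run -v "$PWD":/tmp/working -w /tmp/working --rm -it kaggle/python python "$@"
}

# Start an interactive IPython session in the container.
ikpython() {
  docker run -v "$PWD":/tmp/working -w /tmp/working --rm -it kaggle/python ipython "$@"
}

# Start a Jupyter notebook server and publish it on host port 8888.
kjupyter() {
  docker run -v "$PWD":/tmp/working -w /tmp/working -p 8888:8888 --rm -it kaggle/python \
    jupyter notebook --no-browser --ip=0.0.0.0 --notebook-dir=/tmp/working
}
```

With these in place, `kpython script.py` would run a script from the current directory (script.py is a placeholder name). Depending on the Jupyter version shipped in the image, the notebook server may refuse to start as root unless `--allow-root` is added.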

Happy coding.


  • renjithmadhavan

    Is this using the default container?

    • Hi Renjith, all three commands use the kaggle/python image for the container.

      • renjithmadhavan

        So it seems like Windows is not the ideal environment for Docker. I was trying this on Windows 10 Pro with Docker and ran into all sorts of issues.

        • I mentioned in the article that the commands are adapted to work on *nix systems. You would have to adjust some things (such as directory paths or environment variables) to make them work on Windows.

          • renjithmadhavan

            Oh yeah, I saw that. But I finally gave up on Windows after reaching my frustration threshold. The learning curve on Windows would be longer for me at this point. Actually, the last challenging part was getting the volumes mounted using the -v switch. Good thing I switched to Ubuntu, where it worked like a charm. Thank you.

  • Sagar Patel

    I was so frustrated after following the original Kaggle post for hours and still not getting my system to work. But your commands worked like a charm. Thank you very much.

  • Do you store your notebooks and datasets under /tmp/working or somewhere else? (After a reboot of the Mac, everything under /tmp gets flushed away.) Also, what directory do you download Kaggle datasets to so that you can read them from the notebook?

    • Norbert Kozlowski

      The command

      will mount everything from your current directory into

      inside container.

      In your case you probably want to run the command from the directory that contains all the required files.
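As a rough illustration of that advice, here is a minimal sketch that changes into a chosen host directory before starting the container. It assumes the mount point is /tmp/working, as in the question above; run_in_dir is a hypothetical helper name and not part of the article. Because the files live in the host directory and are only mounted into the container, they stay on the host after the container exits.

```shell
# Hypothetical helper: run a Python command in kaggle/python with the given
# host directory mounted at /tmp/working (assumed mount point).
run_in_dir() {
  dir="$1"
  shift
  # The subshell keeps the caller's working directory unchanged.
  (
    cd "$dir" || exit 1
    docker run -v "$PWD":/tmp/working -w /tmp/working --rm -it kaggle/python python "$@"
  )
}
```

For example, `run_in_dir ~/projects/titanic train.py` would run train.py with that directory mounted (both the path and the script name are placeholders).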