Reproducible infrastructure for Data Scientists

You should always care about the reproducibility of your data analysis. A lot of related resources focus on how the data is processed. This article is about maintaining a unified data-science technology stack.

An original version of the text was published on the Kaggle blog. The main idea is to use Docker containers as a central place for storing all the necessary libraries and tools. Later on, such a container can be exported and imported across environments, ensuring that everything works exactly the same.
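For instance, with plain Docker commands this export/import round-trip might look like the following sketch (the image name anticipates the kaggle/python image used below):

```shell
# On the source machine: snapshot the image to a portable tarball
docker save kaggle/python | gzip > kaggle-python.tar.gz

# On the target machine: restore the exact same stack
gunzip -c kaggle-python.tar.gz | docker load
```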

These are alternative commands that can be used to spin up the container.

Mind the differences from the original article:

  • fitted to work on *nix systems,
  • no explicit Docker Machine is needed,
  • ability to render graphics (plots, etc.).


  • Make sure that the Docker Engine is installed.
  • Download the kaggle/python image (note that it is nearly 8 GB).

  • At the end of ~/.bashrc add new aliases:

  • Reload the shell
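
The commands themselves are not preserved in this copy; a set along the lines of the original Kaggle post might look like this (the function names and the X11/DISPLAY forwarding for plot rendering are assumptions):

```shell
# Pull the image first (nearly 8 GB):
docker pull kaggle/python

# ~/.bashrc additions -- run Python, IPython, or Jupyter from the
# current directory inside the container:
kpython() {
  docker run -v "$PWD":/tmp/working -w=/tmp/working \
    -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY="$DISPLAY" \
    --rm -it kaggle/python python "$@"
}
ikpython() {
  docker run -v "$PWD":/tmp/working -w=/tmp/working \
    --rm -it kaggle/python ipython
}
kjupyter() {
  docker run -v "$PWD":/tmp/working -w=/tmp/working -p 8888:8888 \
    --rm -it kaggle/python jupyter notebook --no-browser \
    --ip="0.0.0.0" --notebook-dir=/tmp/working
}

# Reload the shell:
source ~/.bashrc
```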

These three commands spin up disposable containers that execute the desired Python tasks.

Happy coding.

Securing a Docker container with HTTP Basic Auth


At a certain stage of developing a product, we want to make it publicly visible. Naturally, access needs to be restricted to privileged visitors only. You might consider options like:

  • implementing custom authentication within the system,
  • configuring a server to act as a proxy between the user and the application,
  • limiting access to certain IP addresses.

We will also follow the good practice of keeping infrastructure as code. There are many ways to provision the server: Chef, Puppet, Ansible, or Docker.

This article presents the steps needed to secure a container exposing a public port, using an extra nginx container acting as a proxy.

Docker 1.9+ and Docker Compose are required.


Begin with the docker-compose.yml for the exemplary demo application:
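The compose file itself is not reproduced in this copy; a minimal sketch in the v1 format (matching the Docker 1.9 era of the article) could be as follows. The training/webapp image is an assumption; any app answering on port 5000 will do:

```yaml
# docker-compose.yml -- single demo service
webapp:
  image: training/webapp   # assumed demo image; serves "Hello world!" on 5000
  ports:
    - "80:5000"            # host port 80 -> container port 5000
```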

You can start the application with docker-compose up -d and then open it in a web browser. You should see "Hello World".

The architecture is shown in the diagram below. In this case, Docker forwards the container's internal port 5000 to port 80 on the host.

Nginx proxy

To secure the web-app we are going to:

  • remove the port mapping from the web-app (it won't be directly accessible),
  • add an extra nginx container with a custom configuration (to proxy all traffic),
  • connect nginx with the web-app using the networking feature introduced in Docker 1.9.

The new architecture can be expressed as follows:

Before moving on, stop the previously started web-app container, create a directory called nginx, and modify the docker-compose.yml to match the following snippet:
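The snippet is missing from this copy; a sketch consistent with the description (image names assumed, volume paths match the nginx directory created above) would be:

```yaml
# docker-compose.yml -- webapp no longer exposes a host port;
# nginx is the only public entry point
webapp:
  image: training/webapp   # assumed demo image

nginx:
  image: nginx
  ports:
    - "80:80"
  volumes:
    - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    - ./nginx/.htpasswd:/etc/nginx/.htpasswd:ro
```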

User credentials

Data about the users allowed to access the web-app will be stored in a .htpasswd file.

Let's create credentials for the admin user:
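The exact command is not preserved here; one portable way, using openssl (an assumption; htpasswd from apache2-utils works equally well), is:

```shell
mkdir -p nginx
# Generate an Apache MD5 (apr1) hash for user "admin"; replace s3cret.
printf 'admin:%s\n' "$(openssl passwd -apr1 s3cret)" > nginx/.htpasswd
cat nginx/.htpasswd
```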

The command creates a file containing the encrypted user credentials.

Proxy configuration

Inside the nginx directory, create a file named nginx.conf and paste the following content:
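The original file is not reproduced in this copy; a configuration along these lines would match the description (with container networking, the webapp service is reachable by its name; exact directives are an assumption):

```nginx
# nginx/nginx.conf -- proxy with HTTP Basic Auth
events {
  worker_connections 1024;
}

http {
  server {
    listen 80;

    location / {
      auth_basic           "Restricted";
      auth_basic_user_file /etc/nginx/.htpasswd;
      proxy_pass           http://webapp:5000;
    }
  }
}
```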

The configuration proxies all traffic arriving at port 80 to the webapp host on port 5000 (reachable thanks to container networking). HTTP Basic Auth is also declared and configured here.

Run the application

Make sure that the directory structure looks like this:
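The expected layout (assuming the file names mentioned in the text):

```
.
├── docker-compose.yml
└── nginx
    ├── .htpasswd
    └── nginx.conf
```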

If everything looks good, spin up the infrastructure by typing docker-compose --x-networking up -d.

Now you can access the website only after providing proper credentials.

How to perform networking match-making?


Networking events are becoming more and more popular. People attend them expecting to find someone capable of addressing their needs.

The problem of telling who should talk to whom is relatively easy when the number of attendees is low. In this case, the organizers can do the matching manually. However, when the number of people (or the variety of skills) increases, things begin to get complicated.

Desired scenario

Let's consider the following scenario:

Before an event, each person fills in a form describing their skills and needs.

For example:

Later on, each person receives information telling them which people are especially valuable to talk to.


The main question is: "Who should each person talk to?"

To estimate the difficulty of the problem, let's assume that there are n people and each one receives only three recommendations (k = 3).

This gives a total number of combinations expressed as:

n \cdot C_{n-1}^{k}

That can be estimated as follows:
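A short script (Python used here for illustration) reproduces the estimate n \cdot C_{n-1}^{k} for a few values of n:

```python
from math import comb  # Python 3.8+

# Total number of ways to hand each of n participants k recommendations:
# n * C(n-1, k)
def total_combinations(n, k=3):
    return n * comb(n - 1, k)

for n in (20, 50, 100):
    print(n, total_combinations(n))
# 20  ->     19,380
# 50  ->    921,200
# 100 -> 15,684,900
```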

Clearly, the number of possible combinations grows rapidly with the number of participants.

In a naive (brute-force) approach, each combination would have to be evaluated and compared with the rest, which means a lot of computation.

Additionally, to compare solutions (to know whether they are good or bad), they have to be measured somehow. This leads to the concept of a fitness function.

Fitness function

A fitness function (f) is generally a function taking a possible combination as an argument and returning a numeric value.

For example, a fitness function for a very bad combination (lots of mismatches) will return a very low score.

In our case, the function returns values in the range from 0 (worst) to 1 (best):

0 \leq f(combination) \leq 1

In this case, a fitness function will check two things:

  • do the matched candidates have skills matching the person's needs?
  • how often is the person being analyzed recommended to others?

The latter condition deals with the problem of rare competencies: there may be a person with rare skills whom lots of others need.

Finally, each condition has a weight assigned to it. Fulfilling needs is weighted more heavily than popularity.

 f = 0.7 \cdot needsFulfilled + 0.3 \cdot popularity
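As an illustration, this scoring can be sketched in Python (the data layout and helper names are assumptions; the article's actual implementation is in Groovy):

```python
# Sketch of the weighted fitness for a single participant.
# Skills and needs are assumed to be sets of strings.
def needs_fulfilled(needs, matches):
    """Fraction of this person's needs covered by at least one match."""
    if not needs:
        return 1.0
    covered = {need for need in needs
               for partner_skills in matches
               if need in partner_skills}
    return len(covered) / len(needs)

def fitness(needs, matches, popularity):
    # popularity is assumed to be normalized to [0, 1]
    return 0.7 * needs_fulfilled(needs, matches) + 0.3 * popularity

# One need out of two is covered, average popularity:
score = fitness({"python", "devops"},
                [{"python", "sql"}, {"marketing"}],
                popularity=0.5)
print(score)  # 0.7 * 0.5 + 0.3 * 0.5
```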

Having a way of evaluating solutions, we can proceed with the algorithm.


To approach this problem, a variation of an evolutionary algorithm (EA) will be used. EAs are metaheuristics based on the Darwinian principles of evolution.

Intuitively, their workflow looks as follows:

Description of process:

  1. Randomly assign 3 people to each participant.
  2. Randomly pick two participants.
  3. Make the connection if the second one is interesting to the first one (cross-over).
  4. Randomly pick two participants and connect them (mutation, happens very rarely).
  5. Go to step 2.
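
The linked Groovy sources implement the full algorithm; a minimal Python sketch of the loop above, under simplified assumptions (fitness reduced to need coverage, illustrative data layout), could look like this:

```python
import random

random.seed(42)  # reproducible runs
K = 3            # recommendations per participant

def random_solution(n, k=K):
    # Step 1: randomly assign k people to each participant.
    return {i: random.sample([j for j in range(n) if j != i], k)
            for i in range(n)}

def solution_fitness(solution, skills, needs):
    # Average fraction of each participant's needs covered by their matches.
    total = 0.0
    for i, matches in solution.items():
        if not needs[i]:
            total += 1.0
            continue
        covered = {nd for nd in needs[i] for j in matches if nd in skills[j]}
        total += len(covered) / len(needs[i])
    return total / len(solution)

def evolve(skills, needs, generations=1000, mutation_rate=0.01):
    n = len(skills)
    solution = random_solution(n)
    best = solution_fitness(solution, skills, needs)
    for _ in range(generations):
        a, b = random.sample(range(n), 2)  # step 2: pick two participants
        if b in solution[a]:
            continue
        slot = random.randrange(K)
        old = solution[a][slot]
        solution[a][slot] = b              # step 3: tentative cross-over
        new = solution_fitness(solution, skills, needs)
        if new >= best or random.random() < mutation_rate:
            best = new                     # keep (mutation may accept worse)
        else:
            solution[a][slot] = old        # revert the change
    return solution, best

# Tiny illustrative population: everyone needs a skill someone else has.
skills = [{"python"}, {"sql"}, {"devops"}, {"design"}, {"python"},
          {"sql"}, {"devops"}, {"design"}, {"python"}, {"sql"}]
needs  = [{"sql"}, {"python"}, {"design"}, {"devops"}, {"sql"},
          {"python"}, {"design"}, {"devops"}, {"sql"}, {"python"}]
solution, final_fitness = evolve(skills, needs)
print(round(final_fitness, 2))
```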

Each pass through steps 2-5 is called an epoch or generation. An epoch is represented by a candidate solution (the list of all participants with their matches), which can also be scored with the fitness function (averaged over all individuals).

In step 3, we randomly select two people (cross-over) and try to match the second one to the first one. The candidate is tried as the first, second, or third match, and in each case the overall fitness function is calculated. If a solution better than the current one is found, the matching is performed.

Step 4 represents a mutation: a very rare possibility of accepting a worse solution. The main idea is to introduce some diversity into the solution.

Each generation is evolved with a cross-over followed by a mutation:

The whole source code (written in Groovy) is available here.


The testing was performed in 3 cases:

  • 20 people,
  • 50 people,
  • 100 people

In each one, the initial population (skills, needs, and matches) was randomly generated. In a real-world scenario, you would have to load this data from other sources (like an Excel, DB, or CSV file). Also, there was an upper limit of 5000 generations (all cases took about 5 seconds to complete).

Looking at the plot, you can see some interesting facts:

  1. All cases start from a random solution, which is generally bad (the worst one is for 50 participants, f \approx 0.52 ),
  2. All cases get very close to the perfect matching after 3000 generations,
  3. There was one case of mutation (the red line drop near the 5000th generation),
  4. Learning is slower for a greater number of participants.


This experiment shows that evolutionary algorithms provide a very efficient way of solving match-making problems. They are easy to implement and easily extended with custom restrictions and limitations.

Using an EA can also provide extra value by generating online recommendations during the event itself. It is common for some participants to be absent, which disrupts others' expectations.