Hypothesis Testing for Gangsters

Okay. Okay. OKAY. Look. I know you have a problem. You've been screwed by someone and now want your money back. Totally agree.

But first take a deep breath and relax - you don't want to get into bigger trouble. Let's do it another way. I want to help you go one step further and do it like a PRO. And believe me, this makes a huge difference.

So go, grab your drink, and read these 5 tips.

How to do this

Read each step carefully. At the end of each step, you will find what you should have after completing it.

  1. Formulate the hypothesis you want to validate
    A null hypothesis (H_{0}) is the statement we want to test. Unless we find sufficient evidence against it, there is no reason to reject it.

    A drug dealer states that his cocaine is 90% pure.

    The null hypothesis is (H_{0}\colon\ p = 0.9).

    An alternative hypothesis (H_{1}) is the statement we accept in its place if the null hypothesis gets discarded.

    A customer doubts the drug's purity. He states that it contains more than 10% additives. The alternative hypothesis can be (H_{1}\colon\ p < 0.9).

    After this step, you should have formulated (H_{0}) and (H_{1})

  2. Choose the test statistic
    Our overall aim is to test the null hypothesis. We have to assume that it is true and then look for arguments to demolish it. Yeah.

    In more scientific terms, we have to come up with the probability distribution of the test statistic under the assumption that the null hypothesis is correct.

    A customer bought 15 decks of a drug. After hosting a big party he realized that ONLY 11 decks met the norm guaranteed by the dealer (this number is the test statistic). Remembering the wise words of the dealer, he treats every deck as meeting the norm with probability 0.9, so his test distribution can be (X \sim B(15; 0.9)), where (X) is the number of decks that meet the norm. Someone will have a problem.

    After this step, you should have figured out the test statistic (based on the experiment) and the test distribution.
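To make the test distribution less abstract, here is a minimal Python sketch that tabulates (X \sim B(15; 0.9)), i.e. how many good decks out of 15 the customer could expect if the dealer were telling the truth. The helper name `binom_pmf` and the variable names are made up for this illustration; only the standard library is used.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_decks = 15     # decks bought by the customer
p_claimed = 0.9  # dealer's claim, assumed true under H0

# Full test distribution: P(X = k) for k = 0..15 under H0.
distribution = [binom_pmf(k, n_decks, p_claimed) for k in range(n_decks + 1)]

for k, prob in enumerate(distribution):
    print(f"P(X = {k:2d}) = {prob:.4f}")

# Expected number of good decks if the dealer is honest: n * p = 13.5.
expected = sum(k * prob for k, prob in enumerate(distribution))
print(f"E[X] = {expected:.1f}")
```

An honest dealer would deliver 13 or 14 good decks on average, so 11 already looks suspicious. Whether it is suspicious enough is what the remaining steps decide.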

  3. Choose a critical region (one-tail or two-tail test)
    Right now we have the probability distribution of the test statistic, but we still need to choose for which values the null hypothesis gets rejected (critical region) and for which it does not (acceptance region).

    We use the significance level (\alpha): a probability threshold small enough that, when a result this extreme is less likely than (\alpha) under the null hypothesis, we agree that the null hypothesis gets rejected.

    The customer has chosen a significance level of (\alpha = 5\%), meaning that the critical region (where we reject the null hypothesis) consists of the values (X \le c), where (c) is the largest value for which (P(X \le c) \le 0.05).

    Depending on the form of (H_{1}) we can also specify whether the critical region is one-tailed or two-tailed.

    A one-tailed critical region occurs when the alternative hypothesis is expressed with an inequality. For example, if (H_{1}\colon p < p_{0}), where (p_{0}) is the value claimed by (H_{0}), we should use a left one-tailed critical region, and for (H_{1}\colon p > p_{0}) a right one-tailed one.

    When (H_{1}) is expressed with the (\neq) sign we are dealing with a two-tailed critical region. In this case, the critical region is placed in both tails of the distribution, and each tail corresponds to a probability of (\frac{\alpha}{2}).

    Because the alternative hypothesis is (H_{1}\colon\ p < 0.9), the scammed customer is dealing with a left one-tailed critical region.

    After this step, you should have specified the significance level (\alpha) and know whether the critical region is one-tailed or two-tailed.
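If you want to see where the left-tail critical region actually starts, the sketch below searches for the largest cut-off whose cumulative probability still fits within (\alpha). It assumes the convention "reject when (X \le c) and (P(X \le c) \le \alpha)"; the function names `binom_cdf` and `left_tail_critical_value` are hypothetical helpers written for this example.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def left_tail_critical_value(n: int, p: float, alpha: float):
    """Largest c with P(X <= c) <= alpha, or None if even X = 0 is too likely."""
    critical = None
    for k in range(n + 1):
        if binom_cdf(k, n, p) > alpha:
            break  # the cdf only grows, so no larger k can qualify
        critical = k
    return critical


c = left_tail_critical_value(n=15, p=0.9, alpha=0.05)
print(f"Reject H0 when X <= {c}")  # Reject H0 when X <= 10
```

For a two-tailed test the same search would be run from both ends of the distribution with (\frac{\alpha}{2}) on each side.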

  4. Calculate the probability (p-value)
    The p-value is the probability of getting the observed result (or a worse one), assuming that the null hypothesis is true.

    Its value depends on two things:
    • the form of the alternative hypothesis (H_{1}) (one or two tails),
    • the value of the test statistic (evaluated against the test distribution)

    In the case of our customer the test statistic is 11 (decks of pure drug) and the critical region is located in the left tail. The formula for the p-value is (P(X \le 11)), because "the same or worse" includes 11 itself. Taking into consideration (X \sim B(15; 0.9)), its value is (P(X \le 11) \approx 0.0556). To calculate this he used this snippet.

    After this step, you should obtain p-value
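The customer's original snippet is not reproduced here, so what follows is only one possible way to get the number: a standard-library Python sketch that sums the binomial probabilities of 11 or fewer good decks. The function name `left_tail_p_value` is an assumption of this example.

```python
from math import comb


def left_tail_p_value(observed: int, n: int, p: float) -> float:
    """P(X <= observed) for X ~ B(n, p): the observed result or anything worse."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(observed + 1)
    )


p_value = left_tail_p_value(observed=11, n=15, p=0.9)
print(f"p-value = {p_value:.4f}")  # about 0.0556
```

Note that the sum runs up to and including 11. Stopping at 10 would give roughly 0.013 and a very different ending to this story.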

  5. Make a decision
    In this last step, we finally decide whether the null hypothesis gets rejected or not (i.e. whether the dealer's claim survives).

    The null hypothesis gets rejected if the test statistic falls into the critical region. This is the same as saying that the p-value is no greater than the significance level: (H_{0}) gets rejected if (P_{value} \le \alpha).

    The customer has to give up on his hypothesis (H_{1}). In this case the p-value ((P_{value} \approx 0.0556)) is greater than the significance level ((\alpha = 0.05)), which means there is not enough evidence against the drug dealer's claim, so (H_{0}) stands. DAMN.

    After this step you finally know if there are reasons to reject (H_{0})
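The decision rule itself fits in a few lines. This is a minimal sketch assuming the convention "reject when the p-value is at most (\alpha)"; the function name `decide` and the 10% level in the second call are hypothetical, added only to show the rule flipping.

```python
def decide(p_value: float, alpha: float) -> str:
    """Compare the p-value with the significance level."""
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


# Numbers from the customer's case: p-value of about 0.0556, alpha = 5%.
print(decide(0.0556, 0.05))  # fail to reject H0

# The same evidence judged at a looser, hypothetical 10% level.
print(decide(0.0556, 0.10))  # reject H0
```

The wording "fail to reject" is deliberate: the test never proves the dealer honest, it only says the evidence was too weak to call him a liar.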

Q&A

Question: What value of significance level should I choose?

Answer: It all depends on how sure you want to be that you are not making a mistake when rejecting a null hypothesis. The significance level is the highest probability you are willing to accept of rejecting a null hypothesis that is in fact true. For example, choosing (\alpha = 1\%) gives you more certainty that your decision to reject (H_{0}) was correct than (\alpha = 5\%).
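To put numbers on that trade-off, the sketch below takes the 15-deck example and computes, for each significance level, the chance of accusing an honest dealer, i.e. of rejecting (H_{0}\colon p = 0.9) although it is true. The function names are made up for this illustration, and the rule "reject when the left-tail p-value is at most (\alpha)" is an assumption.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def chance_of_false_alarm(n: int, p0: float, alpha: float) -> float:
    """Probability of rejecting H0: p = p0 although it is true."""
    # Outcomes whose left-tail p-value is at most alpha lead to rejection.
    rejected = [k for k in range(n + 1) if binom_cdf(k, n, p0) <= alpha]
    # The cdf is increasing, so the rejected outcomes are 0..max(rejected).
    return binom_cdf(max(rejected), n, p0) if rejected else 0.0


for alpha in (0.05, 0.01):
    risk = chance_of_false_alarm(15, 0.9, alpha)
    print(f"alpha = {alpha:.2f} -> chance of a false alarm = {risk:.4f}")
```

Because the binomial distribution is discrete, the real risk stays somewhat below the chosen (\alpha) in both cases. The price of the stricter level is that the dealer has to cheat harder before you can call him out.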

Summary

I have to admit it. I'm a bit scared. You have received a powerful tool. A tool that helps you prove that you're RIGHT in many cases.

But please, remember about others who still might need some help. Share it with them, and make them your debtors.

Reproducible infrastructure for Data Scientists

You should always care about the reproducibility of your data analysis. A lot of related resources focus on how the data is processed. This article is about maintaining a unified data science technology stack.

An original version of the text was published on the Kaggle blog. The main idea is to use Docker containers as a central place for storing all necessary libraries and tools. Later on, such a container can be exported and imported into different environments, making sure that everything works exactly the same.

These are alternative commands that can be used to spin up the container.

Mind the differences compared to the original:

  • adapted to work on *nix systems,
  • no explicit Docker Machine is needed,
  • ability to render graphics (plots, etc)

Instructions

  • Make sure that the Docker Engine is installed.
  • Download the kaggle/python image (notice that it is nearly 8GB)

  • At the end of ~/.bashrc add new aliases:

  • Reload the shell

Those 3 commands are responsible for spinning up instant containers that execute the desired Python tasks.

Happy coding.

Securing Docker container with HTTP Basic Auth

Overview

At a certain stage of developing a product, we want to make it publicly visible. Naturally, access needs to be restricted to privileged visitors only. You might consider options like:

  • implementing custom authentication within the system,
  • configuring a server to act as a proxy between the user and the application,
  • limiting access to certain IP addresses.

We will also follow the good practice of keeping the infrastructure as code. There are many ways to provision the server: Chef, Puppet, Ansible or Docker.

This article presents the steps needed to secure a container exposing a public port, using an extra nginx container acting as a proxy.

Docker 1.9+ and Docker Compose are required.

Web-app

Begin with the docker-compose.yml for the demo application:

You can start the application with docker-compose up -d and then open it in a web browser. You should see "Hello World".

The architecture is shown in the diagram below. In this case, Docker forwards the container's internal port 5000 to port 80 on the host.

Nginx proxy

To secure the web-app we are going to:

  • remove the port mapping from the web-app (it won't be directly accessible),
  • add extra Nginx container with custom configuration (proxy all traffic),
  • let nginx communicate with the web-app using the networking feature introduced in Docker 1.9.

The new architecture can be expressed as follows:

Before moving on, stop the previously started web-app container, create a directory called nginx and modify the docker-compose.yml to match the following snippet:

User credentials

Data about the users allowed to access the web-app will be stored in a .htpasswd file.

Let's create credentials for the admin user:

The command creates a file with the hashed user credentials.
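The command itself is not shown here, so as an illustration of what such a file contains, below is a small Python sketch that builds a single `user:hash` line using the salted SHA-1 scheme (`{SSHA}`), one of the schemes the nginx documentation lists for `auth_basic_user_file`. The function name `ssha_htpasswd_line` and the password `secret` are assumptions made for this example. Salted SHA-1 is weak by modern standards, so treat this as a look at the file format rather than a recommendation.

```python
import base64
import hashlib
import os
from typing import Optional


def ssha_htpasswd_line(user: str, password: str,
                       salt: Optional[bytes] = None) -> str:
    """Build one .htpasswd line in the {SSHA} scheme.

    The stored value is base64(sha1(password + salt) + salt).
    """
    if salt is None:
        salt = os.urandom(8)  # fresh random salt per user
    digest = hashlib.sha1(password.encode("utf-8") + salt).digest()
    encoded = base64.b64encode(digest + salt).decode("ascii")
    return f"{user}:{{SSHA}}{encoded}"


# Hypothetical credentials, for illustration only.
print(ssha_htpasswd_line("admin", "secret"))
```

The printed line can be written to a .htpasswd file, one line per user. The password never appears in it, only its salted hash.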

Proxy configuration

Inside a directory nginx create a file nginx.conf. Paste the following content:

The configuration makes nginx pass all the traffic it receives on port 80 to port 5000 of the webapp host (visible due to container networking). HTTP Basic Auth is also declared and configured here.

Run the application

Make sure that the directory structure looks like this:

If all looks good, spin up the infrastructure by typing docker-compose --x-networking up -d.

Now you can access the website only after providing proper credentials.
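If you are curious what "providing proper credentials" means on the wire: with HTTP Basic Auth the client sends an `Authorization` header containing the base64-encoded `user:password` pair. The sketch below builds that header in Python; the function name `basic_auth_header` and the credentials `admin` / `secret` are made up for this example.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return the value of the Authorization header for HTTP Basic Auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


# Hypothetical credentials, for illustration only.
print(basic_auth_header("admin", "secret"))  # Basic YWRtaW46c2VjcmV0
```

Base64 is an encoding, not encryption, so anyone who can read the traffic can recover the password. That is worth keeping in mind before exposing the proxy over plain HTTP.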