10.5 Python Libraries for Data Analysis Nobody Told You About

UPDATED: 23 Apr. 2017

Below is a list of somewhat "less popular" Python libraries that can add tremendous value to your data projects.


LIME

Github stats: 84 watchers, 1004 stars, 155 forks

The LIME project (Local Interpretable Model-agnostic Explanations) is about revealing why a black-box classifier made a particular decision.

At the moment, explanations are possible for text and tabular data (with continuous, discrete or mixed features). The project is constantly evolving, and you can expect many more improvements over time.

 

All you need to provide is a model that outputs a probability for each class.
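
For example, a minimal sketch (the iris dataset and random forest below are just an illustration, not something LIME requires):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from lime.lime_tabular import LimeTabularExplainer

    iris = load_iris()
    model = RandomForestClassifier(n_estimators=100).fit(iris.data, iris.target)

    explainer = LimeTabularExplainer(iris.data,
                                     feature_names=iris.feature_names,
                                     class_names=iris.target_names)

    # explain a single prediction, using the model's predict_proba as the black box
    explanation = explainer.explain_instance(iris.data[0], model.predict_proba, num_features=2)
    print(explanation.as_list())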

Just watch the promo-video for the project (2:55 min):


Yellowbrick

Github stats: 19 watchers, 140 stars, 43 forks

Yellowbrick is a collection of tools that are super handy for visualizing machine learning issues related to feature selection, model selection/evaluation and parameter tuning.

There are about 19 distinct tools available, ranging from simple boxplots to grid-search heat maps.

It is, of course, designed to play nicely with the scikit-learn package.
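
A hedged sketch of one of the visualizers (ClassificationReport), assuming a plain scikit-learn classifier:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from yellowbrick.classifier import ClassificationReport

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    visualizer = ClassificationReport(LogisticRegression())
    visualizer.fit(X_train, y_train)   # train the wrapped model
    visualizer.score(X_test, y_test)   # compute metrics and draw the report
    visualizer.poof()                  # display the plot (show() in newer releases)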


Traces

Github stats: 10 watchers, 176 stars, 18 forks

As stated in the docs, Traces aims to make it simple to manipulate, transform and analyze unevenly spaced time series.

It offers some very handy helper functions that simplify analysis, like getting distributions by day of the week or transforming to evenly spaced observations (e.g. for forecasting).
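
A rough sketch of the core TimeSeries object (the timestamps and values below are made up):

    from datetime import datetime
    import traces

    ts = traces.TimeSeries()
    ts[datetime(2017, 4, 1, 8, 0)] = 12    # unevenly spaced measurements
    ts[datetime(2017, 4, 1, 8, 7)] = 15
    ts[datetime(2017, 4, 1, 9, 30)] = 10

    # share of time the series spent at each value
    print(ts.distribution())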


Quiver

Github stats: 41 watchers, 878 stars, 63 forks

Quiver is a kick-ass tool for doing interactive visualization of Keras convolutional network features.

The way it works is that you build a Keras model and feed it into Quiver. Then, with just one line of code, you start an embedded web server with the app (built with React and Redux) and open it in your browser.
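
A minimal sketch, assuming a trained Keras model and a folder of input images (the VGG16 model and the ./imgs folder are just placeholders):

    from keras.applications.vgg16 import VGG16
    from quiver_engine import server

    model = VGG16(weights='imagenet')

    # starts the embedded web server; open http://localhost:5000 in the browser
    server.launch(model, input_folder='./imgs', port=5000)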

Watch the video showing how to explore layer activations on different images (1:47 min):


Dplython

Github stats: 30 watchers, 526 stars, 36 forks

If you have done some data analysis in R using the dplyr package and later switched to Python, you probably know the pain of losing such a convenient piping facility.

Dplython aims to provide the same functionality for pandas DataFrames as dplyr does in R.

Just see what's possible:
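
A small, hedged sketch using the diamonds sample data that ships with the library:

    from dplython import DplyFrame, X, diamonds, select, sift, arrange, head

    # dplyr-style pipeline on a pandas DataFrame wrapped in a DplyFrame
    result = (diamonds >>
              sift(X.carat > 4) >>
              select(X.carat, X.cut, X.price) >>
              arrange(X.price) >>
              head(5))
    print(result)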

The library makes it possible to perform "pipeline-able" operations by creating special function decorators. You can read more about this here.


TSFRESH

Github stats: 70 watchers, 1745 stars, 118 forks

TSFRESH stands for "Time Series Feature extraction based on scalable hypothesis tests".

The beauty of this project is that it can automatically extract around 100 (!) different features from a signal.

To avoid duplicated or irrelevant features, TSFRESH applies a filtering procedure that evaluates the explanatory power and importance of each characteristic for the regression or classification task at hand.
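
A hedged sketch of the basic workflow (df is assumed to be a long-format DataFrame with 'id', 'time' and 'value' columns, and y a Series with one label per id):

    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    features = extract_features(df, column_id='id', column_sort='time')
    impute(features)                         # replace NaN/inf produced by some calculators
    relevant = select_features(features, y)  # keep only statistically relevant features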


Arrow

Github stats: 117 watchers, 3853 stars, 314 forks

Arrow is a library that provides an impressive user experience for working with dates and times.

Even though Python comes fully equipped with many modules for the same purpose, you can probably do the same things with Arrow faster, cleaner and simpler.
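
A quick, hedged taste of the API:

    import arrow

    utc = arrow.utcnow()
    local = utc.to('Europe/Warsaw')            # timezone conversion in one call
    print(local.format('YYYY-MM-DD HH:mm'))
    print(local.humanize())                    # e.g. 'just now'
    print(arrow.get('2017-04-23T10:30:00'))    # parsing ISO 8601 strings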

The library is inspired by the famous moment.js.

To learn more about it read the docs.


TPOT

Github stats: 136 watchers, 1728 stars, 251 forks

TPOT utilizes genetic algorithms to automatically create and optimize machine learning pipelines. It will explore thousands of possibilities and get back to you with the best one.

To show you this magic I have prepared a short (3:50 min) video (loading a Kaggle dataset, configuring TPOT and training it for 60 minutes). Click if you're curious to see what happens.

It can be used either as a CLI tool or from Python code. All you need to do is prepare some good-quality data and write a small script that starts the computation (see the examples). After some time (or number of iterations) the script stops and provides you with a Python snippet (based on scikit-learn) containing the best configuration found.
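
A hedged sketch of the Python API on a toy dataset (the digits data and the parameters below are arbitrary):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))
    tpot.export('best_pipeline.py')   # dumps the winning scikit-learn pipeline as code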


PandaSQL

Github stats: 30 watchers, 381 stars, 50 forks

PandaSQL allows you to query Pandas DataFrames using SQL syntax.

First, you need to make the DataFrame of interest visible to the PandaSQL engine. Then you enter an SQL query and obtain the results. You can use features like grouping, sub-queries, various kinds of joins, etc.
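
A minimal sketch (the DataFrame below is made up for illustration):

    import pandas as pd
    from pandasql import sqldf

    df = pd.DataFrame({'city': ['Krakow', 'Krakow', 'Warsaw'],
                       'sales': [10, 20, 15]})

    query = "SELECT city, SUM(sales) AS total FROM df GROUP BY city"
    print(sqldf(query, locals()))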

For more examples see this demo and blog post.


Auto-sklearn

Github stats: 50 watchers, 707 stars, 119 forks

Auto-sklearn is an automated machine learning toolkit.

It works similarly to TPOT, but instead of genetic algorithms, Auto-sklearn leverages recent advances in Bayesian optimization, meta-learning and ensemble construction.
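
A hedged sketch (the dataset and the time budget below are arbitrary):

    import autosklearn.classification
    from sklearn.datasets import load_digits
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=300)
    automl.fit(X_train, y_train)
    print(accuracy_score(y_test, automl.predict(X_test)))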

Caution: the author warns that the package probably won't work on Windows or macOS.


Scikit-plot

Github stats: 26 watchers, 688 stars, 60 forks

An intuitive library to add plotting functionality to scikit-learn objects.

Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought.

Although its name suggests tight coupling with the scikit-learn library, it's flexible enough to work with different APIs as well.
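
A hedged sketch using the metrics-style API (older releases exposed a factory-based API instead):

    import matplotlib.pyplot as plt
    import scikitplot as skplt
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    predictions = GaussianNB().fit(X_train, y_train).predict(X_test)
    skplt.metrics.plot_confusion_matrix(y_test, predictions, normalize=True)
    plt.show()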


To help you experiment and play with some of these libraries I have prepared a Docker image (see the Dockerfile to know what's included).

It can be used for running scripts or as a remote interpreter.

To download and get into the console just type:
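
(a generic form of the command; the actual image name comes from the Dockerfile mentioned above)

    # the image name below is a placeholder - substitute the one built from the Dockerfile
    docker run --rm -it <image-name> /bin/bash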

That's all.


If you know about other hidden gems for doing data analysis in Python, post them in the comments - I will be happy to review them and add them to the list.

Reproducible infrastructure for Data Scientists

You should always care about the reproducibility of your data analysis. A lot of related resources focus on how the data is processed. This article is about maintaining a unified data-science technology stack.

The original version of this text was published on the Kaggle blog. The main idea is to use Docker containers as a central place for storing all the necessary libraries and tools. Such a container can later be exported and imported across different environments, making sure that everything works exactly the same.

These are alternative commands that can be used to spin up the container.

Mind the differences from the original article:

  • adjusted to work on *nix systems,
  • no explicit Docker Machine is needed,
  • ability to render graphics (plots, etc.).

Instructions

  • Make sure that the Docker Engine is installed.
  • Download the kaggle/python image (notice that it is nearly 8GB)
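
    docker pull kaggle/python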

  • At the end of ~/.bashrc add new aliases:
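
(a sketch modelled on the aliases from the Kaggle blog post referenced above; the names and exact flags are assumptions, and the X11 options are one common way to let the container render plots on *nix)

    kpython() {
      docker run --rm -it -v "$PWD":/tmp/working -w=/tmp/working \
        -e DISPLAY="$DISPLAY" -v /tmp/.X11-unix:/tmp/.X11-unix \
        kaggle/python python "$@"
    }
    ikpython() {
      docker run --rm -it -v "$PWD":/tmp/working -w=/tmp/working \
        -e DISPLAY="$DISPLAY" -v /tmp/.X11-unix:/tmp/.X11-unix \
        kaggle/python ipython
    }
    kjupyter() {
      docker run --rm -it -v "$PWD":/tmp/working -w=/tmp/working -p 8888:8888 \
        kaggle/python jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/working
    }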

  • Reload the shell
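
    source ~/.bashrc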

Those 3 commands are responsible for spinning up instant containers that execute the desired Python tasks.

Happy coding.

Securing Docker container with HTTP Basic Auth

Overview

At a certain stage of developing a product, we want to make it publicly visible. Naturally, access needs to be restricted to privileged visitors only. You might consider options like:

  • implementing custom authentication within the system,
  • configuring a server to act as a proxy between the user and the application,
  • limiting access to certain IP addresses.

We will also follow the good practice of keeping the infrastructure as code. There are many ways to provision the server - Chef, Puppet, Ansible or Docker.

This article presents the steps needed to secure a container exposing a public port, using an extra nginx container acting as a proxy.

Docker 1.9+ and Docker Compose are required.

Web-app

Begin with the docker-compose.yml for the exemplary demo application:
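
(a minimal sketch in the compose v1 format; the hello-world Flask image training/webapp from the Docker docs stands in for the actual demo app)

    webapp:
      image: training/webapp   # stand-in hello-world app listening on port 5000
      ports:
        - "80:5000"            # internal port 5000 exposed as port 80 on the host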

You can start the application with docker-compose up -d and then open it in a web browser. You should see "Hello World".

The architecture is shown in the diagram below. In this case, Docker forwards the internal port 5000 to port 80 on the host.

Nginx proxy

To secure the web-app we are going to:

  • remove the port mapping from the web-app (it won't be directly accessible),
  • add an extra nginx container with a custom configuration (proxying all traffic),
  • connect nginx with the web-app using the networking feature introduced in Docker 1.9.

The new architecture can be expressed as follows:

Before moving on, stop the previously started web-app container, create a directory called nginx and modify docker-compose.yml to match the following snippet:
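
(again a sketch; mounting the configuration into the official nginx image is one way to do it - the original post may have built a custom nginx image instead)

    webapp:
      image: training/webapp   # stand-in demo app; no ports exposed to the host any more

    nginx:
      image: nginx
      ports:
        - "80:80"
      volumes:
        - ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
        - ./nginx/.htpasswd:/etc/nginx/.htpasswd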

User credentials

Data about users allowed to access the web-app will be stored in a .htpasswd file.

Let's create credentials for the admin user:
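
(one way to do it, using the htpasswd tool from apache2-utils; you will be prompted for a password)

    htpasswd -c nginx/.htpasswd admin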

The command creates a file with the encrypted user credentials.

Proxy configuration

Inside the nginx directory create a file called nginx.conf and paste the following content:
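
(a minimal sketch consistent with the setup above; the exact directives in the original may differ)

    server {
        listen 80;

        location / {
            auth_basic "Restricted";                    # enable HTTP Basic Auth
            auth_basic_user_file /etc/nginx/.htpasswd;  # credentials created earlier
            proxy_pass http://webapp:5000;              # forward to the web-app container
        }
    }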

The configuration forwards all traffic arriving on port 80 to the webapp host on port 5000 (reachable thanks to container networking). HTTP Basic Auth security is also declared and configured here.

Run the application

Make sure that the directory structure looks the same:
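
(reflecting the files created in the sketches above)

    .
    ├── docker-compose.yml
    └── nginx
        ├── .htpasswd
        └── nginx.conf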

If all looks good, spin up the infrastructure by typing docker-compose --x-networking up -d.

Now you can access the website only after providing proper credentials.