10,5 Python Libraries for Data Analysis Nobody Told You About

UPDATED: 23 Apr. 2017

Below is a list of little "less popular" Python libraries that can add tremendous value to your data projects.


LIME

Github stats: 84 watchers, 1004 stars, 155 forks

LIME project (local interpretable model-agnostic explanations) is about to reveal what were the motivations for any black-box classifier that picked certain action.

At the moment the reasoning is possible for text and tabular data (with continuous, discrete and both features). The project is constantly evolving and you can expect much, much more improvements over time.

 

All you need to provide is an algorithm that outputs probability for each class.

Just watch the promo-video for the project (2:55 min):


Yellowbrick

Github stats: 19 watchers, 140 stars, 43 forks

Yellowbrick is a collection of tools that are super-handy for visualization of machine learning issues related to feature or model selection/evaluation and parameter's tuning.

There are about 19 distinct tools available, ranging from simple boxplots to grid-search heat maps.

It's of course designed to play nicely with Scikit-learn package.


Traces

Github stats: 10 watchers, 176 stars, 18 forks

As written in the docs Traces aims for making it simple to manipulate, transform and perform analysis of unevenly spaced time series.

It offers some very handy helper functions for simplifying analysis like getting distributions by each day of a week or transforming to evenly spaced events (ie. for doing forecasting).


Quiver

Github stats: 41 watchers, 878 stars, 63 forks

Quiver is a kick-ass tool for doing interactive visualization of Keras convolutional network features.

The way it works it that you need to build and feed a Keras model into Quiver. Then with just one line of code start an embedded web-server with the app (built with React and Redux) and open it in your browser.

Watch the video how to explore layer activations on all the different images (1:47 min):


Dplython

Github stats: 30 watchers, 526 stars, 36 forks

If you have done some data analysis in R using dplyr package and later on switched to Python you probably know the pain of no such convenient piping possibility.

Dplython aims for providing the same functionality for pandas data-frames as dplyr in R.

Just see what's possible:

The library makes it possible to perform "pipeline-able" operations by creating special function decorators. You can read more about this here.


TSFRESH

Github stats: 70 watchers, 1745 stars, 118 forks

TSFRESH stands for Time Series Feature extraction based on scalable hypothesis tests".

The beauty of this project is that it can help you to automatically extract about various 100 (!) features from a signal.

To avoid duplicated or irrelevant features TSFRESH utilizes a filtering procedure evaluating the explaining power and importance of each characteristic for the regression or classification tasks.


Arrow

Github stats: 117 watchers, 3853 stars, 314 forks

Arrow is the library providing an impressive user experience for working with dates and time.

Even though Python is fully equipped with many modules for the same purpose, you probably can do this with Arrow faster, cleaner and simpler.

The library is inspired by famous moment.js.

To learn more about it read the docs.


TPOT

Github stats: 136 watchers, 1728 stars, 251 forks

TPOT utilizes genetic algorithms to automatically create and optimize machine learning pipelines. it will explore thousands of possibilities and get's back to you with the best one.

To show you this magic I have prepared a short (3:50 min) video (loading a Kaggle dataset, configuring and training app for 60 minutes). Click if you're curious to see what will happen.

It can be used both as CLI or within Python code. All you need to do is to prepare some good quality data and write a little script for starting computations (see examples). After some time (or iterations) script stops, providing you Python snippet (based on Sklearn) with the best configuration found.


PandaSQL

Github stats: 30 watchers, 381 stars, 50 forks

PandaSQL allows you to query Pandas DataFrames using SQL syntax.

First, you need to load interesting DataFrame into PandaSQL engine. Then enter SQL query and obtain results. You can use features like grouping, sub-queries, various kind of joins etc.

For more examples see this demo and blog post.


Auto-sklearn

Github stats: 50 watchers, 707 stars, 119 forks

Auto-sklearn is an automated machine learning toolkit.

It works similar to TPOT but instead of using genetic algorithms, Auto-sklearn leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction.

Caution: author warns that the package probably won't work with Windows and mac OS operating systems.


Scikit-plot

Github stats: 26 watcher, 688 stars, 60 forks

An intuitive library to add plotting functionality to scikit-learn objects.

Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought.

Although it's name suggest tight coupling with Scikit-learn library it's flexible enought to work with different APIs as well.


To help you experiment and play with some of these libraries I have prepared a Docker image (see the Dockerfile to know what's included).

It can be used for running scripts or to perform as a remote interpreter.

To download and get into the console just type:

That's all.


If you know about other, hidden gems for doing data analysis in Python post them as the comment - I will be happy to review them and add to list.

Norbert

Let's combine software craftsmanship and data engineering skills results to produce some clean and understandable code.
  • Reii Nakano

    What about Scikit-plot at https://github.com/reiinakano/scikit-plot

    It contains one-liner functions for generating the most common ML plots, reducing boilerplate by a great deal when visualizing.

    • Norbert Kozlowski

      Hi, never heard of it - but will definitely check. Thanks for bringing it out.

      • Reii Nakano

        Oh, the first release has been out for only a little over a week. Disclaimer: I built it

        • peace-data

          Cool useful library! I have just tried it out for KMeans. Got the first chart in 3 minutes after reading this article. Amazing!

  • Max Christlich

    Great list!