UPDATED: 23 Apr. 2017
Below is a list of lesser-known Python libraries that can add tremendous value to your data projects.
Github stats: 84 watchers, 1004 stars, 155 forks
LIME (Local Interpretable Model-agnostic Explanations) aims to reveal why a black-box classifier made a particular prediction.
At the moment, explanations are available for text and tabular data (with continuous features, discrete features, or a mix of both). The project is evolving constantly, so you can expect many more improvements over time.
All you need to provide is an algorithm that outputs a probability for each class.
Just watch the promo-video for the project (2:55 min):
Github stats: 19 watchers, 140 stars, 43 forks
Yellowbrick is a collection of handy tools for visualizing machine learning concerns such as feature selection, model selection and evaluation, and parameter tuning.
There are about 19 distinct tools available, ranging from simple boxplots to grid-search heat maps.
It is, of course, designed to play nicely with the Scikit-learn package.
Github stats: 10 watchers, 176 stars, 18 forks
As the docs put it, Traces aims to make it simple to manipulate, transform, and analyze unevenly spaced time series.
It offers some very handy helper functions that simplify analysis, such as getting distributions for each day of the week or transforming a series into evenly spaced events (e.g. for forecasting).
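To make the "evenly spaced" idea concrete, here is a minimal standard-library sketch. It does not use the Traces API. The helper name `resample_previous` and the sample readings are made up for illustration, and it assumes the simplest strategy: each grid point takes the most recent measurement at or before it.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def resample_previous(points, start, end, step):
    """Sample an unevenly spaced series on a regular grid.

    `points` is a list of (timestamp, value) pairs sorted by timestamp.
    Each grid point takes the most recent value measured at or before it,
    or None if nothing has been measured yet.
    """
    times = [t for t, _ in points]
    grid = []
    current = start
    while current <= end:
        idx = bisect_right(times, current) - 1  # last measurement <= current
        value = points[idx][1] if idx >= 0 else None
        grid.append((current, value))
        current += step
    return grid


# Hypothetical irregular readings, e.g. a sensor that reports only on change.
readings = [
    (datetime(2017, 4, 23, 9, 0), 0),
    (datetime(2017, 4, 23, 9, 7), 1),
    (datetime(2017, 4, 23, 9, 31), 0),
]

even = resample_previous(
    readings,
    start=datetime(2017, 4, 23, 9, 0),
    end=datetime(2017, 4, 23, 9, 45),
    step=timedelta(minutes=15),
)
for timestamp, value in even:
    print(timestamp.time(), value)
```

The result is one value every 15 minutes, which is the shape most forecasting tools expect as input.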
Github stats: 41 watchers, 878 stars, 63 forks
Quiver is a kick-ass tool for doing interactive visualization of Keras convolutional network features.
The way it works is that you build a Keras model and feed it into Quiver. Then, with just one line of code, you start an embedded web server with the app (built with React and Redux) and open it in your browser.
Watch the video to see how to explore layer activations on different images (1:47 min):
Github stats: 30 watchers, 526 stars, 36 forks
If you have done data analysis in R with the dplyr package and later switched to Python, you probably know the pain of losing such convenient piping.
Dplython aims to provide the same functionality for pandas DataFrames as dplyr does in R.
Just see what's possible:
The library makes it possible to perform "pipeline-able" operations by creating special function decorators. You can read more about this here.
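To give a feel for how a decorator can make a function "pipeline-able", here is a small standard-library sketch. It is not Dplython's actual implementation or API. The names `Pipeable`, `pipeable`, `keep`, `add_column` and `top`, as well as the sample records, are hypothetical, and plain lists of dicts stand in for data frames.

```python
import functools


class Pipeable:
    """Wraps a function call so it can sit on the right-hand side of `>>`."""

    def __init__(self, func, args, kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Evaluated for `data >> step` when `data` itself does not define `>>`.
        return self.func(data, *self.args, **self.kwargs)


def pipeable(func):
    """Decorator: calling the function returns a deferred, pipe-ready step."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return Pipeable(func, args, kwargs)

    return wrapper


@pipeable
def keep(rows, predicate):
    return [row for row in rows if predicate(row)]


@pipeable
def add_column(rows, name, compute):
    return [{**row, name: compute(row)} for row in rows]


@pipeable
def top(rows, n):
    return rows[:n]


# Made-up records used only for this illustration.
records = [
    {"cut": "Ideal", "carat": 0.23, "price": 326},
    {"cut": "Premium", "carat": 0.21, "price": 326},
    {"cut": "Good", "carat": 0.31, "price": 335},
]

result = (
    records
    >> keep(lambda row: row["carat"] > 0.22)
    >> add_column("price_per_carat", lambda row: round(row["price"] / row["carat"]))
    >> top(1)
)
print(result)
```

The trick is that each decorated call returns a deferred step instead of running immediately, and the reflected `>>` operator supplies the data once it arrives from the left.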
Github stats: 70 watchers, 1745 stars, 118 forks
TSFRESH stands for "Time Series Feature extraction based on scalable hypothesis tests".
The beauty of this project is that it can automatically extract about 100 (!) different features from a signal.
To avoid duplicated or irrelevant features, TSFRESH applies a filtering procedure that evaluates the explanatory power and importance of each characteristic for the regression or classification task.
Github stats: 117 watchers, 3853 stars, 314 forks
Arrow is a library that provides an impressive user experience for working with dates and times.
Even though Python already ships with several modules for the same purpose, you can probably do the job faster, cleaner, and more simply with Arrow.
The library is inspired by the famous moment.js.
To learn more about it read the docs.
Github stats: 136 watchers, 1728 stars, 251 forks
TPOT uses genetic algorithms to automatically create and optimize machine learning pipelines. It will explore thousands of possibilities and get back to you with the best one.
To show you this magic, I have prepared a short (3:50 min) video (loading a Kaggle dataset, then configuring and training the app for 60 minutes). Click if you're curious to see what happens.
It can be used either as a CLI or from Python code. All you need to do is prepare some good-quality data and write a little script to start the computation (see examples). After some time (or a number of iterations), the script stops and gives you a Python snippet (based on Sklearn) with the best configuration found.
Github stats: 30 watchers, 381 stars, 50 forks
PandaSQL allows you to query Pandas DataFrames using SQL syntax.
First, you load the DataFrame you are interested in into the PandaSQL engine. Then you enter an SQL query and obtain the results. You can use features such as grouping, sub-queries, various kinds of joins, etc.
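The load-then-query flow can be sketched with nothing but the standard library. The snippet below does not use PandaSQL itself: it assumes an SQLite backend and uses Python's built-in `sqlite3` module, with plain tuples standing in for DataFrame rows. The table names and sample data are made up for illustration.

```python
import sqlite3

# Hypothetical sample data standing in for two DataFrames.
customers = [(1, "Ann"), (2, "Bob"), (3, "Eve")]
orders = [(1, 20.0), (1, 35.0), (2, 12.5)]

conn = sqlite3.connect(":memory:")

# Step 1: load the tables into the SQL engine.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)

# Step 2: run a query that combines a LEFT JOIN with grouping.
query = """
    SELECT c.name, COUNT(o.amount) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()
print(totals)
```

The `LEFT JOIN` keeps customers without any orders, so they show up with a count of zero instead of disappearing from the result.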
Github stats: 50 watchers, 707 stars, 119 forks
Auto-sklearn is an automated machine learning toolkit.
It works similarly to TPOT, but instead of using genetic algorithms, Auto-sklearn leverages recent advances in Bayesian optimization, meta-learning, and ensemble construction.
Caution: the author warns that the package probably won't work on Windows and macOS.
Github stats: 26 watchers, 688 stars, 60 forks
An intuitive library to add plotting functionality to scikit-learn objects.
Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought.
Although its name suggests tight coupling with the Scikit-learn library, it's flexible enough to work with other APIs as well.
To help you experiment and play with some of these libraries I have prepared a Docker image (see the Dockerfile to know what's included).
It can be used to run scripts or to serve as a remote interpreter.
To download and get into the console just type:
docker run --rm -ti parrotprediction/docker-ds-python-libs
If you know of other hidden gems for data analysis in Python, post them in a comment. I will be happy to review them and add them to the list.