Reproducible research and explaining predictions of any classifier

Recently I had the pleasure of giving two talks at the PyData Wrocław meetup - about reproducible data science and about explaining predictions of any classifier using the LIME project. The meetup takes place every month, giving attendees a chance to discuss issues they encounter in their projects or simply to share knowledge.

 

Reproducible data science

Practical approach

Assuring reproducibility is one of the most important issues in any scientific project. See what techniques and tools you can use on a daily basis.

Why have you done this to me?

Explaining predictions of any classifier

Very often it's nearly impossible to explain the decision made by a black-box classifier. But there is a new open source library solving this problem (LIME). Learn what's possible by seeing it in action.

Video

You can watch the whole presentation below (28:12):

You can download the notebook and data files used in the examples here.

 

10.5 Python Libraries for Data Analysis Nobody Told You About

UPDATED: 23 Apr. 2017

Below is a list of slightly less popular Python libraries that can add tremendous value to your data projects.


LIME

Github stats: 84 watchers, 1004 stars, 155 forks

The LIME project (Local Interpretable Model-agnostic Explanations) aims to reveal the motivations behind the decisions of any black-box classifier.

At the moment the reasoning is possible for text and tabular data (with continuous, discrete or mixed features). The project is constantly evolving and you can expect many more improvements over time.

 

All you need to provide is an algorithm that outputs a probability for each class.
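To give you a feeling for the API, here is a minimal sketch of explaining a single tabular prediction; `X_train`, `X_test`, `feature_names`, `class_names` and a fitted `clf` exposing `predict_proba` are placeholders you would supply yourself:

```python
from lime.lime_tabular import LimeTabularExplainer

# X_train, feature_names, class_names and a fitted clf are assumed to exist
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=class_names,
    discretize_continuous=True,
)

# explain a single prediction using the classifier's predict_proba
explanation = explainer.explain_instance(X_test[0], clf.predict_proba, num_features=5)
print(explanation.as_list())  # (feature, weight) pairs behind the decision
```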

Just watch the promo-video for the project (2:55 min):


Yellowbrick

Github stats: 19 watchers, 140 stars, 43 forks

Yellowbrick is a collection of tools that are super-handy for visualizing machine learning issues related to feature or model selection, model evaluation and parameter tuning.

There are about 19 distinct tools available, ranging from simple boxplots to grid-search heat maps.

It is, of course, designed to play nicely with the Scikit-learn package.
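As a quick taste, here is a minimal sketch of one of those tools, the classification report heat map (`X` and `y` are an assumed feature matrix and label vector; in older releases the final call is `poof()` rather than `show()`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport

# X, y are assumed to be an existing feature matrix and label vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

visualizer = ClassificationReport(LogisticRegression())
visualizer.fit(X_train, y_train)   # fit the underlying model
visualizer.score(X_test, y_test)   # compute precision/recall/F1 per class
visualizer.poof()                  # draw the heat map (show() in newer releases)
```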


Traces

Github stats: 10 watchers, 176 stars, 18 forks

As stated in the docs, Traces aims to make it simple to manipulate, transform and analyze unevenly spaced time series.

It offers some very handy helper functions that simplify analysis, like getting distributions for each day of the week or converting to evenly spaced measurements (e.g. for doing forecasting).
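Here is a minimal sketch of the basic idea, building an unevenly spaced series and asking for the distribution of its values over a window (toy timestamps, following the pattern shown in the project's docs):

```python
from datetime import datetime

import traces

# build an unevenly spaced series of readings (toy data)
ts = traces.TimeSeries()
ts[datetime(2017, 1, 1, 6, 0)] = 0
ts[datetime(2017, 1, 1, 7, 45)] = 1
ts[datetime(2017, 1, 1, 9, 10)] = 0

# distribution of values, weighted by how long each value lasted
print(ts.distribution(start=datetime(2017, 1, 1, 6, 0),
                      end=datetime(2017, 1, 1, 10, 0)))
```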


Quiver

Github stats: 41 watchers, 878 stars, 63 forks

Quiver is a kick-ass tool for doing interactive visualization of Keras convolutional network features.

The way it works is that you build a Keras model and feed it into Quiver. Then, with just one line of code, you start an embedded web server with the app (built with React and Redux) and open it in your browser.
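A minimal sketch of that workflow might look like the following; the exact launch arguments can differ between versions, so treat it as an illustration rather than the definitive call:

```python
from keras.applications.vgg16 import VGG16
from quiver_engine import server

model = VGG16(weights='imagenet')  # any built Keras model works here

# starts the embedded web app; open the reported localhost port in your browser
server.launch(model, input_folder='./imgs')
```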

Watch the video showing how to explore layer activations on different images (1:47 min):


Dplython

Github stats: 30 watchers, 526 stars, 36 forks

If you have done some data analysis in R using the dplyr package and later switched to Python, you probably know the pain of losing its convenient piping syntax.

Dplython aims to provide the same functionality for pandas DataFrames as dplyr provides in R.

Just see what's possible:
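Here is a small sketch using the diamonds dataset that ships with the library:

```python
from dplython import (DplyFrame, X, diamonds, select, sift,
                      group_by, summarize, head)

# diamonds comes with dplython as a ready-made DplyFrame;
# wrap your own data with DplyFrame(some_pandas_df) to do the same
print(diamonds >> select(X.carat, X.cut, X.price) >> head(5))

# filter rows and aggregate, dplyr-style
print(diamonds >>
      sift(X.carat > 4) >>
      group_by(X.cut) >>
      summarize(mean_price=X.price.mean()))
```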

The library makes it possible to perform "pipeline-able" operations by creating special function decorators. You can read more about this here.


TSFRESH

Github stats: 70 watchers, 1745 stars, 118 forks

TSFRESH stands for "Time Series Feature extraction based on Scalable Hypothesis tests".

The beauty of this project is that it can automatically extract around 100 (!) different features from a signal.

To avoid duplicated or irrelevant features, TSFRESH utilizes a filtering procedure that evaluates the explanatory power and significance of each characteristic for the regression or classification task at hand.
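A minimal sketch of that workflow, assuming `df` is a long-format DataFrame with `id`, `time` and `value` columns and `y` is the target vector indexed by `id`:

```python
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

# df is an assumed long-format DataFrame with columns: id, time, value
features = extract_features(df, column_id='id', column_sort='time')

impute(features)                          # replace NaN/inf produced by some calculators
selected = select_features(features, y)   # keep only statistically relevant features
```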


Arrow

Github stats: 117 watchers, 3853 stars, 314 forks

Arrow is a library that provides an impressive user experience for working with dates and times.

Even though Python is already equipped with several modules for this purpose, with Arrow you can probably do the same thing faster, cleaner and more simply.

The library is inspired by famous moment.js.

To learn more about it read the docs.
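A few one-liners give the flavour (parsing, converting time zones, humanizing):

```python
import arrow

utc = arrow.utcnow()
local = utc.to('Europe/Warsaw')           # convert between time zones
print(local.format('YYYY-MM-DD HH:mm'))   # flexible formatting
print(local.shift(days=-3).humanize())    # "3 days ago" (shift() in current releases)
print(arrow.get('2017-04-23T10:30:00'))   # parse ISO-8601 strings
```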


TPOT

Github stats: 136 watchers, 1728 stars, 251 forks

TPOT utilizes genetic algorithms to automatically create and optimize machine learning pipelines. It will explore thousands of possibilities and get back to you with the best one.

To show you this magic I have prepared a short (3:50 min) video (loading a Kaggle dataset, then configuring and training the app for 60 minutes). Click if you're curious to see what happens.

It can be used either as a CLI tool or from within Python code. All you need to do is prepare some good-quality data and write a little script that starts the computations (see examples). After some time (or a number of iterations) the script stops, handing you a Python snippet (based on scikit-learn) with the best configuration found.
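A minimal sketch of the in-Python usage, here on the scikit-learn digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_digits(return_X_y=True), test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)                # evolve pipelines with genetic programming
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')           # writes a standalone scikit-learn script
```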


PandaSQL

Github stats: 30 watchers, 381 stars, 50 forks

PandaSQL allows you to query Pandas DataFrames using SQL syntax.

First, you pass the DataFrame of interest to the PandaSQL engine. Then you enter an SQL query and obtain the results. You can use features like grouping, sub-queries, various kinds of joins, etc.
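A minimal sketch:

```python
import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cid'], 'amount': [10, 25, 25]})

# helper that resolves table names (like df) from the global namespace
pysqldf = lambda q: sqldf(q, globals())
print(pysqldf("SELECT amount, COUNT(*) AS cnt FROM df GROUP BY amount"))
```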

For more examples see this demo and blog post.


Auto-sklearn

Github stats: 50 watchers, 707 stars, 119 forks

Auto-sklearn is an automated machine learning toolkit.

It works similarly to TPOT but, instead of using genetic algorithms, Auto-sklearn leverages recent advances in Bayesian optimization, meta-learning and ensemble construction.

Caution: the author warns that the package probably won't work on Windows or macOS.
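A minimal sketch mirroring the TPOT example above (again on the digits dataset):

```python
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True))

# give the optimizer one hour to search for a good pipeline
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```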


Scikit-plot

Github stats: 26 watchers, 688 stars, 60 forks

An intuitive library to add plotting functionality to scikit-learn objects.

Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought.

Although its name suggests tight coupling with the Scikit-learn library, it's flexible enough to work with different APIs as well.
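A minimal sketch using the plotting functions exposed in recent releases of the package (`clf`, `X_test` and `y_test` are assumed to come from an already fitted model):

```python
import matplotlib.pyplot as plt
import scikitplot as skplt

# predictions of an already fitted classifier
y_pred = clf.predict(X_test)

# one-liner confusion matrix plot
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True)
plt.show()
```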


To help you experiment and play with some of these libraries I have prepared a Docker image (see the Dockerfile to know what's included).

It can be used for running scripts or to serve as a remote interpreter.

To download the image and get into its console, just type:

That's all.


If you know about other hidden gems for doing data analysis in Python, post them in the comments - I will be happy to review them and add them to the list.

Integrating Apache Spark 2.0 with PyCharm CE

The following post presents how to configure the JetBrains PyCharm CE IDE for developing applications with the Apache Spark 2.0+ framework.

  1. Download Apache Spark distribution pre-built for Hadoop (link).
  2. Unpack the archive. This directory will later be referred to as $SPARK_HOME.
  3. Start PyCharm and create a new project File → New Project. Call it "spark-demo".
  4. Inside project create a new Python file - New → Python File. Call it run.py.
  5. Write a simple script counting the occurrences of A's and B's inside Spark's README.md file (see the sketch after this list). Don't worry about the errors, we will fix them in the next steps.
  6. Add the required libraries: PyCharm → Preferences ... → Project: spark-demo → Project Structure → Add Content Root. Select all ZIP files from $SPARK_HOME/python/lib. Apply changes.
  7. Create a new run configuration. Go into Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.
  8. Add environment variables. Inside the newly created configuration, add the corresponding environment variables (such as SPARK_HOME). Save all changes.
  9. Run the script - Run → Run 'Run with Spark'. You should see that the script is executed properly within the Spark context.
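For reference, a sketch of the counting script from step 5 might look like this (adjust the README.md path so it points into your $SPARK_HOME):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# path assumed: point it at $SPARK_HOME/README.md on your machine
readme = spark.read.text("/path/to/spark/README.md").cache()

count_a = readme.filter(readme.value.contains("a")).count()
count_b = readme.filter(readme.value.contains("b")).count()

print("Lines with a: {}, lines with b: {}".format(count_a, count_b))
spark.stop()
```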

Now you can improve your working experience with advanced IDE features like debugging or code completion.

Happy coding.

The Tao of Text Normalization

Why bother?

Text documents are noisy. You will realize this brutally when you switch from tutorial datasets to real-world data. Cleaning things like misspellings, various abbreviations, emoticons, etc. will consume most of your time. But this preprocessing step is crucial for providing quality samples for later analysis.

This article will give you a gentle introduction to some techniques for normalizing text documents.

Flight-plan

We will discuss a couple of techniques that can be immediately used. The plan for the following sections is as follows:

  1. Basic processing
  2. Stemming
  3. Lemmatization
  4. Non-standard words mapping
  5. Stopwords

For experimentation purposes, an environment with Python 3.5 and the NLTK module is used. If you have never used it before, check this first.

All examples assume that the basic modules are loaded and that there is a helper function capable of presenting the whole text from tokens.
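Something along these lines is assumed throughout (the `present` helper is a hypothetical stand-in for the function mentioned above):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# one-time downloads of the required NLTK resources
nltk.download('wordnet')
nltk.download('stopwords')

def present(tokens, limit=30):
    """Hypothetical helper: glue tokens back together for display."""
    return ' '.join(tokens[:limit])
```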

As a playground, a review of the new Apple Watch 2 is used (from Engadget).

Basic processing

Processing starts with bringing all characters to lowercase and tokenizing the text using RegexpTokenizer. In this case, the regexp \w+ will extract only word characters.
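A minimal sketch (the file name of the review is assumed):

```python
from nltk.tokenize import RegexpTokenizer

# the Engadget review saved locally (file name assumed)
raw_text = open('apple_watch_review.txt').read()

tokenizer = RegexpTokenizer(r'\w+')        # keep only word characters
tokens = tokenizer.tokenize(raw_text.lower())
print(present(tokens))
```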

For all the consecutive examples, we assume that the text has been processed this way.

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. ~ Wikipedia

NLTK provides several stemmers, e.g. Snowball, Porter and Lancaster. You should try them on your own and see which one works best for your use case.
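For example, with the Snowball stemmer (continuing from the tokens produced above):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
stemmed = [stemmer.stem(token) for token in tokens]
print(present(stemmed))
```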

Lemmatization

Lemmatisation is the algorithmic process of determining the lemma for a given word. ~ Wikipedia

In NLTK you can use the built-in WordNet lemmatizer. It will try to match each word to an entry in WordNet. Mind that this process returns the word in its initial form if no match can be found, and that it is much slower than standard stemming.
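For example:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
print(present(lemmatized))
```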

Non-standard words mapping

Another normalization task is to distinguish non-standard words - for example, numbers, dates, etc. Each such word should be mapped to a common value, for example:

  • Mr, Mrs, Dr, ... → ABR
  • 12/05/2015, 22/01/2016, ... → DATE
  • 0, 12, 45.0 → NUM
  • ...

This process makes it easy to further summarize a text document and to derive new features (for example: counting how many times a number appears).
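A sketch of such a mapping; the rules and the helper are hypothetical and should be tuned to your own data (note that the \w+ tokenizer used above splits dates on the slashes, so date detection would require a different tokenization):

```python
import re

def map_non_standard(token):
    """Hypothetical mapping rules for non-standard words."""
    if token in {'mr', 'mrs', 'dr'}:
        return 'ABR'
    if re.fullmatch(r'\d{2}/\d{2}/\d{4}', token):
        return 'DATE'
    if re.fullmatch(r'\d+(\.\d+)?', token):
        return 'NUM'
    return token

mapped = [map_non_standard(token) for token in lemmatized]
```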

Stop words removal

Stop words usually refer to the most common words in a language. They do not provide any informative value and should be removed. Notice, however, that when you are generating features from bigrams, stop words might still provide some useful insight.

There are built-in lists for many languages that you can use (or extend).

Let's see what a lemmatized version with stop words removed looks like:
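Using NLTK's built-in English list and the lemmatized tokens from above:

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
cleaned = [token for token in lemmatized if token not in stop_words]
print(present(cleaned))
```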

Summary

Pre-processing makes text data more specific. It gets cleaned of things that humans consider important but that do not provide any value for machines. Very often a positive byproduct of normalization is a reduction in the number of potential features used in later analysis, which makes all computations significantly faster (the "curse of dimensionality"). You should also keep in mind that some of the data is irreversibly lost.

So you are eating healthy... Oh, really?

What?

I bet you have all heard that in order to stay fit you should consider eating 5 meals per day. That roughly means eating every 3 hours!

Inspired by a talk given by Tim Ferriss, I decided to conduct a conscious experiment and track each meal I consumed. Just for fun.

It all took about 2 months to complete, but the outcomes are very thought-provoking. I got acquainted with the brutal truth about myself.

What's more interesting - the experiment is fully repeatable. At the end of the post, I will give you some Python scripts that will be helpful to replicate the whole process and obtain your own personalized results.

Let's begin.

Collecting data

First and foremost, you need some data. I used the DietSnaps app. Its purpose is to take a photo of each consumed meal. You can get it from the App Store.

Even though the app provides an option to export all data (e.g. as a CSV file), I decided to take a manual approach. Each dish was labeled using the following categories:

  • was it healthy (rather yes / rather not)
  • alcohol (yes / no)
  • meat included (yes / no)
  • vegetables included (yes / no)
  • fruits included (yes / no)
  • fast-food (yes / no)
  • full-meal (yes / no)
  • form (mostly raw / mostly processed)

and put into a Google Docs spreadsheet - link.

Insights

The first thing I wanted to know was how much beer I drink each day. Let's try to visualize it with the following plot - the average number of meals and beer bottles consumed per weekday.

Oh, that's interesting. I would have bet that most beers are drunk on the weekend - but... it's Wednesday (the exhausting middle of the week). On the other hand, my training days are Mondays and Thursdays (less consumption). Hopefully, I was eating more on those days.

Let's proceed with another question.

The whole experiment took nearly 8 weeks. Taking photos of each meal obviously made me more conscious about the quality of the food. I should have been eating better with each meal, right?

Everything was going well until week 3. After that point, fast-food consumption grew continuously. The overall number of meals with fruit is also very depressing.

"What gets measured gets managed" ~ Peter Drucker

Maybe there were few fruits and vegetables, but the dishes were overall quite healthy anyway. I can calculate some proportions for each day (green means super-healthy eating, red means mega-unhealthy).

Mondays tend to be healthier than other days (new week begins with extra powers). Tuesdays and Thursdays are also quite ok (due to workouts). There are also some bad periods - see last three days of the fourth week. Awww.

Finally, let's try to answer: how often do I eat? Am I following the rule of a "meal every 3 hours"? To visualize this we will use the great concept of a time-map (you can read more about it here).

A time-map is very good for recognizing how events relate to each other in time (whether they occur in quick succession or rather slowly). Each event is plotted on an XY plot, where the axes show the time since the previous event and the time until the next event.
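A sketch of how such a plot can be produced with pandas and matplotlib (`meals` is an assumed DataFrame with one timestamp per logged meal):

```python
import matplotlib.pyplot as plt
import pandas as pd

# meals['timestamp'] is assumed: one row per photographed meal
times = pd.to_datetime(meals['timestamp']).sort_values()

hours_since_prev = times.diff().dt.total_seconds() / 3600
hours_until_next = -times.diff(-1).dt.total_seconds() / 3600

plt.scatter(hours_since_prev, hours_until_next, alpha=0.6)
plt.xlabel('Hours since previous meal')
plt.ylabel('Hours until next meal')
plt.show()
```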

And this is where the drama starts.

All of the plots are fully interactive. If you zoom in on the purple rectangle you will see how many meals were eaten in a healthy fashion.

It turns out that I ate roughly FIVE meals that were both preceded and followed by a break of about 3 hours. That's exactly 1.86% of all meals. How the hell was I supposed to build muscle if only 1.86% of all meals during those 8 weeks were consumed properly?

Try it

You might be thinking that you're living well. But such beliefs should also be verified from time to time.

Painful truth: numbers don't lie.

Plotting the results is useless if it is not followed by understanding the data and coming up with some action to make a change.

If you are curious about your own performance feel free to use this Jupyter Notebook. It will generate all of the plots presented above for you.

Stay strong.