Integrating Apache Spark 2.0 with PyCharm CE

The following post presents how to configure JetBrains PyCharm CE IDE to develop applications with Apache Spark 2.0+ framework.

  1. Download Apache Spark distribution pre-built for Hadoop (link).
  2. Unpack the archive. This directory will later be referred to as $SPARK_HOME.
  3. Start PyCharm and create a new project File → New Project. Call it "spark-demo".
  4. Inside the project, create a new Python file - New → Python File. Call it run.py.
  5. Write a simple script counting the occurrences of A's and B's inside Spark's README.md file (see the sketch after this list). Don't worry about the errors, we will fix them in the next steps.
  6. Add the required libraries. Go to PyCharm → Preferences ... → Project spark-demo → Project Structure → Add Content Root. Select all ZIP files from $SPARK_HOME/python/lib. Apply the changes.
  7. Create a new run configuration. Go into Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.
  8. Add environment variables. Inside the created configuration, add the required environment variables (typically SPARK_HOME pointing to the directory from step 2). Save all changes.
  9. Run the script - Run → Run 'Run with Spark'. You should see that the script executes properly within a Spark context.
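
A sketch of run.py along the lines of the official Spark quick start, which counts lines containing the letters "a" and "b" (the README.md path below is a placeholder - point it at your own $SPARK_HOME):

    from pyspark.sql import SparkSession

    # Placeholder path - replace with the README.md inside your $SPARK_HOME.
    readme = "/path/to/spark/README.md"

    spark = SparkSession.builder.appName("spark-demo").getOrCreate()
    lines = spark.read.text(readme).cache()

    num_a = lines.filter(lines.value.contains("a")).count()
    num_b = lines.filter(lines.value.contains("b")).count()

    print("Lines with a: %i, lines with b: %i" % (num_a, num_b))

    spark.stop()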

Now you can improve your working experience with advanced IDE features like debugging or code completion.

Happy coding.

The Tao of Text Normalization

Why bother?

Text documents are noisy. You will realize it brutally when you switch from tutorial datasets to real-world data. Cleaning things like misspellings, various abbreviations, emoticons, etc. will consume most of your time. But this preprocessing step is crucial for providing quality samples for later analysis.

This article will provide you a gentle introduction to some techniques of normalizing text documents.

Flight-plan

We will discuss a couple of techniques that can be immediately used. The plan for the following sections is as follows:

  1. Basic processing
  2. Stemming
  3. Lemmatization
  4. Non-standard words mapping
  5. Stop words removal

For experimentation purposes, an environment with Python 3.5 and the NLTK module is used. If you have never used it before, check this first.

All examples assume that basic modules are loaded, and there is a helper function capable of presenting the whole text from tokens.
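
A minimal setup sketch - the helper name to_text is my own and is reused in the snippets below:

    import nltk

    # Download the corpora used later (only needed once).
    nltk.download('wordnet')
    nltk.download('stopwords')

    def to_text(tokens):
        """Rebuild a readable string from a list of tokens."""
        return ' '.join(tokens)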

As a playground, a review of the new Apple Watch 2 is used (from Engadget).

Basic processing

Feature processing starts with bringing all characters into lowercase and tokenizing the text using RegexpTokenizer. In this case, the regexp \w+ will extract only word characters.
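
For example (review.txt is a placeholder name for the saved review):

    from nltk.tokenize import RegexpTokenizer

    raw_text = open('review.txt').read()

    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(raw_text.lower())

    print(to_text(tokens[:20]))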

For all the following examples, we assume that the text is processed this way.

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. ~ Wikipedia

NLTK provides several stemmers, e.g. Snowball, Porter, and Lancaster. You should try them on your own and see which one works best for your use case.
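
A quick way to compare them on the tokenized review:

    from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

    for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer('english')):
        print(stemmer.__class__.__name__)
        print(to_text(stemmer.stem(t) for t in tokens))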

Lemmatization

Lemmatisation is the algorithmic process of determining the lemma for a given word. ~ Wikipedia

In NLTK you can use the built-in WordNet lemmatizer. It will try to match each word to an instance within WordNet. Mind that this process returns the word in its initial form if it cannot be found, and is much slower than standard stemming.
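
For example:

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(t) for t in tokens]

    print(to_text(lemmatized))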

Non-standard words mapping

Another normalization task is to distinguish non-standard words - for example, numbers, dates, etc. Each such word should be mapped to a common value, for example:

  • Mr, Mrs, Dr, ... → ABR
  • 12/05/2015, 22/01/2016, ... → DATE
  • 0, 12, 45.0 → NUM
  • ...

This process makes it easier to further summarize a text document and to derive new features (for example: count how many times a number appears).
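
A minimal sketch of such a mapping - the regular expressions and the abbreviation list are illustrative and should be adapted to your own data:

    import re

    def map_non_standard(token):
        if re.fullmatch(r'\d{2}/\d{2}/\d{4}', token):
            return 'DATE'
        if re.fullmatch(r'\d+(\.\d+)?', token):
            return 'NUM'
        if token in {'mr', 'mrs', 'dr'}:
            return 'ABR'
        return token

    print([map_non_standard(t) for t in ['dr', 'smith', 'paid', '45.0', 'on', '12/05/2015']])
    # ['ABR', 'smith', 'paid', 'NUM', 'on', 'DATE']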

Stop words removal

Stop words usually refer to the most common words in a language. They do not provide any informative value and should be removed. Notice however that when you are generating features with bigrams, stop words might still provide some useful insights.

There are built-in lists for many languages that you can use (or extend).

Let's see what a lemmatized version with stop words removed looks like:
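
A sketch building on the previous snippets:

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    filtered = [t for t in lemmatized if t not in stop_words]

    print(to_text(filtered))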

Summary

Pre-processing makes text data more specific. It gets cleaned of things humans consider important but that do not provide any value for machines. Very often a positive byproduct of normalization is the reduction of potential features used in later analysis, which makes all computation significantly faster (easing the "curse of dimensionality"). You should also keep in mind that some of the data is irreversibly lost.

Dive into WordNet with NLTK

It's a common fact that analyzing text documents is a tough nut to crack for computers. Simple tasks like distinguishing whether a sentence has a positive meaning, or deciding that two words mean literally the same thing, require a lot of samples and training various machine learning models.

This article will show you how you can increase the quality of generated features and gain new insights about your data.

Enter WordNet

WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with richer structure.

The NLTK module includes the English WordNet with 155,287 words and 117,659 synonym sets that are logically related to each other.

It's completely free! In the experiments below, we will use Python 3.5 (NLTK can be easily installed with pip).

Begin with importing the WordNet module:
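    from nltk.corpus import wordnet as wn

The snippets in the following sections use this standard NLTK 3 API; the outputs shown in comments are indicative.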

The first concept worth understanding is a "synset":

Synset - "synonym set" - a collection of synonymous words

We can check what is the synset of the word motorcar:
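    wn.synsets('motorcar')
    # [Synset('car.n.01')]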

The output means that the word motorcar has just one possible context. It is identified by car.n.01 (we will call it a "lemma code name") - the first noun (letter n) sense of car.

You can dig in and see what other words are within this particular synset:
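    wn.synset('car.n.01').lemma_names()
    # ['car', 'auto', 'automobile', 'machine', 'motorcar']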

All of the 5 words have the same context - a car (more precisely - car.n.01).

But as you might expect, a word can be ambiguous - for example, a printer:
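    wn.synsets('printer')
    # [Synset('printer.n.01'), Synset('printer.n.02'), Synset('printer.n.03')]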

You see that there are 3 possible contexts. To help understand the meaning of each one, we can look at its definition and the provided examples (if available).

Like in the previous examples, we can see what words (lemmas) are included in each synset. Here you can see how code names help to avoid ambiguity. Note that we can call both lemma_names() and lemmas() on a synset.
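
For example, for the third sense:

    printer = wn.synset('printer.n.03')

    printer.definition()     # 'a machine that prints'
    printer.examples()       # usage examples, if any are available
    printer.lemma_names()    # e.g. ['printer', 'printing_machine']
    printer.lemmas()         # e.g. [Lemma('printer.n.03.printer'), ...]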

Lexical relations

As you can see, WordNet creates a sort of hierarchy. There are very general and abstract concepts (like event, entity, thing) and very specific ones, like a starship.

NLTK makes it easy to navigate between concepts in different directions by using some special terms.

Hyponym - a more specific concept

For example, let's see the hyponyms of the synset printer.n.03, which is defined as "a machine that prints":
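    wn.synset('printer.n.03').hyponyms()
    # e.g. [Synset('addressing_machine.n.01'), Synset('character_printer.n.01'),
    #       Synset('line_printer.n.01'), Synset('thermal_printer.n.01'), ...]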

We can also go in the opposite way - towards the most general concept.

Hypernym - a more general concept.

Let's inspect its value for the example above:
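    wn.synset('printer.n.03').hypernyms()
    # [Synset('machine.n.01')]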

In this case, the more abstract term for a printing machine is just a machine.

We can also obtain the top-level hypernym (in this case entity.n.01) and the complete path of words it takes to get to it:
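    printer = wn.synset('printer.n.03')

    printer.root_hypernyms()    # [Synset('entity.n.01')]
    printer.hypernym_paths()    # the full chain(s) from entity.n.01 down to printer.n.03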

Both hyponyms and hypernyms are called lexical relations. They form so-called "is-a" relationship between synsets.

There is also another way to navigate through WordNet - towards the components of an item (meronyms) or towards the things it is contained in (holonyms).

Meronym - denotes a part of something

For meronyms, we can take advantage of two of NLTK's functions:

  • part_meronyms() - obtains parts,
  • substance_meronyms() - obtains substances

Let's see it in action for the word tree.
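    tree = wn.synset('tree.n.01')

    tree.part_meronyms()        # e.g. burl, crown, limb, stump, trunk
    tree.substance_meronyms()   # e.g. heartwood, sapwood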

On the other side we have holonyms:

Holonym - denotes membership in something

Like above, we have two functions available here - part_holonyms() and substance_holonyms().

You can see how they work for words like atom and hydrogen.
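    wn.synset('atom.n.01').part_holonyms()            # e.g. [Synset('molecule.n.01'), ...]
    wn.synset('hydrogen.n.01').substance_holonyms()   # e.g. [Synset('water.n.01'), ...]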

There is also a relationship specific to verbs - entailments.

Entailment - denotes how verbs are involved

You can obtain them using the entailments() function.
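
For example, with two classic verbs:

    wn.synset('walk.v.01').entailments()   # e.g. [Synset('step.v.01')]
    wn.synset('eat.v.01').entailments()    # e.g. [Synset('chew.v.01'), Synset('swallow.v.01')]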

Similarity

You have seen that words in WordNet are linked to each other in different ways. Given a particular synset you can traverse the whole network to find related objects.

Recall that each synset has one or more parents (hypernyms). If two synsets are linked to the same root, they might have several hypernyms in common - that fact might mean that they are closely related. You can find them with the lowest_common_hypernyms() function.

Check what words truck and limousine have in common:
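    truck = wn.synset('truck.n.01')
    limousine = wn.synset('limousine.n.01')

    truck.lowest_common_hypernyms(limousine)
    # e.g. [Synset('motor_vehicle.n.01')]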

You can also examine how specific a certain word is by analyzing its depth in the hierarchy.
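
For example:

    wn.synset('entity.n.01').min_depth()      # 0 - the most general concept
    wn.synset('limousine.n.01').min_depth()   # a noticeably larger depth for a specific word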

WordNet also provides a metric for quantifying the similarity of two words by measuring the shortest path between them. It outputs:

  • a value in the range (0, 1) → close to 0 if not similar at all, 1 if perfectly similar
  • -1 → if there is no common hypernym

Let's try some examples:
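    truck = wn.synset('truck.n.01')
    limousine = wn.synset('limousine.n.01')
    tree = wn.synset('tree.n.01')

    truck.path_similarity(limousine)   # e.g. 0.25 - closely related vehicles
    truck.path_similarity(tree)        # a much lower value for distant concepts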

What's next?

As always, I recommend you experiment with the NLTK module on your own and try to incorporate some of its features into your models.

In the first place you can try to:

  • use synset code names instead of words,
  • add a word's synonyms as additional features,
  • calculate how specific each word is (or its average across a sentence, etc.)

For more information, you can refer to "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper.

So you are eating healthy... Oh, really?

What?

I bet you have all heard that in order to stay fit you should consider eating 5 meals per day. That roughly means eating every 3 hours!

Inspired by a talk given by Tim Ferris, I decided to conduct a conscious experiment to track each meal I was consuming. Just for fun.

It took about 2 months to complete, but the outcomes are very thought-provoking. I got acquainted with the brutal truth about myself.

What's more interesting - the experiment is fully repeatable. At the end of the post, I will give you some Python scripts that will be helpful to replicate the whole process and obtain your own personalized results.

Let's begin.

Collecting data

First and foremost, you need some data. I have used the DietSnaps app. Its purpose is to take a photo of each consumed meal. You can get it from the AppStore.

Even though the app provides an option to export all data (e.g. as a CSV file), I decided to take a manual approach. Each dish was labeled using the following categories:

  • was it healthy (rather yes / rather not)
  • alcohol (yes / no)
  • meat included (yes / no)
  • vegetables included (yes / no)
  • fruits included (yes / no)
  • fast-food (yes / no)
  • full-meal (yes / no)
  • form (mostly raw / mostly processed)

and put into a Google Docs spreadsheet - link.

Insights

The first thing I wanted to know is how much beer I drink each day. Let's try to visualize it with the following plot - the average number of meals and beer bottles consumed on each weekday.

Oh, that's interesting. I would have bet that most beers are drunk on the weekend - but... it's Wednesday (the exhausting middle of the week). On the other hand, my training days are Mondays and Thursdays (less consumption). Hopefully, I was eating more on those days.

Let's proceed with another question.

The whole experiment took nearly 8 weeks. Taking a photo of each meal has obviously made me more conscious about the quality of my food. I should be eating better with each meal, right?

Everything was going well until week 3. After that, fast-food consumption kept growing. The overall number of meals with fruits is also very depressing.

"What gets measured gets managed" ~ Peter Drucker

Maybe there were few fruits and vegetables, but the dishes were overall quite healthy. I can calculate some proportions for each day (green means super-healthy eating, red - mega-unhealthy).

Mondays tend to be healthier than other days (new week begins with extra powers). Tuesdays and Thursdays are also quite ok (due to workouts). There are also some bad periods - see last three days of the fourth week. Awww.

Finally, let's try to answer: how often do I eat? Am I following the rule of a "meal every 3 hours"? To visualize this, we will use the great concept of a time-map (you can read more about it here).

A time-map is very good for recognizing how events relate to each other in time (whether they occur in quick succession or rather slowly). Each event is plotted on an XY plot, where the axes show the time since the previous event and the time until the next event.
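
A minimal sketch of the idea, assuming meal timestamps as the only input (the values below are made up just to show the shape of the plot):

    import matplotlib.pyplot as plt
    import pandas as pd

    # Made-up meal timestamps, sorted chronologically.
    meals = pd.to_datetime([
        '2016-09-01 08:10', '2016-09-01 12:40', '2016-09-01 19:05',
        '2016-09-02 09:30', '2016-09-02 14:00',
    ])

    gaps = meals.to_series().diff().dt.total_seconds() / 3600.0

    before = gaps.iloc[1:-1]   # hours since the previous meal
    after = gaps.iloc[2:]      # hours until the next meal

    plt.scatter(before.values, after.values)
    plt.xlabel('hours since previous meal')
    plt.ylabel('hours until next meal')
    plt.show()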

And this is where the drama starts.

All of the plots are fully interactive. If you zoom in on the purple rectangle, you will see how many meals were eaten in a healthy fashion.

It turns out that I have eaten roughly FIVE meals that were both preceded and followed by about a 3-hour break. That's exactly 1.86% of all meals. How the hell was I supposed to build muscle if only 1.86% of all meals during 8 weeks were consumed properly?

Try it

You might be thinking that you're living well. But these beliefs should also be verified from time to time.

Painful truth: numbers don't lie.

Plotting the results is useless if it is not followed by understanding the data and coming up with some action to make a change.

If you are curious about your own performance feel free to use this Jupyter Notebook. It will generate all of the plots presented above for you.

Stay strong.

Hypothesis Testing for Gangsters

Okay. Okay. OKAY. Look. I know you have a problem. You've been screwed by someone and now want your money back. Totally agree.

But first take a big breath and relax - you don't want to get into bigger trouble. Let's do it another way. I want to help you go one step further and do it like a PRO. And believe me, this makes a huge difference.

So go, grab your drink, and read these 5 tips.

How to do this

Read each step carefully. At the end of each step, you will find what you should have after accomplishing it.

  1. Formulate the hypothesis you want to validate
    A null hypothesis (H_{0}) is a statement we want to validate. Unless we find sufficient evidence, there will be no reason to reject it.

    A drug dealer states that cocaine is pure in 90%.

    The null hypothesis is (H_{0}\colon\ p = 0.9).

    An alternative hypothesis (H_{1}) is a statement that automatically becomes "true" (not rejected) if the null hypothesis gets discarded.

    A customer doubts the drug's purity. He states that it contains more than 10% additives. The alternative hypothesis can be (H_{1}\colon\ p < 0.9).

    After this step, you should have formulated (H_{0}) and (H_{1})

  2. Choose a test statistic
    Our overall aim is to validate the null hypothesis. We have to assume that it is true and then look for arguments to demolish it. Yeah. In more scientific speech, we have to come up with the probability distribution of the test statistic under the assumption that the null hypothesis is correct.

    A customer bought 15 decks of a drug. After hosting a big party, he realized that ONLY 11 decks met the standard guaranteed by the dealer (the test statistic). Remembering the wise words of the dealer, his test distribution can be (X \sim B(15;\ 0.9)). Someone will have a problem.

    After this step, you should have figured out the test statistic (based on the experiment) and the test distribution.

  3. Choose a critical region (one-tailed or two-tailed test)
    Right now we have the probability distribution of the test statistic, but we still need to choose for which values the null hypothesis gets rejected (the critical region) and for which it gets accepted (the acceptance region). We use the term significance level (\alpha): a probability small enough that, if an outcome this unlikely occurs, we agree to reject the null hypothesis.

    The customer has chosen a significance level of (\alpha = 5\%), meaning that the critical region (where we reject the null hypothesis) can be described as: (P(X < c) < 0.05).

    Depending on the form of (H_{1}) we can also specify whether the critical region is one-tailed or two-tailed.

    A one-tailed critical region occurs when the alternative hypothesis is expressed with an inequality. For example, if (H_{1}\colon\ p < c) we should use a left one-tailed critical region, and for (H_{1}\colon\ p > c) a right one-tailed one.

    When (H_{1}) is expressed with the (\neq) sign, we are dealing with a two-tailed critical region. In this case, the critical region is placed in both tails of the distribution, where each side corresponds to a probability of (\frac{\alpha}{2}).

    Because the alternative hypothesis is (H_{1}\colon\ p < 0.9), the scammed customer is dealing with a left one-tailed critical region.

    After this step, you should have specified the significance level (\alpha) and know whether the critical region is one-tailed or two-tailed.

  4. Calculate the probability (p-value)
    The p-value is the probability of getting the same (or more extreme) results under the null hypothesis. Its value depends on two things:
    • the form of the alternative hypothesis (H_{1}) (one or two tails),
    • the value of the test statistic (based on the test distribution)

    In the case of our customer, the test statistic is 11 (decks of pure drugs) and the critical region is located in the left tail. The formula for the p-value is (P(X \le 11)). Taking into consideration (X \sim B(15;\ 0.9)), its value is (P(X \le 11) \approx 0.055). To calculate this, he used a short snippet (a SciPy-based equivalent is sketched after this list).

    After this step, you should have obtained the p-value.

  5. Make a decision
    In this last step, we finally decide whether the null hypothesis gets rejected or not (i.e. whether the dealer was right or not). The null hypothesis gets rejected if the test statistic falls into the critical region, which happens when the p-value is smaller than the significance level: (H_{0}) gets rejected if (P_{value} < \alpha).

    The customer has to abandon his hypothesis (H_{1}). In this case, the p-value ((P_{value} \approx 0.055)) is greater than the significance level ((\alpha = 0.05)), which means there is no reason to reject the drug dealer's claim ((H_{0}) stands). DAMN.

    After this step, you finally know whether there are reasons to reject (H_{0}).
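
A minimal SciPy-based sketch of the p-value calculation from step 4:

    from scipy.stats import binom

    n, p = 15, 0.9     # decks bought, purity claimed by the dealer (H0)
    observed = 11      # decks that actually met the standard

    # Left-tailed p-value: probability of 11 or fewer good decks under H0.
    p_value = binom.cdf(observed, n, p)
    print(round(p_value, 4))   # ~0.0556

    alpha = 0.05
    print('reject H0' if p_value < alpha else 'fail to reject H0')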

Q&A

Question: What value of significance level should I choose?

Answer: It all depends on how sure you want to be that you are making no mistake when rejecting a null hypothesis. For example, choosing ( \alpha = 1\% ) gives you more certainty that your decision about rejecting ( H_{0}) was correct than ( \alpha = 5\%).

Summary

I have to admit it. I'm a bit scared. You have received a powerful tool - a tool that helps prove that you're RIGHT in many cases.

But please, remember about others who might still need some help. Share it with them, and make them your debtors.