Dive into WordNet with NLTK

It's well known that analyzing text documents is a tough nut to crack for computers. Even simple tasks, like deciding whether a sentence has a positive meaning or whether two words mean exactly the same thing, require a lot of samples and the training of various machine learning models.

This article will show you how you can increase the quality of generated features and gain new insights about your data.

Enter WordNet

WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with richer structure.

The NLTK module includes the English WordNet, with 155 287 words and 117 659 synonym sets that are logically related to each other.

It's completely free! In the experiments below, we will use Python 3.5 and NLTK (which can be easily installed with pip).

Begin by importing the WordNet module:

The first concept worth understanding is a "synset":

Synset - "synonym set" - a collection of synonymous words

We can check which synsets the word motorcar belongs to:

The output means that the word motorcar has just one possible context. It is identified by car.n.01 (we will call it the "synset code name") - the first noun (letter n) sense of car.

You can dig in and see which other words belong to this particular synset:

All 5 words share the same context - a car (more precisely - car.n.01).

But, as you might expect, a word can be ambiguous. Take printer, for example:

You can see that there are 3 possible contexts. To help understand the meaning of each one, we can look at its definition and the provided examples (if any are available).

As in the previous examples, we can see which words (lemmas) are included in each synset. Here you can see how the code names help to avoid ambiguity. Note that we can call both lemma_names() and lemmas() on a synset.

Lexical relations

As you can see, WordNet forms a sort of hierarchy. There are very general and abstract concepts (like event, entity, thing) and very specific ones (like starship).

NLTK makes it easy to navigate between concepts in different directions by using some special terms.

Hyponym - a more specific concept

For example, let's see the hyponyms of the synset printer.n.03, which is defined as "a machine that prints":

We can also go the opposite way - towards the most general concept.

Hypernym - a more general concept.

Let's inspect its value for the example above:

In this case, the more abstract term for a printing machine is simply a machine.

We can also obtain the top-level hypernym (in this case entity.n.01) and the complete path of synsets that leads to it:

Both hyponyms and hypernyms are called lexical relations. They form a so-called "is-a" relationship between synsets.

There is also another way to navigate through WordNet - from items to their components (meronyms) or to the things they are contained in (holonyms).

Meronym - denotes a part of something

For meronyms we can take advantage of two NLTK functions:

  • part_meronyms() - obtains parts,
  • substance_meronyms() - obtains substances

Let's see it in action for the word tree.

On the other side we have holonyms:

Holonym - denotes the whole that something is a part of

As above, we also have 2 functions available here - part_holonyms() and substance_holonyms().

You can see how they work for words like atom and hydrogen.

There is also a relationship specific to verbs - entailments.

Entailment - denotes an action that a verb necessarily involves (for example, walking involves stepping)

You can obtain them using the entailments() function.

Similarity

You have seen that words in WordNet are linked to each other in different ways. Given a particular synset you can traverse the whole network to find related objects.

Recall that each synset has one or more parents (hypernyms). If two synsets are linked to the same root, they may have several hypernyms in common, and the more specific their shared hypernym is, the more closely related they are likely to be. You can find it with the function lowest_common_hypernyms().

Check what the words truck and limousine have in common:

You can also examine how specific a certain word is by analyzing its depth in the hierarchy.

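WordNet also introduces a specific metric for quantifying the similarity of two words by measuring the shortest path between them in the hypernym hierarchy. In NLTK it is available as path_similarity(), which outputs:

  • a score in the range (0, 1] → the closer to 0, the less similar; 1 if the two synsets are identical
  • None → if no connecting path can be found (older NLTK documentation mentions -1 for this case)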

Let's try some examples:

What's next?

As always, I recommend that you experiment with the NLTK module on your own and try to incorporate some of these features into your models.

To start with, you can try to:

  • use synset code names instead of words,
  • add a word's synonyms as additional features,
  • calculate how specific each word is (or its average across a sentence, etc.)

For more information, you can refer to "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper.

Norbert
