Boolean Multiplexer in Practice

Introduction

There are two popular types of problems for evaluating learning classifier systems:

  • single-step - "question-answer"-like problems,
  • multi-step - problems where multiple consecutive steps are needed to reach a solution. The most popular are different kinds of mazes (in the literature often referred to as various MAZE or WOODS environments).

This article will focus on a method for testing single-step systems, where the environment has the Markov property (each state is independent of its predecessor).

The Boolean multiplexer function will be described first, followed by some examples and a simple Python implementation.

Enter ...

Multiplexer

First, let's gain some intuition about the idea of a multiplexer:

Multiplexing is the generic term used to describe the operation of sending one or more analog or digital signals over a common transmission line at different times or speeds. [source]

In the following scheme, an example of a 4-to-1 multiplexer with 4 inputs, 2 control signals, and 1 output is presented. The output Q takes the value of one of the input signals A, B, C or D, depending on the values of a and b.

There are, of course, many different configuration options available, but this knowledge should be sufficient for now.

Boolean multiplexer function

A Boolean multiplexer is the special case in which each signal is binary, represented as either 0 or 1.

By convention, the incoming signal consists of two concatenated parts - control bits and data bits.

In the example above, we are dealing with a 6-bit Boolean multiplexer. The first 2 bits are control bits, capable of addressing the 4 data bits that follow (2^2 = 4).

The output is the data bit at the location obtained by converting the control bits to decimal (in this case bin(01) = dec(1)). Data bit indexing starts from zero.
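For instance, take the (hypothetical) 6-bit signal 011010: the control bits 01 address data bit 1, and since the data bits are 1010, the output is 0.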

Examples

Below you will find three examples of multiplexer functions.

3-bit

Control bits: 1, Data bits: 2

With the control bit set to 0, the output is the 0-th data bit.

6-bit

Control bits: 2, Data bits: 4

With the control bits set to 11 (decimal 3), the output is the 3-rd data bit.

11-bit

Control bits: 3, Data bits: 8

With the control bits set to 101 (decimal 5), the output is the 5-th data bit.

Implementation

The following implementation generates a random binary signal (the user provides the number of control bits) and prints the correct output value for that signal.

Mind that the bitstring module needs to be installed first (e.g. with pip install bitstring).

Here is an example of using 2 bits for controlling the signal (6-bit multiplexer):
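The listing below is a minimal sketch of such an implementation (the function names generate_signal and multiplexer are illustrative):

```python
import random

from bitstring import BitArray


def generate_signal(control_bits):
    """Generate a random signal of length k + 2^k for k control bits."""
    length = control_bits + 2 ** control_bits
    return BitArray(uint=random.getrandbits(length), length=length)


def multiplexer(signal, control_bits):
    """Return the data bit addressed by the control bits."""
    address = signal[:control_bits].uint  # control bits converted to decimal
    return signal[control_bits + address]


if __name__ == '__main__':
    control_bits = 2  # 6-bit multiplexer
    signal = generate_signal(control_bits)
    print('Signal: %s, output: %d' % (signal.bin, multiplexer(signal, control_bits)))
```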


Integrating Apache Spark 2.0 with PyCharm CE

The following post presents how to configure the JetBrains PyCharm CE IDE to develop applications with the Apache Spark 2.0+ framework.

  1. Download the Apache Spark distribution pre-built for Hadoop (link).
  2. Unpack the archive. This directory will later be referred to as $SPARK_HOME.
  3. Start PyCharm and create a new project: File → New Project. Call it "spark-demo".
  4. Inside the project create a new Python file: New → Python File. Call it run.py.
  5. Write a simple script counting the occurrences of A's and B's inside Spark's README.md file (see the sketch after this list). Don't worry about the errors, we will fix them in the next steps.
  6. Add the required libraries: PyCharm → Preferences ... → Project spark-demo → Project Structure → Add Content Root. Select all ZIP files from $SPARK_HOME/python/lib. Apply the changes.
  7. Create a new run configuration. Go into Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.
  8. Add environment variables. Inside the created configuration add the corresponding environment variables (at minimum SPARK_HOME pointing to the unpacked distribution). Save all changes.
  9. Run the script: Run → Run 'Run with Spark'. You should see that the script is executed properly within a Spark context.
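A minimal sketch of the script from step 5 could look as follows. It mirrors the line-count example from the official Spark quick start; the README.md path is an assumption and should be adjusted to your $SPARK_HOME:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session; requires SPARK_HOME to be set correctly.
spark = SparkSession.builder.appName('spark-demo').getOrCreate()

# Adjust this path so it points at $SPARK_HOME/README.md.
text = spark.read.text('/path/to/spark/README.md').cache()

# Count the lines containing the letter 'a' and the letter 'b'.
num_as = text.filter(text.value.contains('a')).count()
num_bs = text.filter(text.value.contains('b')).count()

print('Lines with a: %i, lines with b: %i' % (num_as, num_bs))
spark.stop()
```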

Now you can improve your working experience with advanced IDE features like debugging or code completion.

Happy coding.

The Tao of Text Normalization

Why bother?

Text documents are noisy. You will realize this brutally when you switch from tutorial datasets to real-world data. Cleaning up misspellings, various abbreviations, emoticons, etc. will consume most of your time. But the feature-processing step is crucial for providing quality samples for the later analysis.

This article will give you a gentle introduction to some techniques for normalizing text documents.

Flight-plan

We will discuss a couple of techniques that can be immediately used. The plan for the following sections is as follows:

  1. Basic processing
  2. Stemming
  3. Lemmatization
  4. Non-standard words mapping
  5. Stopwords

For experimentation purposes, an environment with Python 3.5 and the NLTK module is used. If you have never used it before, check this first.

All examples assume that the basic modules are loaded and that there is a helper function capable of presenting the whole text from tokens.
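Such a setup could look as follows (a minimal sketch; the helper name present is illustrative):

```python
import nltk

# Download the corpora used in the examples (needed only once).
nltk.download('wordnet')
nltk.download('stopwords')


def present(tokens):
    """Rebuild a readable string from a list of tokens."""
    return ' '.join(tokens)
```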

As a playground, a review of the new Apple Watch 2 is used (from Engadget).

Basic processing

Feature processing will start with bringing all characters to lowercase and tokenizing the text using RegexpTokenizer. In this case, the regexp \w+ will extract only word characters.
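For example (a sketch; the raw_text variable and its content are made up for illustration):

```python
from nltk.tokenize import RegexpTokenizer

raw_text = 'The Apple Watch 2 is faster, waterproof and has built-in GPS.'
tokenizer = RegexpTokenizer(r'\w+')  # keep word characters only
tokens = tokenizer.tokenize(raw_text.lower())
print(tokens)  # ['the', 'apple', 'watch', '2', 'is', 'faster', ...]
```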

For all the subsequent examples, we assume that the text is processed this way.

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. ~ Wikipedia

NLTK provides several stemmers, e.g. Snowball, Porter, Lancaster. You should try them on your own and see which one works best for your use case.
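A quick sketch using the Snowball stemmer (the sample tokens are made up for illustration):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
tokens = ['watches', 'batteries', 'running', 'notifications']
print([stemmer.stem(token) for token in tokens])
# e.g. 'running' -> 'run', 'batteries' -> 'batteri'
```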

Lemmatization

Lemmatisation is the algorithmic process of determining the lemma for a given word. ~ Wikipedia

In NLTK you can use the built-in WordNet lemmatizer. It will try to match each word to an entry in WordNet. Mind that this process returns the word unchanged if it cannot be found, and that it is much slower than standard stemming.
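A minimal sketch (it needs the WordNet corpus, downloadable with nltk.download('wordnet')):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('watches'))          # 'watch'
print(lemmatizer.lemmatize('better', pos='a'))  # 'good' - the part of speech matters
print(lemmatizer.lemmatize('engadget'))         # not in WordNet, returned unchanged
```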

Non-standard words mapping

Another normalization task is to distinguish non-standard words - for example numbers, dates, etc. Each such word should be mapped to a common value, for example:

  • Mr, Mrs, Dr, ... → ABR
  • 12/05/2015, 22/01/2016, ... → DATE
  • 0, 12, 45.0 → NUM
  • ...

This process makes it easier to summarize a text document and to derive new features from it (for example, counting how many times a number appears).
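A sketch of such a mapping with a few illustrative rules (real-world patterns would need to be far more robust):

```python
import re

ABBREVIATIONS = {'mr', 'mrs', 'dr'}


def map_non_standard(token):
    """Map non-standard tokens to common placeholder values."""
    if re.fullmatch(r'\d{2}/\d{2}/\d{4}', token):
        return 'DATE'
    if re.fullmatch(r'\d+(\.\d+)?', token):
        return 'NUM'
    if token in ABBREVIATIONS:
        return 'ABR'
    return token


tokens = ['dr', 'jones', 'paid', '45.0', 'on', '12/05/2015']
print([map_non_standard(token) for token in tokens])
# ['ABR', 'jones', 'paid', 'NUM', 'on', 'DATE']
```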

Stop words removal

Stop words usually refer to the most common words in a language. They typically carry little informative value and can be removed. Notice, however, that when you are generating features such as bigrams, stop words might still provide useful insights.

There are built-in lists for many languages that you can use (or extend).

Let's see what a lemmatized version with stop words removed looks like:
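A sketch combining lemmatization with stop word removal (the stopwords corpus comes from nltk.download('stopwords'); the sample tokens are made up):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

tokens = ['the', 'watch', 'is', 'one', 'of', 'the', 'best', 'devices']
normalized = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
print(normalized)  # ['watch', 'one', 'best', 'device']
```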

Summary

Pre-processing makes text data more specific: it gets cleaned of things that humans consider important but that do not provide any value for machines. Very often a positive byproduct of normalization is a reduction in the number of potential features used in the later analysis, which makes all computations significantly faster (think of the "curse of dimensionality"). You should also keep in mind that some of the data is irreversibly lost in the process.