"Call for backup" ... with Elasticsearch

Introduction

I bet you have all heard this ancient adage:

There are two types of people - those who back up, and those who will.

This post is dedicated to the second group: those who have just started using Elasticsearch with their production data and subconsciously feel that something is wrong.


Tools

No doubt about it - I'm a big Docker fan. It saves tons of time and is super easy to use. If you haven't used it before, I strongly encourage you to learn it now.

To perform the backup we will use the official elasticsearch-dump tool inside a Docker container. In the examples below I'm running Elasticsearch 2.4.

We will create two bash scripts - one for backing up the data and the other, more importantly, for restoring it. Optionally, you can also configure the cron scheduler to run the backup for you automatically.

Feel free to adjust the directories and file names according to your needs - there are no strict rules here.

Backup

Create a file called perform_backup.sh with the following content:
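A minimal sketch of such a script, assuming the taskrabbit/elasticsearch-dump image and a backup directory of /opt/es-backup/dumps (both are assumptions - use whatever fits your setup):

    #!/bin/bash
    # perform_backup.sh - dump the mapping and the data of a single index
    # to timestamped JSON files on the host.
    # Replace <INDEX> with the name of the index you want to back up.

    BACKUP_DIR=/opt/es-backup/dumps                  # where the dumps will land (assumption)
    ES_URL=http://elasticsearch.example.com:9200     # valid URL of your ES server (placeholder)
    TIMESTAMP=$(date +%Y-%m-%d_%H-%M)

    mkdir -p "$BACKUP_DIR"

    # The host directory is mounted into the container so the dump files
    # end up on the host, not inside the throwaway container.
    docker run --rm -v "$BACKUP_DIR":/tmp taskrabbit/elasticsearch-dump \
      --input="$ES_URL/<INDEX>" \
      --output=/tmp/<INDEX>_mapping_$TIMESTAMP.json \
      --type=mapping

    docker run --rm -v "$BACKUP_DIR":/tmp taskrabbit/elasticsearch-dump \
      --input="$ES_URL/<INDEX>" \
      --output=/tmp/<INDEX>_data_$TIMESTAMP.json \
      --type=data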

Make sure to provide a valid URL to your Elasticsearch server and replace <INDEX> with the name of the index you want to back up.

That's it.

Restore

Create a file revert_backup.sh and paste the code below:
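A sketch of the restore script, under the same assumptions as above (image name, placeholder server URL and <INDEX>):

    #!/bin/bash
    # revert_backup.sh - restore an index from the dump file passed as the first argument.

    if [ -z "$1" ]; then
      echo "Usage: $0 <path_to_backup_file>"
      exit 1
    fi

    BACKUP_FILE=$(realpath "$1")
    ES_URL=http://elasticsearch.example.com:9200     # valid URL of your ES server (placeholder)

    # Mount the directory containing the dump into the container and push
    # its contents back into the target index. The mapping dump can be
    # restored the same way with --type=mapping.
    docker run --rm -v "$(dirname "$BACKUP_FILE")":/tmp taskrabbit/elasticsearch-dump \
      --input=/tmp/"$(basename "$BACKUP_FILE")" \
      --output="$ES_URL/<INDEX>" \
      --type=data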

As in the previous example, make sure that the --output argument points to the destination index you want to restore into.

Also note that you need to pass the backup file as an argument for the script to run correctly, e.g.:
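(the file name below is just an illustration, matching the naming used in the backup sketch)

    ./revert_backup.sh /opt/es-backup/dumps/<INDEX>_data_2017-01-01_05-00.json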

Cron

If you want to become a pro, you can go one step further and schedule automatic backups.

All it takes is a single command.

The following command adds an entry to crontab that executes the backup script every day at 5 AM.
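A sketch of that command, assuming the script lives at /opt/es-backup/perform_backup.sh (adjust the path and schedule to your needs):

    # Append a "run daily at 5 AM" job to the current user's crontab
    (crontab -l 2>/dev/null; echo "0 5 * * * /opt/es-backup/perform_backup.sh") | crontab -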

Summary

Sooner or later everybody ends up backing up their data. The feeling that we can deal with unexpected situations puts our minds at ease and is worth pursuing.

Remember to thoroughly test the whole process a couple of times and see if you can fully rely on it. The scripts provided above are examples and you should adjust them to your needs.

Integrating Apache Spark 2.0 with PyCharm CE

The following post shows how to configure the JetBrains PyCharm CE IDE to develop applications with the Apache Spark 2.0+ framework.

  1. Download the Apache Spark distribution pre-built for Hadoop (link).
  2. Unpack the archive. This directory will later be referred to as $SPARK_HOME.
  3. Start PyCharm and create a new project: File → New Project. Call it "spark-demo".
  4. Inside the project create a new Python file - New → Python File. Call it run.py.
  5. Write a simple script counting the occurrences of the letters "a" and "b" inside Spark's README.md file (a sketch is shown after this list). Don't worry about the errors reported by the IDE - we will fix them in the next steps.
  6. Add the required libraries: PyCharm → Preferences ... → Project spark-demo → Project Structure → Add Content Root. Select all ZIP files from $SPARK_HOME/python/lib. Apply the changes.
  7. Create a new run configuration. Go to Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.
  8. Add environment variables. Inside the created configuration add the required environment variables (see the note after this list). Save all changes.
  9. Run the script - Run → Run 'Run with Spark'. You should see that the script executes properly within a Spark context.
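For step 5, a sketch of run.py - the README.md path is an assumption, point it at your own $SPARK_HOME:

    from pyspark.sql import SparkSession

    # Path to the README.md shipped with the Spark distribution (assumption).
    README = "/path/to/spark/README.md"

    spark = SparkSession.builder.appName("spark-demo").getOrCreate()
    lines = spark.read.text(README).cache()

    # Count how many times the letters "a" and "b" occur in the file.
    num_as = lines.rdd.map(lambda row: row.value.lower().count("a")).sum()
    num_bs = lines.rdd.map(lambda row: row.value.lower().count("b")).sum()

    print("a: %d, b: %d" % (num_as, num_bs))

    spark.stop()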
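For step 8, the run configuration typically needs SPARK_HOME pointing at the unpacked distribution and PYTHONPATH including the py4j ZIP; the values below are a sketch, and the exact py4j file name depends on your distribution:

    SPARK_HOME=/path/to/spark
    PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-<version>-src.zip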

Now you can improve your working experience with advanced IDE features like debugging and code completion.

Happy coding.

Reproducible infrastructure for Data Scientists

You should always care about the reproducibility of your data analysis. A lot of related resources focus on how the data is processed. This article is about maintaining a unified data science technology stack.

The original version of the text was published on the Kaggle blog. The main idea is to use Docker containers as a central place for storing all the necessary libraries and tools. Later on, such a container can be exported and imported across different environments, making sure that everything works exactly the same.

Below are alternative commands that can be used to spin up the container.

Mind the differences from the original post:

  • adjusted to work on *nix systems,
  • no explicit Docker Machine is needed,
  • ability to render graphics (plots, etc.).

Instructions

  • Make sure that the Docker Engine is installed.
  • Download the kaggle/python image (note that it is nearly 8 GB):
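    docker pull kaggle/python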

  • At the end of ~/.bashrc add new aliases:
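A sketch of those aliases (written as shell functions), adapted from the original Kaggle post; the /tmp/working mount point and the X11 flags for plot rendering are assumptions:

    # Run a Python script from the current directory inside the kaggle/python image
    kpython() {
      docker run -v "$PWD":/tmp/working -w=/tmp/working --rm -it kaggle/python python "$@"
    }

    # Interactive IPython session; the DISPLAY/X11 socket mounts are an assumption
    # that lets plots render on *nix hosts running X11
    ikpython() {
      docker run -v "$PWD":/tmp/working -w=/tmp/working \
        -e DISPLAY="$DISPLAY" -v /tmp/.X11-unix:/tmp/.X11-unix \
        --rm -it kaggle/python ipython
    }

    # Jupyter notebook server, reachable at http://localhost:8888
    kjupyter() {
      docker run -v "$PWD":/tmp/working -w=/tmp/working -p 8888:8888 --rm -it kaggle/python \
        jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/working
    }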

  • Reload the shell:
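    source ~/.bashrc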

Those three commands are responsible for spinning up short-lived containers that execute the desired Python tasks.

Happy coding.