The following post should serve as a guide for those trying to understand the inner workings of Apache Spark. I initially created it to organize my own knowledge and extended it later on. It assumes, however, that you possess some basic knowledge of Spark.
All examples are written in Python 2.7 running on PySpark 2.1, but the rules are very similar for the other APIs.