UPDATED: 23 Apr. 2017
- Download the Apache Spark distribution pre-built for Hadoop (link).
- Unpack the archive. This directory will later be referred to as $SPARK_HOME.
- Start PyCharm and create a new project: File → New Project. Call it "spark-demo".
- Inside the project, create a new Python file: New → Python File.
- Write a simple script counting the occurrences of A's and B's inside Spark's README.md file. Don't worry about the errors; we will fix them in the next steps.
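Such a script might look like the sketch below, modeled on the line-counting example from the Spark quick start. The README path, app name, and `main()` entry point are assumptions, not part of the original post; adjust the path to wherever you unpacked the distribution.

```python
def count_lines_with(lines, letter):
    # Plain-Python equivalent of the filter/count pair applied to the RDD below.
    return sum(1 for line in lines if letter in line)


def main(readme_path="/path/to/spark/README.md"):
    # Imported inside the function so the file can be opened and inspected
    # in PyCharm before the Spark libraries are wired up in the next steps.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "CountAB")  # app name is an arbitrary choice
    log_data = sc.textFile(readme_path).cache()

    num_as = log_data.filter(lambda s: "a" in s).count()
    num_bs = log_data.filter(lambda s: "b" in s).count()
    print("Lines with a: %d, lines with b: %d" % (num_as, num_bs))

    sc.stop()
```

To turn it into a runnable script, call `main()` under an `if __name__ == "__main__":` guard once the Spark libraries and environment variables from the following steps are in place.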
- Add the required libraries: PyCharm → Preferences… → Project: spark-demo → Project Structure → Add Content Root. Select all ZIP files from $SPARK_HOME/python/lib and apply the changes.
- Create a new run configuration: Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.
- Add environment variables. Inside the created configuration, add the required environment variables and save all changes.
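For a typical PySpark setup, two variables matter; the sketch below shows their usual shape. The paths and the exact py4j archive name vary between Spark releases, so treat these values as assumptions and check your own $SPARK_HOME/python/lib:

```shell
# SPARK_HOME points at the directory unpacked in the first steps.
SPARK_HOME=/path/to/spark
# PYTHONPATH lets the interpreter find the pyspark and py4j sources;
# replace <version> with the actual py4j version shipped in your distribution.
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-<version>-src.zip
```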
- Run the script: Run → Run 'Run with Spark'. You should see the script execute properly within a Spark context.
Now you can take advantage of the IDE's advanced features, such as debugging and code completion.