- Download the Apache Spark distribution pre-built for Hadoop (link).
- Unpack the archive. The resulting directory will later be referred to as $SPARK_HOME.
- Start PyCharm and create a new project -
File → New Project. Call it "spark-demo".
- Inside the project create a new Python file -
New → Python File. Give it a name of your choice.
- Write a simple script counting the occurrences of A's and B's inside Spark's README.md file. Don't worry about the unresolved-import errors PyCharm reports; we will fix them in the next steps.
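A minimal sketch of such a script (the helper name, app name, and exact counting logic are our assumptions, not prescribed by the tutorial; the pyspark import will not resolve until the libraries are added in the next steps):

```python
# Sketch of the counting script; all names here are illustrative.

def count_letter(lines, letter):
    """Total occurrences of `letter` across an iterable of lines."""
    return sum(line.count(letter) for line in lines)

def main(path="README.md"):
    # Imported lazily so the helper above is usable even before the
    # Spark libraries are made available in the following steps.
    from pyspark import SparkContext

    sc = SparkContext("local", "spark-demo")
    try:
        lines = sc.textFile(path)
        # Distributed equivalent of count_letter: count per line, then sum.
        num_a = lines.map(lambda line: line.count("a")).sum()
        num_b = lines.map(lambda line: line.count("b")).sum()
        print("a: %d occurrences, b: %d occurrences" % (num_a, num_b))
    finally:
        sc.stop()

# Your script would end with a plain top-level call: main()
```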
- Add the required libraries.
PyCharm → Preferences ... → Project spark-demo → Project Structure → Add Content Root. Select all ZIP files from
$SPARK_HOME/python/lib. Apply changes.
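In effect, this step puts those ZIPs (the pyspark and py4j archives) on the interpreter's import path. A rough plain-Python equivalent, should you ever need the same setup outside the IDE (the function name is ours):

```python
import glob
import os
import sys

def add_spark_zips(spark_home, path_list=None):
    """Prepend every ZIP under <spark_home>/python/lib to the import
    path, mirroring PyCharm's 'Add Content Root' step.
    Returns the list of ZIPs found."""
    if path_list is None:
        path_list = sys.path
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    zips = sorted(glob.glob(pattern))
    for zip_path in zips:
        path_list.insert(0, zip_path)
    return zips
```

With $SPARK_HOME set, calling add_spark_zips(os.environ["SPARK_HOME"]) from a plain interpreter should then let import pyspark succeed.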
- Create a new run configuration. Go to
Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.
- Add environment variables. Inside the newly created configuration add the required environment variables (at minimum SPARK_HOME, pointing at the unpacked distribution). Save all changes.
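A typical minimal set looks like the following (the paths are placeholders for your own unpack location; PYSPARK_PYTHON is optional and only needed to pin a specific interpreter for the workers):

```
SPARK_HOME=/path/to/unpacked/spark-distribution
PYSPARK_PYTHON=/usr/bin/python
```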
- Run the script -
Run → Run 'Run with Spark'. You should see the script execute properly within a Spark context.
Now you can improve your working experience with the IDE's advanced features, such as debugging and code completion.