This project provides an isolated development environment for Apache tools such as Kafka and Spark (via PySpark), using a locally installed JDK and a Python virtual environment.
Works across macOS, Linux, and Windows (via WSL).
Install Windows Subsystem for Linux (WSL) by following the official Microsoft instructions.
Open a PowerShell terminal and start WSL by running:
wsl
Important: All remaining commands must be run from within the WSL environment. Inside WSL, Windows users run the same commands as macOS and Linux users.
Change to your home directory. Run this command, and all commands that follow, in your shell terminal (the $ prompt).
cd ~/
- Copy the template repo into your GitHub account. You can change the name as desired.
- Open a terminal in your "Projects" folder or wherever you keep your coding projects.
- Avoid using "Documents" or any folder that syncs automatically to OneDrive or other cloud services.
- Clone this repository into that folder. Windows users: clone into your default WSL directory.
If you changed the repository name, use that name in the command below.
For example, clone with a command like this, substituting your own GitHub account name and repository name:
git clone https://github.com/denisecase/pro-analytics-apache-starter
Then cd into your new folder (if you changed the name, use that):
cd pro-analytics-apache-starter
Review requirements.txt and comment or uncomment the specific packages your project needs (a hypothetical layout is sketched below).
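For illustration only, a requirements.txt organized this way might look like the sketch below; the package names here are hypothetical examples, so edit the file that ships with the starter rather than copying this one.

```
# Hypothetical layout: uncomment only what your project needs.
# Shared utilities
pandas

# Kafka projects
# kafka-python

# PySpark projects
# pyspark
```

Then create and activate a local Python virtual environment: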
python3 -m venv .venv
source .venv/bin/activate
Important reminder: Always run source .venv/bin/activate before working on the project.
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install --upgrade -r requirements.txt
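As an optional sanity check (not part of the starter's own steps), confirm that the active interpreter lives inside .venv:

python3 -c "import sys; print(sys.prefix)"

The printed path should end in .venv. Then make the setup and helper scripts executable: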
chmod +x ./01-setup/*.sh
chmod +x ./02-scripts/*.sh
chmod +x ./02-scripts/*.py
Verify compatible versions (see the instructions in the setup file), then install the required OpenJDK locally:
./01-setup/download-jdk.sh
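As an optional diagnostic (not part of the setup scripts), the snippet below, run inside the activated virtual environment, reports where Java appears to be configured. Depending on how the download script installs the JDK, JAVA_HOME and PATH may instead be handled by the run scripts, so treat missing values only as a hint.

```python
# Optional diagnostic: report how Java looks from the current shell environment.
import os
import shutil

print("JAVA_HOME:", os.environ.get("JAVA_HOME", "(not set)"))
print("java on PATH:", shutil.which("java") or "(not found)")
```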
Use the commands below to install only the tools your project requires:
./01-setup/install-kafka.sh
./01-setup/install-pyspark.sh
Start the Kafka service (keep this terminal running)
./02-scripts/run-kafka.sh
In a second terminal, create a Kafka topic
./kafka/bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092
In that second terminal, list Kafka topics
./kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
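If a Python Kafka client such as kafka-python is enabled in requirements.txt (an assumption, not something the starter necessarily includes), a quick round-trip like this sketch confirms the broker and the test-topic created above are working:

```python
# Minimal Kafka round-trip sketch; assumes the kafka-python package is installed.
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = "localhost:9092"  # broker started by run-kafka.sh
TOPIC = "test-topic"          # topic created above

# Send one message.
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
producer.send(TOPIC, b"hello from the starter")
producer.flush()
producer.close()

# Read messages back, giving up after 5 seconds of silence.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value.decode("utf-8"))
consumer.close()
```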
In that second terminal, stop the Kafka service when you are done working with Kafka. Use whichever of these commands works:
./kafka/bin/kafka-server-stop.sh
pkill -f kafka
Start PySpark (leave this terminal running)
./02-scripts/run-pyspark.sh
Open a browser to http://localhost:4040/ to monitor Spark jobs and execution details.
In a second terminal, test Spark
python3 02-scripts/test-pyspark.py
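The bundled test script is the authoritative check. For reference, a standalone smoke test in the same spirit might look like this minimal sketch (the app name and sample data are illustrative):

```python
# Minimal PySpark smoke test sketch: build a tiny DataFrame and show it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("starter-smoke-test").getOrCreate()

df = spark.createDataFrame(
    [(1, "kafka"), (2, "spark")],
    ["id", "tool"],
)
df.show()
print("Row count:", df.count())

spark.stop()
```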
When you are done, use that second terminal to stop PySpark:
pkill -f pyspark