10 Essential PySpark Commands for Big Data Processing
Check out these 10 ways to leverage efficient, distributed dataset processing by combining the strengths of Spark and Python libraries for data science.

PySpark unifies the best of two worlds: the simplicity of Python language and libraries, and the scalability of Apache Spark.
This article lists — through code examples — 10 handy PySpark commands for turbocharging big data processing pipelines in Python projects.
Initial Setup
To illustrate the 10 commands, we first import the necessary libraries, initialize a Spark session, and load the publicly available penguins dataset, which describes penguin specimens belonging to three species using a mix of numerical and categorical features. The dataset is initially loaded into a Pandas DataFrame.
!pip install pyspark pandas
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder \
    .appName("PenguinAnalysis") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()
pdf = pd.read_csv("https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/penguins.csv")
The 10 Essential PySpark Commands for Big Data Processing
1. Loading data into a Spark DataFrame
Spark DataFrames differ from Pandas DataFrames in their distributed nature, lazy evaluation, and immutability. Converting a Pandas DataFrame into a Spark DataFrame is as simple as calling the createDataFrame command.
df = spark.createDataFrame(pdf)
This data structure is better suited for parallel data processing under the hood while maintaining many of the characteristics of regular DataFrames.
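As a quick sanity check, and to see lazy evaluation in action, you can inspect the inferred schema and trigger an action explicitly. Here is a minimal sketch (the 4000 g threshold is just an illustrative value):
df.printSchema()                             # inspect the schema inferred from the Pandas DataFrame
filtered = df.filter(df.body_mass_g > 4000)  # transformation only: nothing is computed yet
print(filtered.count())                      # action: triggers the distributed computation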
2. Select and filter data
The select and filter functions select specific columns (features) in a Spark DataFrame and filter rows (instances) where a specified condition holds. This example selects and filters penguins of the Gentoo species weighing over 5 kg.
df.select("species", "body_mass_g").filter(df.species == "Gentoo").filter(df.body_mass_g > 5000).show()
3. Group and aggregate data
Grouping data by categories is typically necessary to perform aggregations like computing average values or other statistics per category. This example shows how to use the groupBy and agg commands together to obtain summaries of average measurements per species.
df.groupBy("species").agg({"flipper_length_mm": "mean", "body_mass_g": "mean"}).show()
4. Window functions
Window functions perform calculations across rows related to the current row, such as rankings or running totals. This example creates a window specification that partitions the data by species and then ranks penguins by body mass within each species.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col
windowSpec = Window.partitionBy("species").orderBy(col("body_mass_g").desc())
df.withColumn("mass_rank", rank().over(windowSpec)).show()
5. Join operations
Like SQL joins for merging tables, PySpark join operations combine two DataFrames based on common columns, supporting several join types such as left, right, and outer joins. This example builds a DataFrame containing total counts per species, then applies a left join on the species column to attach to every observation in the original dataset the count associated with its species.
species_stats = df.groupBy("species").count()
df.join(species_stats, "species", "left").show()
6. User-defined functions
PySpark's udf enables the creation of user-defined functions: custom functions (often lambdas) that, once defined, can be used to apply complex transformations to DataFrame columns. This example defines and applies a user-defined function that maps penguin body mass into a new categorical variable describing the penguin's size as large or small.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
size_category = udf(lambda x: "Large" if x > 4500 else "Small", StringType())
df.withColumn("size_class", size_category(df.body_mass_g)).show()
7. Pivot tables
Pivot tables are used to spread the categories within a column, such as sex, into separate columns containing aggregations or summary statistics for each category. Below we create a pivot table that, for each species, calculates and displays the average bill length separately by sex:
df.groupBy("species").pivot("sex").agg({"bill_length_mm": "avg"}).show()
8. Handling missing values
The na.fill (or fillna) and dropna functions can be used to impute missing values or to remove rows that contain them. They can also be used jointly: first impute missing values in one specific column, then drop any observations that still contain missing values in other columns:
df.na.fill({"sex": "Unknown"}).dropna().show()
9. Save a processed dataset
Of course, once we have applied some processing operations to the dataset, we can save the result in various formats such as Parquet:
large_penguins = df.filter(df.body_mass_g > 4500)
large_penguins.write.mode("overwrite").parquet("large_penguins.parquet")
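Reading the saved data back, or writing to another format such as CSV, follows the same reader/writer API. A quick sketch (the output paths are just placeholders):
df_large = spark.read.parquet("large_penguins.parquet")  # load the Parquet files back
df_large.show(5)
large_penguins.write.mode("overwrite").option("header", True).csv("large_penguins_csv")  # CSV with header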
10. Perform SQL queries
The last command in this list allows you to embed SQL queries into a PySpark command and run them on temporary views of Spark DataFrames. This approach combines the flexibility of SQL with PySpark's distributed processing.
df.createOrReplaceTempView("penguins")
spark.sql("""
SELECT species, island,
AVG(body_mass_g) as avg_mass,
AVG(flipper_length_mm) as avg_flipper
FROM penguins
GROUP BY species, island
ORDER BY avg_mass DESC
""").show()
We highly recommend trying out these 10 example commands one by one in a Jupyter or Google Colab notebook so that you can see the resulting outputs. The dataset is readily available to be read and loaded directly into your code.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.