Apache Spark™ - Unified Engine for large-scale data analytics
Unified engine for large-scale data analytics
Get Started

What is Apache Spark™?
Apache Spark™ is a multi-language engine for executing data engineering,
data science, and machine learning on single-node machines or clusters.
Simple.
Fast.
Scalable.
Unified.
Key features
Batch/streaming data
Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
SQL analytics
Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
Data science at scale
Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
Machine learning
Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
Run now
Install with 'pip'
$ pip install pyspark
$ pyspark
Use the official Docker image
$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()

from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])
# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)
# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)
# Fit the model to the training data
model = rf.fit(train_df)
# Generate predictions on the test dataset.
model.transform(test_df).show()

df = spark.read.csv("accounts.csv", header=True)
# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")
# Generate summary statistics
filtered_df.summary().show()

Run now
$ docker run -it --rm spark /opt/spark/bin/spark-sql
spark-sql>
SELECT
name.first AS first_name,
name.last AS last_name,
age
FROM json.`logs.json`
WHERE age > 21;

Run now
$ docker run -it --rm spark /opt/spark/bin/spark-shell
scala>
val df = spark.read.json("logs.json")
df.where("age > 21")
  .select("name.first").show()

Run now
$ docker run -it --rm spark /opt/spark/bin/spark-shell
scala>
Dataset<Row> df = spark.read().json("logs.json");
df.where("age > 21")
  .select("name.first").show();

Run now
$ docker run -it --rm spark:r /opt/spark/bin/sparkR
>
df <- read.json(path = "logs.json")
df <- filter(df, df$age > 21)
head(select(df, df$name.first))

The most widely-used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark™.
Over 2,000 contributors to the open source project from industry and academia.
Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
Data science and Machine learning
SQL analytics and BI
Storage and Infrastructure
Spark SQL engine: under the hood
Apache Spark™ is built on an advanced distributed SQL engine for large-scale data
Adaptive Query Execution
Spark SQL adapts the execution plan at runtime, such as automatically setting the number of reducers and join algorithms.
Support for ANSI SQL
Use the same SQL you’re already comfortable with.
Structured and unstructured data
Spark SQL works on structured tables and unstructured data such as JSON or images.
[Chart: TPC-DS 1TB (no stats), with vs. without Adaptive Query Execution. AQE accelerates TPC-DS queries up to 8x.]
Join the community
Spark has a thriving open source community, with contributors from around the globe building features, writing documentation, and assisting other users.