Practical Streaming Analytics at Scale with Spark on GCP and Confluent Kafka
1. About
This repo is a hands-on lab for streaming from Kafka on Confluent Cloud into BigQuery with Apache Spark Structured Streaming on Dataproc Serverless. It aims to demystify the products showcased rather than to build a production-grade streaming application: it features a minimum viable example that joins a stream from Kafka with a static source in BigQuery and sinks the result to BigQuery.
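A minimal sketch of that flow in PySpark is below, assuming a hypothetical Confluent Cloud topic named user_logins, a hypothetical static BigQuery table of active promotions, and placeholder credentials and buckets; adapt all names to your environment.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-bq-lab").getOrCreate()

# Hypothetical schema of the JSON payload on the Kafka topic.
login_schema = StructType([
    StructField("user_id", StringType()),
    StructField("region", StringType()),
    StructField("event_ts", LongType()),
])

# 1. Stream from Kafka on Confluent Cloud (SASL_SSL / PLAIN auth).
logins = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVERS>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        'username="<API_KEY>" password="<API_SECRET>";',
    )
    .option("subscribe", "user_logins")
    .load()
    .select(from_json(col("value").cast("string"), login_schema).alias("e"))
    .select("e.*")
)

# 2. Static side of the join: a plain batch read from BigQuery.
promotions = (
    spark.read.format("bigquery")
    .option("table", "<PROJECT>.promotions.active_promotions")
    .load()
)

# 3. Stream-static join, then sink the enriched stream to BigQuery.
enriched = logins.join(promotions, on="region", how="left")
(
    enriched.writeStream.format("bigquery")
    .option("table", "<PROJECT>.lab_dataset.promo_entries")
    .option("temporaryGcsBucket", "<STAGING_BUCKET>")
    .option("checkpointLocation", "gs://<BUCKET>/checkpoints/kafka-to-bq")
    .start()
    .awaitTermination()
)
```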
Audience
Data engineers
Prerequisites
Access to Google Cloud and Confluent Kafka
Basic knowledge of the Google Cloud services featured in the lab, Kafka, and Spark is helpful
Duration
1 hour from start to completion
Cost
< $100
Goals
Just enough knowledge of Confluent Kafka on GCP for streaming
Just enough knowledge of Dataproc Serverless for Spark
Just enough Terraform to repurpose for your use case
Quickstart code that can be repurposed for your use case
2. Architecture
2.1. Solution Architecture
About Dataproc Serverless Spark Batches:
Fully managed, autoscaling, secure Spark jobs as a service: the model eliminates administration overhead and resource contention, simplifies development, and accelerates speed to production. Learn more about the service here; a minimal batch-submission sketch follows the list below.
Find templates that accelerate speed to production here
Want Google Cloud to train you on Serverless Spark for free? Reach out to us here
Try out our other Serverless Spark centric hands-on labs here
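As a rough orientation, the sketch below submits such a PySpark job as a Dataproc Serverless batch with the google-cloud-dataproc Python client (the same can be done with `gcloud dataproc batches submit pyspark`); the project, region, subnet, runtime version, and GCS paths are placeholders, not values from this lab.

```python
from google.cloud import dataproc_v1

PROJECT, REGION = "<PROJECT_ID>", "<REGION>"

# Regional endpoint for the Batch Controller API.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://<BUCKET>/scripts/kafka_to_bq.py",
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        version="2.1",
        # Pull the Kafka source at runtime; the exact coordinates depend on
        # the Spark version bundled with the serverless runtime you pick.
        properties={
            "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0"
        },
    ),
    environment_config=dataproc_v1.EnvironmentConfig(
        execution_config=dataproc_v1.ExecutionConfig(subnetwork_uri="<SUBNET>")
    ),
)

# create_batch returns a long-running operation; result() blocks until the
# serverless batch completes (or fails).
operation = client.create_batch(
    parent=f"projects/{PROJECT}/locations/{REGION}",
    batch=batch,
    batch_id="kafka-to-bq-lab",
)
print(operation.result().state)
```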
2.2. Development Environment
Note: The above notebook environment is not covered in this lab, but is showcased in our Spark MLOps lab.
3. Use Case
The use case is a basic sales-and-marketing scenario centered on campaigns and promotions. Assume users log on to a website; their login events are streamed to Kafka, and each user is automatically entered into a promotional drawing (lotto) for a trip.
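To make the stream concrete, here is a hypothetical login-event producer sketch using confluent-kafka-python; the topic name, event fields, and credentials are illustrative only and not part of the lab code.

```python
import json
import time
import uuid

from confluent_kafka import Producer

# Confluent Cloud connection settings (placeholders).
producer = Producer({
    "bootstrap.servers": "<BOOTSTRAP_SERVERS>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
})

# One website-login event; on the consuming side, each login becomes an
# automatic entry into the trip promotion.
event = {
    "user_id": str(uuid.uuid4()),
    "region": "us-west1",
    "event_ts": int(time.time() * 1000),
}
producer.produce("user_logins", key=event["user_id"], value=json.dumps(event))
producer.flush()
```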