Practical Streaming Analytics at Scale with Spark on GCP and Confluent Kafka
1. About
This repo is a hands-on lab for streaming from Kafka on Confluent Cloud into BigQuery with Apache Spark Structured Streaming on Dataproc Serverless. It aims to demystify the products showcased rather than to build a production-grade streaming application: it features a minimum viable example that joins a stream from Kafka with a static source in BigQuery and sinks the result to BigQuery.
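A minimal sketch of that flow in PySpark is below, assuming a hypothetical Confluent Cloud topic named user_logins, a hypothetical static BigQuery table of active promotions, and placeholder credentials and buckets; adapt all names to your environment.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-bq-lab").getOrCreate()

# Hypothetical schema of the JSON payload on the Kafka topic.
login_schema = StructType([
    StructField("user_id", StringType()),
    StructField("region", StringType()),
    StructField("event_ts", LongType()),
])

# 1. Stream from Kafka on Confluent Cloud (SASL_SSL / PLAIN auth).
logins = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVERS>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        'username="<API_KEY>" password="<API_SECRET>";',
    )
    .option("subscribe", "user_logins")
    .load()
    .select(from_json(col("value").cast("string"), login_schema).alias("e"))
    .select("e.*")
)

# 2. Static side of the join: a plain batch read from BigQuery.
promotions = (
    spark.read.format("bigquery")
    .option("table", "<PROJECT>.promotions.active_promotions")
    .load()
)

# 3. Stream-static join, then sink the enriched stream to BigQuery.
enriched = logins.join(promotions, on="region", how="left")
(
    enriched.writeStream.format("bigquery")
    .option("table", "<PROJECT>.lab_dataset.promo_entries")
    .option("temporaryGcsBucket", "<STAGING_BUCKET>")
    .option("checkpointLocation", "gs://<BUCKET>/checkpoints/kafka-to-bq")
    .start()
    .awaitTermination()
)
```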
Audience
Data engineers
Prerequisites
Access to Google Cloud and Confluent Kafka
Basic knowledge of the Google Cloud services featured in the lab, Kafka, and Spark is helpful
Duration
1 hour from start to completion
Cost
< $100
Goals
Just enough knowledge of Confluent Kafka on GCP for streaming
Just enough knowledge of Dataproc Serverless for Spark
Just enough Terraform to repurpose for your use case
Quickstart code that can be repurposed for your use case
2. Architecture
2.1. Solution Architecture
About Dataproc Serverless Spark Batches:
Fully managed, autoscaling, secure Spark jobs as a service: the model eliminates administration overhead and resource contention, simplifies development, and accelerates speed to production. Learn more about the service here; a minimal batch-submission sketch follows the list below.
Find templates that accelerate speed to production here
Want Google Cloud to train you on Serverless Spark for free? Reach out to us here
Try out our other Serverless Spark centric hands-on labs here
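As a rough orientation, the sketch below submits such a PySpark job as a Dataproc Serverless batch with the google-cloud-dataproc Python client (the same can be done with `gcloud dataproc batches submit pyspark`); the project, region, subnet, runtime version, and GCS paths are placeholders, not values from this lab.

```python
from google.cloud import dataproc_v1

PROJECT, REGION = "<PROJECT_ID>", "<REGION>"

# Regional endpoint for the Batch Controller API.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://<BUCKET>/scripts/kafka_to_bq.py",
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        version="2.1",
        # Pull the Kafka source at runtime; the exact coordinates depend on
        # the Spark version bundled with the serverless runtime you pick.
        properties={
            "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0"
        },
    ),
    environment_config=dataproc_v1.EnvironmentConfig(
        execution_config=dataproc_v1.ExecutionConfig(subnetwork_uri="<SUBNET>")
    ),
)

# create_batch returns a long-running operation; result() blocks until the
# serverless batch completes (or fails).
operation = client.create_batch(
    parent=f"projects/{PROJECT}/locations/{REGION}",
    batch=batch,
    batch_id="kafka-to-bq-lab",
)
print(operation.result().state)
```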
2.2. Development Environment
Note: The above notebook environment is not covered in this lab, but is showcased in our Spark MLOps lab.
3. Use Case
The use case is a basic sales-and-marketing scenario centered on campaigns and promotions. Assume users log on to a website; their login events are streamed to Kafka, and each user is automatically entered into a promotional drawing (lotto) for a trip.
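To make the stream concrete, here is a hypothetical login-event producer sketch using confluent-kafka-python; the topic name, event fields, and credentials are illustrative only and not part of the lab code.

```python
import json
import time
import uuid

from confluent_kafka import Producer

# Confluent Cloud connection settings (placeholders).
producer = Producer({
    "bootstrap.servers": "<BOOTSTRAP_SERVERS>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
})

# One website-login event; on the consuming side, each login becomes an
# automatic entry into the trip promotion.
event = {
    "user_id": str(uuid.uuid4()),
    "region": "us-west1",
    "event_ts": int(time.time() * 1000),
}
producer.produce("user_logins", key=event["user_id"], value=json.dumps(event))
producer.flush()
```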