Temporal URL Scraper

This repo implements a periodic URL scraper in Temporal.

URL scraping could be implemented with cron workflows and a 1-1 mapping between a URL and an activity. This scales poorly: every URL you add costs one more activity execution per scrape interval.

To overcome these performance issues, we'll define the following goal:

Batch URLs so that a single activity can process multiple URLs at once, i.e. the number of activities executed per interval should approximately equal (number of URLs) / MAX_BATCH_SIZE.
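To make the goal concrete, here is a minimal sketch of the batching arithmetic in plain TypeScript. It does not use the Temporal SDK and is not the code from this repo: the toBatches and activitiesPerInterval helpers, the example URLs, and the MAX_BATCH_SIZE value of 3 are all assumptions made for illustration.

```typescript
// Hypothetical value; the sample's real MAX_BATCH_SIZE may differ.
const MAX_BATCH_SIZE = 3;

// Split a list of URLs into consecutive batches of at most `maxBatchSize`.
function toBatches(urls: string[], maxBatchSize: number): string[][] {
  if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
    throw new Error('maxBatchSize must be a positive integer');
  }
  const batches: string[][] = [];
  for (let i = 0; i < urls.length; i += maxBatchSize) {
    batches.push(urls.slice(i, i + maxBatchSize));
  }
  return batches;
}

// Activities needed per scrape interval when every batch is full,
// except possibly the last one.
function activitiesPerInterval(urlCount: number, maxBatchSize: number): number {
  return Math.ceil(urlCount / maxBatchSize);
}

// Ten made-up URLs, purely for illustration.
const urls: string[] = [];
for (let i = 0; i < 10; i++) {
  urls.push('https://example.com/page/' + i);
}

const batches = toBatches(urls, MAX_BATCH_SIZE);
console.log(batches.length); // 4 batches, versus 10 activities with a 1-1 mapping
```

With one activity per batch, ten URLs cost four activity executions per interval instead of ten, and the saving grows with MAX_BATCH_SIZE.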

Running this sample

1. Make sure Temporal Server is running locally (see the quick install guide).
2. Run npm install to install dependencies.
3. Run npm run start.watch to start the Worker.
4. Open client.ts and enter a URL to scrape.
5. In another shell, run npm run workflow to run the Workflow Client.

The client should log the ID of the Workflow it starts, and you should see that Workflow in the Temporal Web UI.

You'll see a chain of logs tracing the flow of the newly added URL. After a few seconds, you should see it attempt to scrape the URL once every SCRAPE_INTERVAL.

Ensuring batches with gaps are filled after removing a scraped url from the batch

Initially, our batch id assigner doesn't care about past batches: it assumes that all batches before currentBatchId are completely full. This becomes a problem when we want to stop scraping a URL. If we remove a URL from a batch's URL list, that batch now has a gap, and our batching becomes less efficient (breaking the goal at the top of this doc).

To overcome this issue, we will record the batches with gaps in the batch id assigner and prioritise batches with gaps when assigning new urls to a batch.
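The idea can be sketched in plain TypeScript as follows. This is an illustration only, not the repo's implementation. In the sample the assigner runs as a singleton Workflow, whereas this sketch is an ordinary in-memory class. The GapAwareBatchIdAssigner name, its assign and release methods, and the batch size of 2 are all hypothetical.

```typescript
// Hypothetical sketch of a gap-aware batch id assigner; all names are illustrative.
class GapAwareBatchIdAssigner {
  private readonly maxBatchSize: number;
  private currentBatchId = 0;
  private currentBatchSize = 0;
  // One entry per free slot left behind in a batch older than currentBatchId.
  private readonly freeSlots: number[] = [];

  constructor(maxBatchSize: number) {
    if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
      throw new Error('maxBatchSize must be a positive integer');
    }
    this.maxBatchSize = maxBatchSize;
  }

  // Returns the batch id a new URL should join, preferring batches with gaps.
  assign(): number {
    const gapBatchId = this.freeSlots.shift();
    if (gapBatchId !== undefined) {
      return gapBatchId;
    }
    if (this.currentBatchSize >= this.maxBatchSize) {
      this.currentBatchId += 1;
      this.currentBatchSize = 0;
    }
    this.currentBatchSize += 1;
    return this.currentBatchId;
  }

  // Records that a URL was removed from `batchId`. The caller is assumed to
  // release only slots that were previously assigned.
  release(batchId: number): void {
    if (!Number.isInteger(batchId) || batchId < 0 || batchId > this.currentBatchId) {
      throw new Error('unknown batch id: ' + batchId);
    }
    if (batchId === this.currentBatchId) {
      if (this.currentBatchSize === 0) {
        throw new Error('current batch is already empty');
      }
      this.currentBatchSize -= 1;
      return;
    }
    this.freeSlots.push(batchId);
  }
}

const assigner = new GapAwareBatchIdAssigner(2);
const initialIds = [assigner.assign(), assigner.assign(), assigner.assign()]; // [0, 0, 1]
assigner.release(0); // stop scraping one URL from batch 0, leaving a gap
const refilledId = assigner.assign(); // 0: the gap is filled before batch 1 grows
const laterIds = [assigner.assign(), assigner.assign()]; // [1, 2]
console.log(initialIds, refilledId, laterIds);
```

Tracking free slots in a queue keeps older batches full, so the number of batches stays close to (number of URLs) / MAX_BATCH_SIZE even as URLs are removed.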

Upgrading/Versioning

Changing SCRAPE_FREQUENCY doesn't require patching as X

TODOs

Cleanup with continueAsNew
Heuristic estimation guidelines for continueAsNew & event history
Activity implementation
Retry failed scrapes via activity heartbeating
Re-assign batch gaps after removing a url from a batch
What to do if you terminate the batch id singleton
How to handle failures inside batch id assigner singleton such that it doesn't crash it (e.g. via handler.signal etc)

Overview

https://app.excalidraw.com/l/60O4zdIqdtq/6iZauXuE2DA
