ci: Tokenless self-hosted runner select #39900

delan · 2025-10-15T14:40:09Z

one of the biggest gaps in our self-hosted runner coverage is that pull request checks can’t run on self-hosted runners, because reserving a runner currently requires a MONITOR_API_TOKEN secret that is unavailable to pull requests.

this patch replaces the authenticated requests to POST /profile/<image>/take?unique_id&qualified_repo&run_id with unauthenticated requests to POST /select-runner?unique_id&qualified_repo&run_id, in the runner select jobs. to guard the CI monitors against abuse, the runner select job must prove that its request was genuine by uploading a small artifact containing the parameters of that request, which the monitors can then verify. once the job is marked as done, the monitors will refuse further reservation requests.

note that the old authenticated endpoint still works, so this patch can be reverted if anything breaks, or we can turn it off with the usual methods (force-github-hosted-runner or NO_SELF_HOSTED_RUNNERS).

Testing:

before (38min) https://github.com/servo/servo/actions/runs/18547747551/job/52869138042
after (8min) https://github.com/servo/servo/actions/runs/18834479889/job/53732619984

Fixes: part of #38141

Signed-off-by: Delan Azabani <dazabani@igalia.com>

.github/actions/runner-select/action.yml

sagudev · 2025-10-15T17:59:44Z

another interesting idea while reading this PR: can we just trigger some kind of github webhook (label or smth with less noise) and subscribe monitor to that webhook. This way we do not need to have any "public" API for this (all data would be encoded in this webhook event).

Signed-off-by: Delan Azabani <dazabani@igalia.com>

delan · 2025-10-16T01:33:27Z

another interesting idea while reading this PR: can we just trigger some kind of github webhook (label or smth with less noise) and subscribe monitor to that webhook. This way we do not need to have any "public" API for this (all data would be encoded in this webhook event).

potentially yea! it might be tricky for reserve-or-fallback scenarios like we do now, but for self-hosted-only workflows it looks like a good way to communicate demand and autoscale the cluster.

delan · 2025-10-18T07:09:20Z

this should be ready for review now :)

.github/actions/runner-select/action.yml

jschwe

Left some minor comments, but otherwise looks good to me. I believe sagudev also wanted to review this, so we should probably wait for that review too.

sagudev

This is certainly interesting to review, given that there is also code in https://github.com/servo/ci-runners/blob/61dd2d53d1c5af41e2f83d2aaa65961e5e6d797c/monitor/src/main.rs#L237. I think we should probably remove some complexity and UI spamming (with artifacts).

.github/actions/runner-select/action.yml

delan · 2025-10-20T11:11:32Z

This is certainly interesting to review, given that there is also code in https://github.com/servo/ci-runners/blob/61dd2d53d1c5af41e2f83d2aaa65961e5e6d797c/monitor/src/main.rs#L237.

monitor code is fair game! to be honest, i should probably start getting my monitor changes reviewed by someone else, but no one has expressed interest prior to now.

I think we should probably remove some complexity and UI spamming (with artifacts).

thanks, i’ll see what i can do. i like the idea of using the artifact creation time, because determining the job run start time is annoyingly difficult (it requires us to put a uuid in the name so we can uniquely identify the job in the job tree).

Signed-off-by: Delan Azabani <dazabani@igalia.com>

sagudev

otherwise LGTM

.github/actions/runner-select/action.yml

Signed-off-by: Delan Azabani <dazabani@igalia.com>

this patch extends servo#39900, making lint jobs run on self-hosted runners if available. Testing: - before (15min) <https://github.com/servo/servo/actions/runs/18532931828/job/52820332815> - after (<3min) <https://github.com/servo/servo/actions/runs/18839433347/job/53748113347> Fixes: part of servo#38141 Signed-off-by: Delan Azabani <dazabani@igalia.com>

currently our self-hosted runner system falls back to github-hosted runners if there’s no available capacity at the exact moment of the select runner request. this is suboptimal, because if the job would take 5x as long on github-hosted runners, then you could wait up to 80% of that time for a self-hosted runner and still win. this patch implements a new global queue service that allows self-hosted runner jobs to wait for available capacity. the service will run on [one server](https://ci0.servo.org) for now, as a single queue that dispatches to all available servers, like any efficient supermarket. queueing a job works like this: 1. **POST /profile/<profile_key>/enqueue?<unique_id>&<qualified_repo>&<run_id>** (tokenful) or **POST /enqueue?<unique_id>&<qualified_repo>&<run_id>** (tokenless) to enqueue a job. - both endpoints return a random token that is used to authenticate the client in the next step. - the tokenless endpoint validates that the request came from an authorised job, [using an artifact](servo/servo#39900). - the request is rejected if no servers are configured to target non-zero runners for the requested profile, because we may never be able to satisfy it. - there are no limits to queue depth (at least not yet), but clients probably have better knowledge of the nature of their job anyway, and in theory, they could use that knowledge to decide how long to wait (see below). 2. **POST /take/<unique_id>?<token>** to try to take the runner for the enqueued job. once capacity is available, this endpoint is effectively proxied to POST /profile/<profile_key>/take on one of the underlying servers. - if the client failed to provide the correct token from the previous step, the response is HTTP 403. - if the unique id is unknown, because it expired or the queue service restarted, the response is HTTP 404. - if there’s no capacity yet, the response is HTTP 503. repeat after waiting for ‘Retry-After’ seconds. - if taking the runner was successful, the response is HTTP 200, with the runner details as JSON. - if taking the runner was somehow unsuccessful (bug), the response is HTTP 200, with `null` as JSON. this sucks, to be honest, but it was also true for the underlying monitor API. - when we fix this, we should be careful about curl --retry. - clients are free to abandon a queued job without actually taking it, by doing nothing for 30 seconds. for now, the runner-select action client abandons a queued job if it has been waiting for one hour. i’ve added a “self-test” workflow that can be manually dispatched to test the new flow (e.g. [ok 1](https://github.com/delan/servo-ci-runners/actions/runs/19192407233/job/54868629052#step:3:138), [ok 2](https://github.com/delan/servo-ci-runners/actions/runs/19192417407/job/54868653437#step:3:138), [ok 3](https://github.com/delan/servo-ci-runners/actions/runs/19192424667/job/54868671844#step:3:138), [unsatisfiable](https://github.com/delan/servo-ci-runners/actions/runs/19192441772/job/54868712126#step:3:138), [unauthorised](https://github.com/delan/servo-ci-runners/actions/runs/19192458274/job/54868749313#step:3:138)). you can also play around with this locally by spinning up a monitor and a queue on your own machine, then sending the requests by hand (so three separate terminals): - `$ cargo build && sudo IMAGE_DEPS_DIR=$(nix eval --raw .\#image-deps) LIB_MONITOR_DIR=. $CARGO_TARGET_DIR/debug/monitor` - `$ cargo build && sudo IMAGE_DEPS_DIR=$(nix eval --raw .\#image-deps) LIB_MONITOR_DIR=. $CARGO_TARGET_DIR/debug/queue` - `$ unique_id=$RANDOM; curl --fail-with-body -sSX POST --retry-max-time 3600 --retry 3600 --retry-delay 1 'https://192.168.100.1:8002/take/'"$unique_id"'?token='"$(curl --fail-with-body -sSX POST --oauth2-bearer "$SERVO_CI_MONITOR_API_TOKEN" 'https://192.168.100.1:8002/profile/servo-windows10/enqueue?unique_id='"$unique_id"'&qualified_repo=delan/servo&run_id=123')"`

ci: Tokenless self-hosted runner select

95a1fee

Signed-off-by: Delan Azabani <dazabani@igalia.com>

delan force-pushed the tokenless-self-hosted-runner-select branch from 9094e6b to 95a1fee Compare October 15, 2025 14:48

jschwe reviewed Oct 15, 2025

View reviewed changes

.github/actions/runner-select/action.yml Outdated Show resolved Hide resolved

.github/actions/runner-select/action.yml Outdated Show resolved Hide resolved

jschwe reviewed Oct 15, 2025

View reviewed changes

.github/actions/runner-select/action.yml Show resolved Hide resolved

delan added 2 commits October 16, 2025 09:17

Fix early fallback checks

8a1a95d

Signed-off-by: Delan Azabani <dazabani@igalia.com>

Always mark the job as done

7a0a1d8

Signed-off-by: Delan Azabani <dazabani@igalia.com>

delan marked this pull request as ready for review October 16, 2025 03:06

servo-highfive added the S-awaiting-review There is new code that needs to be reviewed. label Oct 16, 2025

This was referenced Oct 16, 2025

CI: improve coverage of self-hosted runners #38141

Open

ci: Run lint jobs on self-hosted runners #39915

Merged

sagudev self-requested a review October 19, 2025 11:14

jschwe reviewed Oct 20, 2025

View reviewed changes

.github/actions/runner-select/action.yml Show resolved Hide resolved

jschwe approved these changes Oct 20, 2025

View reviewed changes

sagudev reviewed Oct 20, 2025

View reviewed changes

.github/actions/runner-select/action.yml Outdated Show resolved Hide resolved

delan added 4 commits October 27, 2025 15:38

Rename variable for clarity

8b8db60

Signed-off-by: Delan Azabani <dazabani@igalia.com>

Stop creating another artifact to mark the job as done

d86f99b

Signed-off-by: Delan Azabani <dazabani@igalia.com>

Set the lowest possible artifact retention

7f4bbb8

Signed-off-by: Delan Azabani <dazabani@igalia.com>

Delete artifact

fbb472e

Signed-off-by: Delan Azabani <dazabani@igalia.com>

delan requested a review from sagudev October 27, 2025 08:39

sagudev approved these changes Oct 27, 2025

View reviewed changes

.github/actions/runner-select/action.yml Show resolved Hide resolved

servo-highfive removed the S-awaiting-review There is new code that needs to be reviewed. label Oct 27, 2025

Pass the GITHUB_TOKEN as an input

8fb40cf

Signed-off-by: Delan Azabani <dazabani@igalia.com>

servo-highfive added the S-awaiting-review There is new code that needs to be reviewed. label Oct 27, 2025

delan enabled auto-merge October 27, 2025 09:54

delan added this pull request to the merge queue Oct 27, 2025

servo-highfive added the S-awaiting-merge The PR is in the process of compiling and running tests on the automated CI. label Oct 27, 2025

auto-merge was automatically disabled October 27, 2025 10:36
Pull Request is not mergeable

Merged via the queue into main with commit 57c164c Oct 27, 2025
33 checks passed

delan deleted the tokenless-self-hosted-runner-select branch October 27, 2025 11:19

servo-highfive removed the S-awaiting-merge The PR is in the process of compiling and running tests on the automated CI. label Oct 27, 2025

delan added a commit to servo/ci-runners that referenced this pull request Nov 4, 2025

analysis: Usage report for 2025-11-04 (since servo/servo#39900)

199b834

delan mentioned this pull request Nov 8, 2025

queue: initial queue service servo/ci-runners#69

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ci: Tokenless self-hosted runner select #39900

ci: Tokenless self-hosted runner select #39900

Uh oh!

delan commented Oct 15, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sagudev commented Oct 15, 2025 •

edited

Loading

Uh oh!

delan commented Oct 16, 2025

Uh oh!

delan commented Oct 18, 2025

Uh oh!

Uh oh!

jschwe left a comment

Uh oh!

sagudev left a comment

Uh oh!

Uh oh!

delan commented Oct 20, 2025

Uh oh!

sagudev left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

ci: Tokenless self-hosted runner select #39900

ci: Tokenless self-hosted runner select #39900

Uh oh!

Conversation

delan commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sagudev commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

delan commented Oct 16, 2025

Uh oh!

delan commented Oct 18, 2025

Uh oh!

Uh oh!

jschwe left a comment

Choose a reason for hiding this comment

Uh oh!

sagudev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

delan commented Oct 20, 2025

Uh oh!

sagudev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

delan commented Oct 15, 2025 •

edited

Loading

sagudev commented Oct 15, 2025 •

edited

Loading