ci: Run bencher jobs on self-hosted runners #39272
jschwe left a comment:
One overall concern I have: We are not actually interested in the release profile, since release enables debug assertions, which has a major perf impact.
To keep the data on bencher clean, I think we should prefix the benchmark names (or some other thing we can filter on) with the cargo profile. Or even better, just always benchmark in production mode, since that is the only metric that matters.
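One possible way to do that tagging, sketched under the assumption that the raw results end up in a Bencher Metric Format (BMF) style JSON object keyed by benchmark name; the step, file name, and profile wiring below are illustrative, not the existing workflow:

```yaml
      # Illustrative step only: prepend the cargo profile to every benchmark
      # name so release and production results stay distinguishable in Bencher.
      - name: Prefix benchmark names with the cargo profile
        env:
          PROFILE: release   # hypothetical wiring; could come from the job matrix
        run: |
          jq --arg p "$PROFILE" 'with_entries(.key = "\($p)/" + .key)' \
            results.json > results-prefixed.json
```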
uses: ./.github/actions/runner-select
with:
  monitor-api-token: ${{ secrets.SERVO_CI_MONITOR_API_TOKEN }}
  github-hosted-runner-label: ubuntu-22.04
Is there a use case for falling back to GitHub-hosted runners for benchmarks?
in general, we try to make our usage of self-hosted runners gracefully degrade to github-hosted runners, because self-hosted runners are provided on a best-effort basis. and for bencher jobs, there are cases where we only measure binary size, so we use force-github-hosted-runner to force those cases to use github-hosted runners.
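As a minimal sketch of the selection pattern described here: `monitor-api-token` and `github-hosted-runner-label` appear in the diff above and `force-github-hosted-runner` is mentioned in this comment, but the step id and how its value gets wired up are assumptions, not the actual workflow:

```yaml
      # Prefer a self-hosted runner; fall back to the GitHub-hosted label when
      # none is available, or force GitHub-hosted for binary-size-only runs.
      - id: runner-select
        uses: ./.github/actions/runner-select
        with:
          monitor-api-token: ${{ secrets.SERVO_CI_MONITOR_API_TOKEN }}
          github-hosted-runner-label: ubuntu-22.04
          force-github-hosted-runner: ${{ inputs.binary-size-only }}  # hypothetical input
```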
there are cases where we only measure binary size, so we use force-github-hosted-runner to force those cases to use github-hosted runners.
I already lost context here (and am a bit time constrained now, so can't look at the code again), but do non-perf related metrics like binary size need to be handled by the bencher workflow? I would imagine that the metric could be a regular output of the build workflow (and either directly uploaded, or if tokens are a problem, eventually uploaded by the bencher perf job).
In general, we try to make our usage of self-hosted runners gracefully degrade to github-hosted runners,
I think that is great, but at least for perf jobs it doesn't really make sense to me, since the perf measurements on github-hosted runners are not useful in practice due to the high volatility (counting instructions might work there, but that metric is also not incredibly useful).
I'm mainly wondering if the added complexity (to handle github-hosted runners) here is really needed or not.
I already lost context here (and am a bit time constrained now, so can't look at the code again), but do non-perf related metrics like binary size need to be handled by the bencher workflow? I would imagine that the metric could be a regular output of the build workflow (and either directly uploaded, or if tokens are a problem, eventually uploaded by the bencher perf job).
i agree. i think this would be a good rework to do in a separate patch.
I think that is great, but at least for perf jobs it doesn't really make sense to me, since the perf measurements on github-hosted runners are not useful in practice due to the high volatility (counting instructions might work there, but that metric is also not incredibly useful).
that’s fair, yea, do you think we should skip running speedometer and dromaeo if no self-hosted runners are available?
I'm mainly wondering if the added complexity (to handle github-hosted runners) here is really needed or not.
i think this is the approach with the least added complexity overall, because it takes the existing bencher job and self-hosts it using the same pattern for selecting a runner as most of our other self-hosted jobs.
without fallback (even to a github-hosted no-op job), any downtime in the self-hosted runners would break everyone’s builds, which is currently a non-starter imo. i’m working towards a world where we don’t rely on github-hosted runners and our self-hosted runners are zero maintenance, but we’re not quite there yet.
that’s fair, yea, do you think we should skip running speedometer and dromaeo if no self-hosted runners are available?
Probably we should avoid uploading metrics we aren't going to use (since there presumably is a cost, and bencher.dev is a single-person project as far as I understand). I think bencher also recently introduced rate limiting (for non paying orgs).
i think this is the approach with the least added complexity overall, because it takes the existing bencher job and self-hosts it using the same pattern for selecting a runner as most of our other self-hosted jobs.
That's fair.
without fallback (even to a github-hosted no-op job), any downtime in the self-hosted runners would break everyone’s builds, which is currently a non-starter imo.
I'm not sure I agree here. I think we do need to transition towards a model where performance benchmarking is mandatory for pull requests to be merged. Performance regressions are just as serious as test regressions, so if we can't measure, then the MQ being broken is a feature in my opinion. The main problem here in my opinion is that we don't really have redundancy in terms of infrastructure maintainers. But I guess this discussion might be better had on zulip in the TSC channel to have a bit more visibility.
If we could trigger the performance testing workflow manually, and backfill performance data after a self-hosted runner downtime, so we can easily identify regressions merged to main (and revert), that could also be a viable short- to medium-term band-aid. I'm not sure how feasible that would be though.
I already lost context here (and am a bit time constrained now, so can't look at the code again), but do non-perf related metrics like binary size need to be handled by the bencher workflow? I would imagine that the metric could be a regular output of the build workflow (and either directly uploaded, or if tokens are a problem, eventually uploaded by the bencher perf job).
This was done because bencher requires all data for a single point to be sent together (if we sent binary size from the build workflow, it would show up as a separate point).
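A sketch of one way to satisfy that constraint, assuming both metrics are written out as BMF JSON earlier in the workflow: merge the files and upload them with a single `bencher run` invocation so they land on the same point. The file, project, and secret names are illustrative, and the `--adapter`/`--file` flags are taken from the public bencher CLI documentation, so they are worth re-checking against the pinned CLI version:

```yaml
      # Illustrative only: report binary size and runtime results together so
      # Bencher records them as one data point instead of two.
      - name: Merge BMF results into one file
        run: jq -s 'add' binary-size.json speedometer.json > combined.json
      - name: Upload the combined data point to Bencher
        run: |
          bencher run \
            --project servo \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --adapter json \
            --file combined.json
```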
i was inclined to agree with you that github-hosted benchmark results are not useful, but issues like #39641 show that even our existing github-hosted benchmarking can be very useful. i do expect that self-hosted benchmark results will be more useful though.
My assumption is that we would want to prevent PRs that regress performance from getting merged eventually (but rather sooner than later). We definitely need self-hosted runners to measure accurately enough to add thresholds that are low enough to be generally useful. GitHub-hosted runners could be used to prevent major regressions, but anything smaller than, let's say, 10-20% would easily fall through the cracks given the volatility we see. I agree that it is better than nothing, but we would also need to measure with GitHub-hosted runners on every commit to establish a baseline. Since our utilization is high, and we often use 100% of our GitHub-hosted runners, I'm not sure if this is worth it over just accepting that the merge queue would be broken if our self-hosted benchmark runners are down.
the current relationship we have with the self-hosted runners is that they’re provided on a best-effort basis, because no one is on call for outages. if you’re interested in changing that, we should discuss how to get there with the broader servo community. i don’t think this patch is the right place to unilaterally change that relationship and decide it’s acceptable for the merge queue to be broken for hours or days at a time.
I think moving to self-hosted runners is independent of this discussion, so I'd propose we move to the self-hosted runners first and then discuss in Zulip how strict we want to be.
BTW, in other projects like Chromium you don't get blocked due to performance regressions: the patch lands and then you get notified afterwards about the regressions; sometimes the patch has to be reverted, other times a fix can get merged to solve the regression. Just to show how other projects do this. The good part in Chromium is that the notifications are automatic, opening a bug and tagging the person who landed the original patch; if we could do something like that in Bencher it would be quite useful.
i don’t think this patch is the right place to unilaterally change that relationship and decide it’s acceptable for the merge queue to be broken for hours or days at a time.
My intention was not to suggest that we do this now as part of this patch. I mainly meant to state that I believe we want something like that long-term (and yes, that would require discussions on how to practically implement it). I was mainly wondering if we could just not require the bencher workflows for the MQ (instead of falling back), and make it a requirement once self-hosted is stable enough and we have relevant procedures in place. But this PR has been cooking for long enough, so I'd be fine with merging as is and picking up the discussion another time.
BTW, in other projects like Chromium you don't get blocked due to performance regressions, the patch lands and then you get notified afterwards about the regressions, sometimes the patch has to be reverted, other times a fix can get merged and solve the regression.
I only have second-hand knowledge about this, but my understanding was that chromium CI does have multiple layers, and there are performance tests run before merging (although the number of test-configurations run only after merging is considerably larger).
The good part in Chromium is that the notifications are automatic, opening a bug and tagging the person who landed the original patch; if we could do something like that in Bencher it would be quite useful.
We can set up thresholds in bencher, which would trigger a warning on bencher (as a comment in the PR if run before merging), but the usefulness of thresholds correlates strongly with the stability of the results.
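For illustration, a hedged sketch of what gating on such a warning could look like in a workflow step, assuming thresholds are configured on the Bencher project; `--err` (fail when an alert is generated) and the other flags are taken from the public `bencher run` documentation and should be verified against the CLI version in use, and the branch, testbed, and secret values are placeholders:

```yaml
      # Illustrative only: upload results and fail the job if any configured
      # Bencher threshold generates an alert.
      - name: Upload benchmark results and gate on alerts
        run: |
          bencher run \
            --project servo \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --branch "${{ github.head_ref || github.ref_name }}" \
            --testbed servo-ubuntu2204-bench \
            --adapter json \
            --file results.json \
            --err
```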
note that our runtime perf benchmarks already exclusively use the release profile on linux, and i don’t think it’s true that we are completely uninterested in them or that those metrics don’t matter at all. regressions in production builds are likely to be regressions in release builds and vice versa, even if the relationship isn’t perfect. production would probably be better, and i would have no complaints about us changing this in a separate patch. in the meantime, how would you like the profile to be prefixed in the data? it looks like the harmony benchmark results put it in the benchmark name, but i can’t find the code that does that?
@sagudev this should be ready for review :)
sagudev left a comment:
LGTM, although I would also wait for @jschwe.
OFF-TOPIC: Why do we not run speedometer on pushes to main? https://github.com/servo/servo/actions/runs/18276911520/job/52031668893
It's a bit hacky IMHO, but when running: see python/servo/testing_commands.py, lines 628 to 630 (at 5887e1e).
Doing it in a different patch is fine, as long as we make sure the results are prefixed in some way, so that we wouldn't suddenly get a large perf improvement in some graph just because we switched the profile. My main concern for the release profile is that the volatility is higher than in production, which directly constrains our ability to set a lower threshold for what a "regression" is.
that seems to be a try run; did you mean to link a different run?
some bencher jobs, specifically linux release profile jobs, measure runtime perf by running speedometer and dromaeo. doing this on GitHub-hosted runners is suboptimal, because GitHub-hosted runners are not under our control and their performance can vary wildly depending on what hosts we get and how busy they are.
this patch depends on #39270 and #39271, and the new ci3 and ci4 servers deployed in servo/ci-runners#49. these servers provide a more controlled environment for benchmarking by using known hardware that runs one job at a time and no other work (servo/project#160), with some of the techniques we’ve developed for accurate measurement:
to use ci3 and ci4 for bencher jobs, we add them to the list of self-hosted runner servers, then make the bencher workflow try to find a servo-ubuntu2204-bench runner if speedometer and/or dromaeo have been requested. to avoid mixing data, we set the bencher “testbed” based on where the runner came from:
Testing:
Fixes: #39269
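As a rough illustration of the runner fallback and testbed split described above (the job name, outputs, and testbed labels here are assumptions for the sketch, not the actual workflow contents):

```yaml
  # Sketch only: run on whichever runner was selected, and report to a testbed
  # that matches it so self-hosted and GitHub-hosted measurements never mix.
  bencher:
    needs: select-runner   # hypothetical job wrapping .github/actions/runner-select
    runs-on: ${{ needs.select-runner.outputs.runner-label }}
    env:
      BENCHER_TESTBED: ${{ needs.select-runner.outputs.is-self-hosted == 'true' && 'servo-ubuntu2204-bench' || 'github-ubuntu-2204' }}
    steps:
      - name: Show which testbed this run reports to
        run: echo "Reporting to Bencher testbed '$BENCHER_TESTBED'"
```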