ci: Run bencher jobs on self-hosted runners #39272
jschwe left a comment:
One overall concern I have: We are not actually interested in the release profile, since release enables debug assertions, which has a major perf impact.
To keep the data on bencher clean, I think we should prefix the benchmark names (or some other thing we can filter on) with the cargo profile. Or even better, just always benchmark in production mode, since that is the only metric that matters.
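One possible way to do that tagging, sketched under the assumption that the raw results end up in a Bencher Metric Format (BMF) style JSON object keyed by benchmark name; the step, file name, and profile wiring below are illustrative, not the existing workflow:

```yaml
      # Illustrative step only: prepend the cargo profile to every benchmark
      # name so release and production results stay distinguishable in Bencher.
      - name: Prefix benchmark names with the cargo profile
        env:
          PROFILE: release   # hypothetical wiring; could come from the job matrix
        run: |
          jq --arg p "$PROFILE" 'with_entries(.key = "\($p)/" + .key)' \
            results.json > results-prefixed.json
```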
uses: ./.github/actions/runner-select
with:
  monitor-api-token: ${{ secrets.SERVO_CI_MONITOR_API_TOKEN }}
  github-hosted-runner-label: ubuntu-22.04
Is there a use case for falling back to GitHub-hosted runners for benchmarks?
in general, we try to make our usage of self-hosted runners gracefully degrade to github-hosted runners, because self-hosted runners are provided on a best-effort basis. and for bencher jobs, there are cases where we only measure binary size, so we use force-github-hosted-runner to force those cases to use github-hosted runners.
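As a minimal sketch of the selection pattern described here: `monitor-api-token` and `github-hosted-runner-label` appear in the diff above and `force-github-hosted-runner` is mentioned in this comment, but the step id and how its value gets wired up are assumptions, not the actual workflow:

```yaml
      # Prefer a self-hosted runner; fall back to the GitHub-hosted label when
      # none is available, or force GitHub-hosted for binary-size-only runs.
      - id: runner-select
        uses: ./.github/actions/runner-select
        with:
          monitor-api-token: ${{ secrets.SERVO_CI_MONITOR_API_TOKEN }}
          github-hosted-runner-label: ubuntu-22.04
          force-github-hosted-runner: ${{ inputs.binary-size-only }}  # hypothetical input
```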
there are cases where we only measure binary size, so we use force-github-hosted-runner to force those cases to use github-hosted runners.
I already lost context here (and am a bit time constrained now, so can't look at the code again), but do non-perf related metrics like binary size need to be handled by the bencher workflow? I would imagine that the metric could be a regular output of the build workflow (and either directly uploaded, or if tokens are a problem, eventually uploaded by the bencher perf job).
In general, we try to make our usage of self-hosted runners gracefully degrade to github-hosted runners,
I think that is great, but at least for perf jobs it doesn't really make sense to me, since the perf measurements on github-hosted runners are not useful in practice due to the high volatility (counting instructions might work there, but that metric is also not incredibly useful).
I'm mainly wondering if the added complexity (to handle github-hosted runners) here is really needed or not.
I already lost context here (and am a bit time constrained now, so can't look at the code again), but do non-perf related metrics like binary size need to be handled by the bencher workflow? I would imagine that the metric could be a regular output of the build workflow (and either directly uploaded, or if tokens are a problem, eventually uploaded by the bencher perf job).
i agree. i think this would be a good rework to do in a separate patch.
I think that is great, but at least for perf jobs it doesn't really make sense to me, since the perf measurements on github-hosted runners are not useful in practice due to the high volatility (counting instructions might work there, but that metric is also not incredibly useful).
that’s fair, yea, do you think we should skip running speedometer and dromaeo if no self-hosted runners are available?
I'm mainly wondering if the added complexity (to handle github-hosted runners) here is really needed or not.
i think this is the approach with the least added complexity overall, because it takes the existing bencher job and self-hosts it using the same pattern for selecting a runner as most of our other self-hosted jobs.
without fallback (even to a github-hosted no-op job), any downtime in the self-hosted runners would break everyone’s builds, which is currently a non-starter imo. i’m working towards a world where we don’t rely on github-hosted runners and our self-hosted runners are zero maintenance, but we’re not quite there yet.
that’s fair, yea, do you think we should skip running speedometer and dromaeo if no self-hosted runners are available?
Probably we should avoid uploading metrics we aren't going to use (since there presumably is a cost, and bencher.dev is a single-person project as far as I understand). I think bencher also recently introduced rate limiting (for non paying orgs).
i think this is the approach with the least added complexity overall, because it takes the existing bencher job and self-hosts it using the same pattern for selecting a runner as most of our other self-hosted jobs.
That's fair.
without fallback (even to a github-hosted no-op job), any downtime in the self-hosted runners would break everyone’s builds, which is currently a non-starter imo.
I'm not sure I agree here. I think we do need to transition towards a model where performance benchmarking is mandatory for pull requests to be merged. Performance regressions are just as serious as test regressions, so if we can't measure, then the MQ being broken is a feature in my opinion. The main problem here in my opinion is that we don't really have redundancy in terms of infrastructure maintainers. But I guess this discussion might be better had on zulip in the TSC channel to have a bit more visibility.
If we could trigger the performance testing workflow manually, and backfill performance data after a self-hosted runner downtime, so we can easily identify regressions merged to main (and revert), that could also be a viable short- to medium-term band-aid. I'm not sure how feasible that would be though.
I already lost context here (and am a bit time constrained now, so can't look at the code again), but do non-perf related metrics like binary size need to be handled by the bencher workflow? I would imagine that the metric could be a regular output of the build workflow (and either directly uploaded, or if tokens are a problem, eventually uploaded by the bencher perf job).
This was done because bencher requires all data for a single point to be sent together (if we sent binary size from the build workflow, it would show up as a separate point).
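A sketch of one way to satisfy that constraint, assuming both metrics are written out as BMF JSON earlier in the workflow: merge the files and upload them with a single `bencher run` invocation so they land on the same point. The file, project, and secret names are illustrative, and the `--adapter`/`--file` flags are taken from the public bencher CLI documentation, so they are worth re-checking against the pinned CLI version:

```yaml
      # Illustrative only: report binary size and runtime results together so
      # Bencher records them as one data point instead of two.
      - name: Merge BMF results into one file
        run: jq -s 'add' binary-size.json speedometer.json > combined.json
      - name: Upload the combined data point to Bencher
        run: |
          bencher run \
            --project servo \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --adapter json \
            --file combined.json
```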
i was inclined to agree with you that github-hosted benchmark results are not useful, but issues like #39641 show that even our existing github-hosted benchmarking can be very useful. i do expect that self-hosted benchmark results will be more useful though.
My assumption is that we would want to prevent PRs that regress performance from getting merged eventually (but rather sooner than later). We definitely need self-hosted runners to measure accurately enough to add thresholds that are low enough to be generally useful. GitHub-hosted runners could be used to prevent major regressions, but anything smaller than, let's say, 10-20% would easily fall through the cracks given the volatility we see. I agree that it is better than nothing, but we would also need to measure with GitHub-hosted runners on every commit to establish a baseline. Since our utilization is high, and we often use 100% of our GitHub-hosted runners, I'm not sure if this is worth it over just accepting that the merge queue would be broken if our self-hosted benchmark runners are down.
the current relationship we have with the self-hosted runners is that they’re provided on a best-effort basis, because no one is on call for outages. if you’re interested in changing that, we should discuss how to get there with the broader servo community. i don’t think this patch is the right place to unilaterally change that relationship and decide it’s acceptable for the merge queue to be broken for hours or days at a time.
I think moving to self-hosted runners is independent of this discussion, so I'd propose we move to the self-hosted runners first and then discuss in Zulip how strict we want to be.
BTW, in other projects like Chromium you don't get blocked due to performance regressions: the patch lands and then you get notified afterwards about the regressions; sometimes the patch has to be reverted, other times a fix can get merged to solve the regression. Just to show how other projects do this. The good part in Chromium is that the notifications are automatic, opening a bug and tagging the person who landed the original patch; if we could do something like that in Bencher it would be quite useful.
i don’t think this patch is the right place to unilaterally change that relationship and decide it’s acceptable for the merge queue to be broken for hours or days at a time.
My intention was not to suggest that we do this now as part of this patch. I mainly meant to state that I believe we want something like that long-term (and yes, that would require discussions on how to practically implement it). I was mainly wondering if we could just not require the bencher workflows for the MQ (instead of falling back), and make it a requirement once self-hosted is stable enough and we have relevant procedures in place. But this PR has been cooking for long enough, so I'd be fine with merging as is and picking up the discussion another time.
BTW, in other projects like Chromium you don't get blocked due to performance regressions, the patch lands and then you get notified afterwards about the regressions, sometimes the patch has to be reverted, other times a fix can get merged and solve the regression.
I only have second-hand knowledge about this, but my understanding was that chromium CI does have multiple layers, and there are performance tests run before merging (although the number of test-configurations run only after merging is considerably larger).
The good part in Chromium is that the notifications are automatic, opening a bug and tagging the person who landed the original patch; if we could do something like that in Bencher it would be quite useful.
We can set up thresholds in bencher, which would trigger a warning on bencher (as a comment in the PR if run before merging), but the usefulness of thresholds correlates strongly with the stability of the results.
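For illustration, a hedged sketch of what gating on such a warning could look like in a workflow step, assuming thresholds are configured on the Bencher project; `--err` (fail when an alert is generated) and the other flags are taken from the public `bencher run` documentation and should be verified against the CLI version in use, and the branch, testbed, and secret values are placeholders:

```yaml
      # Illustrative only: upload results and fail the job if any configured
      # Bencher threshold generates an alert.
      - name: Upload benchmark results and gate on alerts
        run: |
          bencher run \
            --project servo \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --branch "${{ github.head_ref || github.ref_name }}" \
            --testbed servo-ubuntu2204-bench \
            --adapter json \
            --file results.json \
            --err
```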
note that our runtime perf benchmarks already exclusively use the release profile on linux, and i don’t think it’s true that we are completely uninterested in them or that those metrics don’t matter at all. regressions in production builds are likely to be regressions in release builds and vice versa, even if the relationship isn’t perfect. production would probably be better, and i would have no complaints about us changing this in a separate patch. in the meantime, how would you like the profile to be prefixed in the data? it looks like the harmony benchmark results put it in the benchmark name, but i can’t find the code that does that?
@sagudev this should be ready for review :)
sagudev left a comment:
LGTM, although I would also wait for @jschwe.
OFF-TOPIC: Why do we not run speedometer on pushes to main? https://github.com/servo/servo/actions/runs/18276911520/job/52031668893
It's a bit hacky IMHO, but when running: see python/servo/testing_commands.py, lines 628 to 630 (at 5887e1e).
Doing it in a different patch is fine, as long as we make sure the results are prefixed in some way, so that we wouldn't suddenly get a large perf improvement in some graph just because we switched the profile. My main concern for the release profile is that the volatility is higher than in production, which directly constrains our ability to set a lower threshold for what a "regression" is.
that seems to be a try run; did you mean to link a different run?
some bencher jobs, specifically linux release profile jobs, measure runtime perf by running speedometer and dromaeo. doing this on GitHub-hosted runners is suboptimal, because GitHub-hosted runners are not under our control and their performance can vary wildly depending on what hosts we get and how busy they are.
this patch depends on #39270 and #39271, and the new ci3 and ci4 servers deployed in servo/ci-runners#49. these servers provide a more controlled environment for benchmarking by using known hardware that runs one job at a time and no other work (servo/project#160), with some of the techniques we’ve developed for accurate measurement:
to use ci3 and ci4 for bencher jobs, we add them to the list of self-hosted runner servers, then make the bencher workflow try to find a servo-ubuntu2204-bench runner if speedometer and/or dromaeo have been requested. to avoid mixing data, we set the bencher “testbed” based on where the runner came from:
Testing:
Fixes: #39269
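As a rough illustration of the runner fallback and testbed split described above (the job name, outputs, and testbed labels here are assumptions for the sketch, not the actual workflow contents):

```yaml
  # Sketch only: run on whichever runner was selected, and report to a testbed
  # that matches it so self-hosted and GitHub-hosted measurements never mix.
  bencher:
    needs: select-runner   # hypothetical job wrapping .github/actions/runner-select
    runs-on: ${{ needs.select-runner.outputs.runner-label }}
    env:
      BENCHER_TESTBED: ${{ needs.select-runner.outputs.is-self-hosted == 'true' && 'servo-ubuntu2204-bench' || 'github-ubuntu-2204' }}
    steps:
      - name: Show which testbed this run reports to
        run: echo "Reporting to Bencher testbed '$BENCHER_TESTBED'"
```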