[CI] Support for CI GPU test and benchmark on containers #137169
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/137169
Note: Links to docs will display an error until the docs builds have been completed.
1 New Failure, 1 Unrelated Failure as of commit 3273b62 with merge base 52d29a2.
NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following job failed but was present on the merge base:
Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
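For context, the `IN_CONTAINER_RUNNER` output referenced above comes from a separate `check_container_runner` step, and `GPU_FLAG` is presumably expanded into a later `docker run` invocation so the nested container gets GPU access. A minimal sketch of what such a detection step could look like, assuming a `/.dockerenv` marker file is a reliable signal (the actual step in the PR may detect the container differently):

```yaml
- name: Check if on a container runner
  id: check_container_runner
  shell: bash
  run: |
    # Hypothetical detection: a /.dockerenv marker file usually exists
    # inside Docker containers; the real step may use another signal.
    if [ -f /.dockerenv ]; then
      echo "IN_CONTAINER_RUNNER=true" >> "${GITHUB_OUTPUT}"
    else
      echo "IN_CONTAINER_RUNNER=false" >> "${GITHUB_OUTPUT}"
    fi
```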
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
Do you know if we really need to set SCCACHE_SERVER_PORT here? If it doesn't help with sccache, I wonder if we need to keep the step (there are 2 copies of this step, here and in _linux_test.yml).
This is an interesting question; maybe yes, maybe not. It is not very likely that running this in a containerized environment would be a problem. If you believe there are potential risks in doing this, we can remove it.
I think the downside is we might have a redundant step that does nothing in an already complex workflow.
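For reference, a minimal sketch of the kind of step under discussion, assuming the goal is to give each container runner on a shared host its own sccache server port instead of the default 4226 (the `RUNNER_INDEX` variable and the port derivation are hypothetical, not the PR's actual logic):

```yaml
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
  if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
  run: |
    # Hypothetical derivation: offset the default sccache port (4226)
    # by a per-runner index so servers on the same host do not collide.
    echo "SCCACHE_SERVER_PORT=$((4226 + ${RUNNER_INDEX:-0}))" >> "${GITHUB_ENV}"
```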
  - name: Lock NVIDIA A100 40GB Frequency
    run: |
      sudo nvidia-smi -pm 1
      sudo nvidia-smi -ac 1215,1410
      nvidia-smi
-   if: contains(matrix.runner, 'a100')
+   if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
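As background on the step above: `-pm 1` enables persistence mode and `-ac 1215,1410` pins the memory and graphics clocks (in MHz) so benchmark numbers are not skewed by frequency scaling; the `1215,1410` pair matches the A100 40GB. A sketch of a companion step for adapting this to another GPU, which lists the clock pairs the device actually accepts for `-ac` (the step name is illustrative):

```yaml
- name: List supported clock pairs before choosing -ac values
  run: |
    # Prints the <memory clock, graphics clock> combinations the GPU
    # supports, so valid values can be picked for nvidia-smi -ac.
    nvidia-smi -q -d SUPPORTED_CLOCKS
```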
I think there are still some references to `gcp.a100` in the workflows; searching for that string turns them up.
We are not migrating from `gcp.a100` to `aws.a100` in this PR; we first need those changes merged on main, along with others' PRs.
Ah, I mean there are steps like Setup SSH that check whether the runner name contains `gcp.a100`. That can be changed to `a100`. I get that this is not the cut-over PR from `gcp.a100` to `aws.a100`, thus the question about the rollout plan below :)
Let's chat later about how you plan to roll this out; we would probably want to use both the GCP and AWS A100 runners for a week or so and compare the benchmark numbers.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 2 checks: pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge) and inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Renames the ARC references to container, and adds the changes required so that CI jobs requiring a GPU can run on containers.