CARVIEW |
Navigation Menu
-
Notifications
You must be signed in to change notification settings - Fork 24.7k
[BE] Improve Flight Recorder efficacy #142178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142178
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit a5411f7 with merge base c6e18a1 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This pull request was exported from Phabricator. Differential Revision: D66843194 |
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Differential Revision: D66843194
67d1266
to
25485f4
Compare
This pull request was exported from Phabricator. Differential Revision: D66843194 |
25485f4
to
c0ee84b
Compare
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Differential Revision: D66843194
This pull request was exported from Phabricator. Differential Revision: D66843194 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Have some minor comments.
auto status = promiseFlightRecorderDump_.get_future().wait_for( | ||
std::chrono::milliseconds(waitTimeoutDumpInMilSec_)); | ||
std::chrono::milliseconds(2 * waitTimeoutDumpInMilSec_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: waitTimeoutDumpInMilSec_
is not used elsewhere, maybe we can directly bump its value by 2x? (easier maintenance I guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the watchdog needs to give 2 mins for heartbeat thread to try to dump (1 min for with stack trace and if this fails, 1 min without stack trace).
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Reviewed By: kwen2501 Differential Revision: D66843194
c0ee84b
to
3160715
Compare
This pull request was exported from Phabricator. Differential Revision: D66843194 |
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Reviewed By: kwen2501 Differential Revision: D66843194
3160715
to
675be65
Compare
This pull request was exported from Phabricator. Differential Revision: D66843194 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to do a roll out on this or it is ok that we directly have it in the code?
✅ Deploy Preview for chimerical-cranachan-793287 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Reviewed By: kwen2501 Differential Revision: D66843194
675be65
to
56f3b1e
Compare
This pull request was exported from Phabricator. Differential Revision: D66843194 |
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Reviewed By: kwen2501 Differential Revision: D66843194
56f3b1e
to
e4e0768
Compare
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Reviewed By: kwen2501 Differential Revision: D66843194
e4e0768
to
630a046
Compare
This pull request was exported from Phabricator. Differential Revision: D66843194 |
1 similar comment
This pull request was exported from Phabricator. Differential Revision: D66843194 |
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Reviewed By: kwen2501 Differential Revision: D66843194
630a046
to
ee86d36
Compare
This pull request was exported from Phabricator. Differential Revision: D66843194 |
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Reviewed By: kwen2501 Differential Revision: D66843194
ee86d36
to
a5411f7
Compare
This pull request was exported from Phabricator. Differential Revision: D66843194 |
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Summary: This is an attempt to improve the flight recorder efficacy. We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`. There are two changes in here. 1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack. Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace. 2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing. TODO: - there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not. This info might be useful for the analyzer - so I'll add this in a follow on diff. Test Plan: Unit tests. Differential Revision: D66843194 Pull Request resolved: pytorch#142178 Approved by: https://github.com/kwen2501
Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a
std::exception - broken promise
.There are two changes in here.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
TODO:
This info might be useful for the analyzer - so I'll add this in a follow on diff.
Test Plan: Unit tests.
Differential Revision: D66843194
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k