fix(pt): detach computed descriptor tensor to prevent OOM #4547
base: devel
Fix deepmodeling#4544.
Signed-off-by: Jinzhe Zeng <[email protected]>
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/pt/model/atomic_model/dp_atomic_model.py (1)
247-247: LGTM! Effective solution for preventing OOM during descriptor evaluation.
The addition of .detach() is the correct approach here as it:
- Prevents unnecessary gradient tracking in the evaluation cache
- Maintains the original tensor's gradients for the fitting network
- Reduces memory usage by not retaining the computational graph for cached descriptors
This is a memory-efficient solution since:
- The evaluation hook only needs the tensor values, not the gradients
- The original descriptor tensor still maintains its gradients for the fitting network
- The detached tensor shares the same memory as the original tensor but without the gradient overhead
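The points above can be checked with a minimal sketch. It does not reproduce the code in dp_atomic_model.py; the names x, descriptor, and cached are hypothetical stand-ins for a model input, the descriptor output, and the entry stored by an evaluation cache.

```python
import torch

# Hypothetical stand-ins: `x` plays the role of a model input and
# `descriptor` the role of the descriptor tensor produced from it.
x = torch.ones(3, requires_grad=True)
descriptor = x * 2.0          # tracked by autograd, has a grad_fn

# What an evaluation cache would store after the change:
cached = descriptor.detach()

# The cached tensor carries no autograd history ...
assert cached.grad_fn is None and not cached.requires_grad
# ... but shares storage with the original, so no data is copied.
same_storage = cached.data_ptr() == descriptor.data_ptr()

# The original tensor is unaffected and can still be backpropagated,
# as the fitting network would require.
descriptor.sum().backward()
```

Because the two tensors share storage, an in-place modification of cached would also change descriptor, so a cache that stores detached tensors should treat them as read-only.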
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4547      +/-   ##
==========================================
- Coverage   84.57%   84.57%   -0.01%
==========================================
  Files         677      677
  Lines       63916    63915       -1
  Branches     3486     3488       +2
==========================================
- Hits        54060    54059       -1
+ Misses       8730     8729       -1
- Partials     1126     1127       +1
@njzjz However, this change does fix a separate problem. When using the script below to generate descriptors for several systems within a single Python script:
Before this change, CUDA memory could not be freed after each LabeledSystem was evaluated; with this PR, it can.
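The commenter's script is not reproduced on this page. The following is only a minimal, CPU-only sketch of the general pattern being described, under stated assumptions: compute_descriptor, evaluate_systems, and weight are hypothetical names, and random tensors stand in for the evaluated systems. It illustrates that cached tensors which are not detached each keep a reference to their autograd graph across evaluations, while detached ones do not.

```python
import torch

# Hypothetical stand-in for model parameters.
weight = torch.randn(8, 4, requires_grad=True)

def compute_descriptor(coords):
    # Hypothetical stand-in for a descriptor network.
    return torch.tanh(coords @ weight)

def evaluate_systems(systems, detach):
    # Collect one descriptor per system, as an evaluation cache might.
    cache = []
    for coords in systems:
        descriptor = compute_descriptor(coords)
        cache.append(descriptor.detach() if detach else descriptor)
    return cache

# Random inputs standing in for successive systems.
systems = [torch.randn(5, 8) for _ in range(3)]

leaky = evaluate_systems(systems, detach=False)
clean = evaluate_systems(systems, detach=True)

# Count cached entries that still reference an autograd graph.
graphs_kept_leaky = sum(t.grad_fn is not None for t in leaky)
graphs_kept_clean = sum(t.grad_fn is not None for t in clean)
```

In the non-detached case every cached entry keeps its graph, and the tensors saved for backward, alive for as long as the cache exists, which is consistent with memory not being released between systems.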
If so, I think this PR can still be merged, though I haven't found a way to resolve the original issue.