[BUG] OOM in DeepPot.eval_descriptor while dp test works #4544

Open
QuantumMisaka opened this issue Jan 10, 2025 · 3 comments · May be fixed by #4547
Labels
bug reproduced This bug has been reproduced by developers

Comments

@QuantumMisaka

Bug summary

When I load a trained model in a Python environment and call DeepPot.eval_descriptor on my test LabeledSystem, an OOM error occurs on my A100-40G GPU:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 39.42 GiB of which 3.56 MiB is free. Process 276556 has 30.06 MiB memory in use. Including non-PyTorch memory, this process has 39.38 GiB memory in use. Of the allocated memory 38.05 GiB is allocated by PyTorch, and 25.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-direct-10p/desc-val/oom-example/calc_desc.py", line 53, in <module>
    desc = descriptor_from_model(onedata, model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-direct-10p/desc-val/oom-example/calc_desc.py", line 26, in descriptor_from_model
    predict = model.eval_descriptor(coords, cells, atypes)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/infer/deep_eval.py", line 445, in eval_descriptor
    descriptor = self.deep_eval.eval_descriptor(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 649, in eval_descriptor
    self.eval(
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 292, in eval
    out = self._eval_func(self._eval_model, numb_test, natoms)(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 364, in eval_func
    return self.auto_batch_size.execute_all(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/utils/batch_size.py", line 197, in execute_all
    n_batch, result = self.execute(execute_with_batch_size, index, natoms)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/utils/batch_size.py", line 120, in execute
    raise OutOfMemoryError(
deepmd.utils.errors.OutOfMemoryError: The callable still throws an out-of-memory (OOM) error even when batch size is 1!

However, dp test with the same model on the same LabeledSystem dataset completes successfully: memory usage first reaches ~39 GB and then drops to 28 GB.
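
The final OutOfMemoryError in the traceback is raised from deepmd/utils/batch_size.py once the OOM persists at batch size 1. As a rough illustration of that retry-and-shrink pattern, here is a simplified sketch. It is not DeePMD-kit's actual code, and SketchOOMError and run_in_batches are hypothetical names used only for this example:

```python
class SketchOOMError(Exception):
    """Hypothetical stand-in for a backend out-of-memory error."""


def run_in_batches(func, n_items, start_batch=8):
    """Call func(start, end) over [0, n_items), halving the batch size on OOM.

    Raises SketchOOMError if func still fails when the batch size is 1.
    """
    results = []
    batch = start_batch
    start = 0
    while start < n_items:
        end = min(start + batch, n_items)
        try:
            results.append(func(start, end))
        except SketchOOMError as err:
            if batch == 1:
                # Nothing left to shrink: give up, as in the error message above.
                raise SketchOOMError(
                    "still out of memory even when batch size is 1"
                ) from err
            batch = max(1, batch // 2)
            continue  # retry the same range with a smaller batch
        start = end
    return results
```

If the failure persists at batch size 1, as reported here, shrinking the batch further cannot help, which may point to memory being held by something other than the size of a single batch.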

DeePMD-kit Version

DeePMD-kit v3.0.0rc1.dev0+g0ad42893.d20250106

Backend and its version

Pytorch 2.5.1

How did you download the software?

pip

Input Files, Running Commands, Error Log, etc.

All reference files are available in dp_eval_desc_oom.tar.gz

or through Nutstore Cloud:
https://www.jianguoyun.com/p/DV3eAQoQrZ-XCRim4ecFIAA (code : unpbns)

Steps to Reproduce

tar -zxvf dp_eval_desc_oom.tar.gz
cd oom-example

The archive contains the following files:

calc_desc.py  data  desc_all.sh  desc.log  descriptors  model.pth  test.log  test.sh

explained as follows:

  • calc_desc.py uses DeepPot.eval_descriptor() to generate the descriptors of a LabeledSystem
  • desc_all.sh reads MultiSystems from data and calls calc_desc.py iteratively
  • data is the directory containing the LabeledSystems that lead to OOM in eval_descriptor
  • model.pth is the model in use
  • test.sh is the script that calls dp --pt test and writes test.log
  • desc.log is the stderr output of the OOM run
  • descriptors is the directory intended to hold the output descriptors; it should be empty because of the OOM

One can reproduce the OOM problem directly with these files.

Further Information, Files, and Links

Related issue: #4533

If one calls the DeepPot.eval() function directly in Python code, a similar OOM problem also appears (from my previous test in Jan 2024), so I guess there is some difference between running dp test on the command line and calling the evaluation interface directly in Python.

@QuantumMisaka QuantumMisaka changed the title [BUG] OOM in DeepPot.eval_descriptor wile dp test works [BUG] OOM in DeepPot.eval_descriptor while dp test works Jan 10, 2025
@njzjz
Member

njzjz commented Jan 11, 2025

Perhaps because the descriptor tensor is not detached
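
To illustrate the hypothesis above with a toy PyTorch example: a tensor stored without detach() keeps a reference to its autograd graph through grad_fn, so intermediate buffers may stay alive for as long as the tensor is kept. This is a sketch only; collect_outputs and the tanh stand-in are hypothetical and are not DeePMD-kit code:

```python
import torch


def collect_outputs(batches, weight, detach=True):
    """Compute a toy per-batch output and keep the results in a list.

    With detach=True, each stored tensor is cut off from the autograd graph,
    so the graph built for that batch can be freed once the loop moves on.
    """
    kept = []
    for x in batches:
        y = torch.tanh(x @ weight)  # stand-in for a descriptor computation
        kept.append(y.detach() if detach else y)
    return kept


weight = torch.randn(4, 3, requires_grad=True)
batches = [torch.randn(2, 4) for _ in range(3)]

detached = collect_outputs(batches, weight, detach=True)
attached = collect_outputs(batches, weight, detach=False)
```

In this toy case the detached results carry no grad_fn, while the attached ones each hold on to their graph. Whether this is the cause of the OOM reported here is not confirmed by the thread.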

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jan 11, 2025
@njzjz njzjz linked a pull request Jan 11, 2025 that will close this issue
@njzjz njzjz added the reproduced This bug has been reproduced by developers label Jan 17, 2025
@njzjz
Member

njzjz commented Jan 17, 2025

I can reproduce the error, though I am not sure how to fix it.

@QuantumMisaka
Author

QuantumMisaka commented Jan 17, 2025

@njzjz I found that if I replace the eval_descriptor call on the whole LabeledSystem

desc = descriptor_from_model(onedata, model)

with eval_descriptor calls on each single System in the LabeledSystem, using a for-loop

desc_list = []
for onesys in onedata:
    desc_onesys = descriptor_from_model(onesys, model)
    desc_list.append(desc_onesys)
desc = np.concatenate(desc_list, axis=0)

(in the calc_desc.py script), the OOM problem disappears, and memory consumption stays below 3 GB.

This test was done with the modification from #4547 applied.

I think the key to the OOM may lie in evaluating the whole LabeledSystem all at once.
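
The per-System loop above can be generalized to chunks of frames. Below is a minimal sketch, assuming a model object that exposes eval_descriptor(coords, cells, atom_types) as seen in the traceback, with coords shaped (nframes, natoms * 3) and cells shaped (nframes, 9) or None. The helper name eval_descriptor_in_chunks and the chunk_size parameter are hypothetical:

```python
import numpy as np


def eval_descriptor_in_chunks(model, coords, cells, atom_types, chunk_size=1):
    """Evaluate descriptors a few frames at a time and concatenate the results.

    Assumes model.eval_descriptor(coords, cells, atom_types) returns an array
    whose first axis is the frame axis.
    """
    coords = np.asarray(coords)
    cells = None if cells is None else np.asarray(cells)
    nframes = coords.shape[0]
    if nframes == 0:
        raise ValueError("at least one frame is required")

    parts = []
    for start in range(0, nframes, chunk_size):
        stop = start + chunk_size
        chunk_cells = None if cells is None else cells[start:stop]
        # Only this chunk of frames is passed to the model in one call.
        desc = model.eval_descriptor(coords[start:stop], chunk_cells, atom_types)
        parts.append(np.asarray(desc))
    return np.concatenate(parts, axis=0)
```

With chunk_size=1 this matches the for-loop workaround; a larger chunk_size trades memory for fewer calls, and a suitable value would have to be found by experiment.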
