-
Hello DeePMD-kit community, I'm currently using DP-GEN to manage an automated workflow with DeePMD-kit 2.2.10 (GPU version) installed from the conda channel. My setup involves running the training (DeePMD-kit), exploration (LAMMPS), and labeling (VASP-6.4 GPU version) processes on a single GPU node equipped with 4 Nvidia Tesla V100 GPU cards, 360 GB memory, and 40 AMD CPU cores. While testing the DP-GEN workflow with the official test data, I've encountered an issue specific to LAMMPS GPU utilization:
My question is: Does the pre-compiled LAMMPS binary (lmp) shipped with DeePMD-kit not support GPU parallelization? If it does, what might be causing this issue, and how can I resolve it so that the LAMMPS exploration step in my DP-GEN workflow fully utilizes all available GPUs? Do I need to recompile LAMMPS myself? Any insights or suggestions would be greatly appreciated. Thank you in advance for your help! System details:
Here are the settings in machine.json:
-
When using `para_deg` with GPUs, `if_cuda_multi_devices` should be set to `true`.

Different MPI libraries may not be ABI compatible, as the MPI ABI standard hasn't been adopted. So, do not mix the MPI from the cluster with the MPI from Conda.
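
For reference, a minimal sketch of what the relevant `resources` section of `machine.json` could look like for a 4-GPU node. The specific counts and the surrounding fields here are illustrative assumptions for this setup, not your actual configuration; the point is only the combination of `para_deg` and `if_cuda_multi_devices`:

```json
{
  "resources": {
    "number_node": 1,
    "cpu_per_node": 10,
    "gpu_per_node": 4,
    "group_size": 1,
    "para_deg": 4,
    "if_cuda_multi_devices": true,
    "source_list": []
  }
}
```

With these two options, dpdispatcher can run several LAMMPS tasks concurrently on the node and distribute them across the visible CUDA devices, instead of stacking every task on GPU 0.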