Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0 / WindowlessContext: Unable to create windowless context #2527
Open
zhangyu0110 opened this issue on Jan 3, 2025 · 0 comments
Hello, I am using habitat-sim 0.1.7 in a Docker container. When I train with one 3090 GPU, everything works fine, but when I use two GPUs, the following error occurs. Could you please help me understand why?
CUDA_VISIBLE_DEVICES=0,1 bash run_r2r/main.bash train 2333
train mode
/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  FutureWarning,
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2025-01-02 13:50:51,734 Initializing dataset VLN-CE-v1
2025-01-02 13:50:51,734 Initializing dataset VLN-CE-v1
2025-01-02 13:50:52,398 SPLTI: train, NUMBER OF SCENES: 61
2025-01-02 13:50:52,398 SPLTI: train, NUMBER OF SCENES: 61
2025-01-02 13:50:55,648 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,650 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,717 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,720 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,727 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:56,349 initializing sim Sim-v1
2025-01-02 13:50:56,351 initializing sim Sim-v1
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
2025-01-02 13:50:56,430 initializing sim Sim-v1
2025-01-02 13:50:56,432 initializing sim Sim-v1
2025-01-02 13:50:56,436 initializing sim Sim-v1
2025-01-02 13:50:56,440 initializing sim Sim-v1
2025-01-02 13:50:56,443 initializing sim Sim-v1
2025-01-02 13:50:56,444 initializing sim Sim-v1
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Traceback (most recent call last):
  File "run.py", line 113, in <module>
    main()
  File "run.py", line 49, in main
    run_exp(**vars(args))
  File "run.py", line 106, in run_exp
    trainer.train()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 451, in train
    observation_space, action_space = self._init_envs()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 168, in _init_envs
    auto_reset_done=False
  File "/home/ETPNav/vlnce_baselines/common/env_utils.py", line 122, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "run.py", line 113, in <module>
    main()
  File "run.py", line 49, in main
    run_exp(**vars(args))
  File "run.py", line 106, in run_exp
    trainer.train()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 451, in train
    observation_space, action_space = self._init_envs()
  File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 168, in _init_envs
    auto_reset_done=False
  File "/home/ETPNav/vlnce_baselines/common/env_utils.py", line 122, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7fec60fd9358>>
Traceback (most recent call last):
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 588, in __del__
    self.close()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 456, in close
    read_fn()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f8dac95c358>>
Traceback (most recent call last):
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 588, in __del__
    self.close()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 456, in close
    read_fn()
  File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2299562) of binary: /root/miniconda3/envs/vlnce/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/vlnce/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================
Root Cause:
[0]:
time: 2025-01-02_13:50:58
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 2299562)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
[1]:
time: 2025-01-02_13:50:58
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 2299563)
error_file: <N/A>
msg: "Process failed with exitcode 1"
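For reference, this particular EGL failure inside Docker is often caused by the container exposing the GPUs for CUDA ("compute") but not for rendering: habitat-sim's WindowlessEglApplication only finds an EGL device when the NVIDIA driver capabilities include "graphics". A minimal pre-flight check, a sketch only (`egl_likely_available` is a hypothetical helper name, not part of habitat-sim), might look like:

```python
import os

def egl_likely_available(caps: str) -> bool:
    """Heuristic: the NVIDIA container runtime only exposes EGL devices
    when NVIDIA_DRIVER_CAPABILITIES includes "graphics" (or "all")."""
    wanted = {c.strip() for c in caps.split(",") if c.strip()}
    return "all" in wanted or "graphics" in wanted

if __name__ == "__main__":
    caps = os.environ.get("NVIDIA_DRIVER_CAPABILITIES", "")
    if not egl_likely_available(caps):
        # Relaunch the container with e.g.:
        #   docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all ...
        print(f"EGL likely unavailable (NVIDIA_DRIVER_CAPABILITIES={caps!r})")
```

Running this inside the training container before launching would distinguish a missing driver capability from a genuine multi-GPU bug.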