Issue with distributed batches over multiple GPUs #416
Unanswered
ClaasBeger asked this question in Q&A
Replies: 1 comment · 3 replies
-
Hello,

I am currently implementing a large model that uses SpikingJelly MultiStepLIFNode layers with the cupy backend, and I am trying to distribute training over multiple GPUs with torch.nn.DataParallel. Unfortunately, training appears to hang and never produces any results (no exception is thrown either). The program runs normally without the LIF nodes. I found a similar issue in the past, cupy/cupy#6569, where the cupy device was always set to cuda:0; that has been marked as fixed in SpikingJelly version 10. Are there still any known issues when using the cupy backend with DataParallel, or is it expected to work fine?

Thank you in advance!
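For context, here is a minimal sketch of the setup described above, assuming the spikingjelly.clock_driven API that MultiStepLIFNode belongs to; ToySNN, the layer sizes, and the number of time steps T are placeholders, not the actual model:

```python
import torch
import torch.nn as nn
from spikingjelly.clock_driven import neuron, functional

# Placeholder stand-in for the "large model": one linear layer followed by a
# multi-step LIF node running on the cupy backend.
class ToySNN(nn.Module):
    def __init__(self, in_features=784, num_classes=10, T=4):
        super().__init__()
        self.T = T  # number of simulation time steps
        self.fc = nn.Linear(in_features, num_classes)
        self.lif = neuron.MultiStepLIFNode(tau=2.0, backend='cupy')

    def forward(self, x):
        # x: [N, in_features] -> sequence [T, N, num_classes] for the multi-step node
        x_seq = self.fc(x).unsqueeze(0).repeat(self.T, 1, 1)
        return self.lif(x_seq).mean(0)  # firing rate over the T time steps

model = nn.DataParallel(ToySNN().cuda())  # split each batch across all visible GPUs

x = torch.rand(32, 784, device='cuda')
out = model(x)               # with more than one GPU, this is where training reportedly hangs
functional.reset_net(model)  # reset membrane potentials between batches
print(out.shape)             # torch.Size([32, 10])
```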
-
I have not found any other issue that would cause DDP training problems. You can try the latest version of SpikingJelly and check whether it works well.
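The reply refers to DDP rather than DataParallel, so if the hang persists, one alternative worth trying is torch.nn.parallel.DistributedDataParallel with one process per GPU. Below is a sketch of that launch pattern using the same placeholder model as above; it illustrates the general DDP recipe with dummy data and is not a confirmed fix for this issue:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from spikingjelly.clock_driven import neuron, functional

class ToySNN(nn.Module):  # same placeholder model as in the sketch above
    def __init__(self, in_features=784, num_classes=10, T=4):
        super().__init__()
        self.T = T
        self.fc = nn.Linear(in_features, num_classes)
        self.lif = neuron.MultiStepLIFNode(tau=2.0, backend='cupy')

    def forward(self, x):
        x_seq = self.fc(x).unsqueeze(0).repeat(self.T, 1, 1)
        return self.lif(x_seq).mean(0)

def main():
    # Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = ToySNN().cuda(local_rank)
    # One process per GPU, so the cupy kernels of the LIF node run on the
    # device owned by this process rather than being replicated from cuda:0.
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):  # dummy training loop with random data
        x = torch.rand(32, 784, device=f'cuda:{local_rank}')
        target = torch.randint(0, 10, (32,), device=f'cuda:{local_rank}')
        loss = nn.functional.cross_entropy(model(x), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        functional.reset_net(model)  # reset LIF membrane state after each batch

    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```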