Issue with distributed batches over multiple GPUs #416
Unanswered
ClaasBeger asked this question in Q&A
Replies: 1 comment · 3 replies
-
Hello,

I am currently implementing a large model that uses SpikingJelly MultiStepLIFNode layers with the cupy backend, and I am trying to distribute training over multiple GPUs with torch.nn.DataParallel. Unfortunately, training appears to hang and never produces any results (no exception is thrown either). The program runs normally without the LIF nodes. I found a similar issue in the past, cupy/cupy#6569, where the cupy device was always set to cuda:0; that has been marked as fixed in SpikingJelly version 10. Are there still any known issues when using the cupy backend with DataParallel, or is it expected to work fine?

Thank you in advance!
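For context, here is a minimal sketch of the setup described above, assuming the spikingjelly.clock_driven API that MultiStepLIFNode belongs to; ToySNN, the layer sizes, and the number of time steps T are placeholders, not the actual model:

```python
import torch
import torch.nn as nn
from spikingjelly.clock_driven import neuron, functional

# Placeholder stand-in for the "large model": one linear layer followed by a
# multi-step LIF node running on the cupy backend.
class ToySNN(nn.Module):
    def __init__(self, in_features=784, num_classes=10, T=4):
        super().__init__()
        self.T = T  # number of simulation time steps
        self.fc = nn.Linear(in_features, num_classes)
        self.lif = neuron.MultiStepLIFNode(tau=2.0, backend='cupy')

    def forward(self, x):
        # x: [N, in_features] -> sequence [T, N, num_classes] for the multi-step node
        x_seq = self.fc(x).unsqueeze(0).repeat(self.T, 1, 1)
        return self.lif(x_seq).mean(0)  # firing rate over the T time steps

model = nn.DataParallel(ToySNN().cuda())  # split each batch across all visible GPUs

x = torch.rand(32, 784, device='cuda')
out = model(x)               # with more than one GPU, this is where training reportedly hangs
functional.reset_net(model)  # reset membrane potentials between batches
print(out.shape)             # torch.Size([32, 10])
```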
-
I have not found any other issue that would cause DDP training problems. You can try the latest version of SpikingJelly and check whether it works well.
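The reply refers to DDP rather than DataParallel, so if the hang persists, one alternative worth trying is torch.nn.parallel.DistributedDataParallel with one process per GPU. Below is a sketch of that launch pattern using the same placeholder model as above; it illustrates the general DDP recipe with dummy data and is not a confirmed fix for this issue:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from spikingjelly.clock_driven import neuron, functional

class ToySNN(nn.Module):  # same placeholder model as in the sketch above
    def __init__(self, in_features=784, num_classes=10, T=4):
        super().__init__()
        self.T = T
        self.fc = nn.Linear(in_features, num_classes)
        self.lif = neuron.MultiStepLIFNode(tau=2.0, backend='cupy')

    def forward(self, x):
        x_seq = self.fc(x).unsqueeze(0).repeat(self.T, 1, 1)
        return self.lif(x_seq).mean(0)

def main():
    # Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = ToySNN().cuda(local_rank)
    # One process per GPU, so the cupy kernels of the LIF node run on the
    # device owned by this process rather than being replicated from cuda:0.
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):  # dummy training loop with random data
        x = torch.rand(32, 784, device=f'cuda:{local_rank}')
        target = torch.randint(0, 10, (32,), device=f'cuda:{local_rank}')
        loss = nn.functional.cross_entropy(model(x), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        functional.reset_net(model)  # reset LIF membrane state after each batch

    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```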