DDP training with multiple GPUs using WSL #11519
Comments
@glenn-jocher Hi, did you try DDP training on WSL, or only on a native Linux system and Docker?
@cool112624 hello! Thank you for your question. Distributed Data Parallel (DDP) training works on both Linux and WSL systems, as well as with Docker. Let us know if you have any further questions or concerns!
@glenn-jocher I ran train.py in WSL with two GPUs, but it shows the error code above. Can you help me see what the problem is?
@cool112624 hi, thank you for reaching out. Can you share the error code that you're encountering? It will be easier to identify the root cause and provide a solution if we can take a look at the specific error message.
@glenn-jocher Thank you for your time; my error code is below. ERROR CODE
Hi @glenn-jocher, can you see the error code that I pasted above?
@cool112624 hello there, Thank you for reaching out. We would be glad to help you with your error code. Please provide us with more details regarding your issue, such as the version of YOLOv5 you are using and the steps you followed before encountering the error. This will help us better understand and address your problem. Looking forward to hearing back from you. Thank you.
Hi @glenn-jocher, thank you for your reply. I am using the newest version of YOLOv5 (2023.05.26). I am running WSL from the Windows cmd, and my command line is:

```
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 32 --epoch 100 --data coco.yaml --weights yolov5n.pt --device 0,1
```

The new error code is shown below:

```
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
train: weights=yolov5n.pt, cfg=, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2023-5-26 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB)
                                                    CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Dataset not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
```
Dear @cool112624, Thank you for providing the details of your system and the error log you encountered. We can see that the error pertains to a missing dataset. Specifically, the error says "Dataset is not found, missing paths ['/mnt/c/Users/andy/duckegg_linux/datasets/coco/val2017.txt']". This suggests that there might be an issue with the path or directory of your dataset. We advise that you review and verify the location and accessibility of your dataset. Additionally, please ensure that you have provided the correct path of your dataset in your command line. We hope this information helps you address your issue with YOLOv5. If you have any further concerns or questions, don't hesitate to reach out. Best regards.
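A quick way to verify the dataset paths the reply above refers to is sketched below. This is a minimal sketch, not part of YOLOv5: the file name `check_data_yaml.py` is hypothetical, it assumes the usual coco.yaml layout with `path`, `train`, and `val` keys, and it resolves relative paths against the current working directory, which may differ slightly from how YOLOv5 itself resolves them.

```python
# check_data_yaml.py -- hypothetical helper: list which paths in a YOLOv5 dataset YAML
# actually exist under WSL, so a "Dataset not found" error can be traced to a bad path.
from pathlib import Path

import yaml  # PyYAML, already a YOLOv5 dependency

data_yaml = Path("coco.yaml")                # the file passed to --data
cfg = yaml.safe_load(data_yaml.read_text())

root = Path(cfg.get("path", "."))            # dataset root, e.g. ../datasets/coco
for key in ("train", "val", "test"):
    entry = cfg.get(key)
    if not entry:
        continue
    p = root / entry
    print(f"{key}: {p} -> {'OK' if p.exists() else 'MISSING'}")
```

Running it from the same directory used to launch train.py should print MISSING for any entry that points outside the WSL-visible filesystem (for example a /mnt/c/... path that does not exist).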
Hi @glenn-jocher, this is the new error code after I fixed the missing dataset issue:

```
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
train: weights=yolov5n.pt, cfg=, data=datasets/coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=2, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2023-5-26 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB)
                                                    CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
terminate called after throwing an instance of 'c10::Error'
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 11087) of binary: /usr/bin/python3
```
Hi @glenn-jocher, what does illegal memory access point to?
@cool112624 hello, Illegal memory access usually means that the program is attempting to access memory that it is not allowed to access. This can happen for a variety of reasons, such as trying to access a null pointer or trying to access memory that has already been freed. I hope this helps! Let us know if you have any further questions. Thank you.
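One common way to locate where such an illegal access happens is to relaunch the same training command with synchronous CUDA kernel launches and NCCL debug logging, so the traceback points at the operation that actually failed. The sketch below is a hypothetical relaunch helper, not a fix: `CUDA_LAUNCH_BLOCKING` and `NCCL_DEBUG` are standard PyTorch/NCCL environment variables, and the command is the one used earlier in this thread.

```python
# relaunch_debug.py -- hypothetical helper: rerun the DDP training command with
# CUDA errors reported synchronously and NCCL initialisation details logged.
import os
import subprocess

env = os.environ.copy()
env["CUDA_LAUNCH_BLOCKING"] = "1"  # surface CUDA errors at the offending call site
env["NCCL_DEBUG"] = "INFO"         # log NCCL setup and transport selection

cmd = [
    "python", "-m", "torch.distributed.run", "--nproc_per_node", "2",
    "train.py", "--batch", "32", "--epoch", "100",
    "--data", "coco.yaml", "--weights", "yolov5n.pt", "--device", "0,1",
]
subprocess.run(cmd, env=env, check=False)
```

The same effect can be had by exporting the two variables in the shell before running the original command; the point is only to make the failing operation visible in the stack trace.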
@glenn-jocher
@cool112624 hello, Thank you for reaching out to us. While memory availability is definitely a factor that can impact YOLOv5's performance, there could be other factors at play. You mentioned that your PyTorch and CUDA versions are up to date and running fine with a single GPU, and that's a good starting point. However, other factors, such as the size and complexity of your dataset, the batch size you're running, and even the specific hardware you're using can also play a role in determining performance. I would recommend checking your batch size and dataset to see if reducing the former or simplifying the latter improves performance. Additionally, if possible, testing on different hardware can also provide valuable insights. Let us know if you have any further questions or concerns. Thank you.
@glenn-jocher Hi, thank you for your reply. The batch sizes I have tried are 2, 4, 8, 16, and 32.
Hi @glenn-jocher, could you tell me about an environment in which you successfully ran this DDP multi-GPU training, so that I can replicate your environment and try it out? Thank you in advance.
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help. For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
@cool112624 Hello, I appreciate your thorough testing. Unfortunately, I cannot provide an environment where I've personally tested DDP multi-GPU training, as our testing and development environments are diverse and may not be reproducible in a generic setting. However, I encourage you to refer to the official YOLOv5 documentation and community forums for successful case studies and potential environment configurations. Collaborating with the YOLO community or reaching out to fellow users who have experience in DDP multi-GPU training could also be beneficial. Should you have any further questions or concerns, please don't hesitate to ask. Thank you.
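Independently of any particular environment, one way to narrow this down is to first confirm that plain PyTorch DDP over NCCL works across both GPUs under WSL at all, without YOLOv5 in the loop. The script below is a minimal sanity check with a hypothetical file name (`ddp_sanity.py`); it uses only documented torch.distributed calls and is launched with the same torch.distributed.run command as train.py. If this script also crashes with an illegal memory access, the problem lies in the WSL/driver/NCCL stack rather than in YOLOv5.

```python
# ddp_sanity.py -- hypothetical file name; minimal check that torch DDP/NCCL works
# across both GPUs under WSL. Launch with:
#   python -m torch.distributed.run --nproc_per_node 2 ddp_sanity.py
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.run sets RANK/LOCAL_RANK/WORLD_SIZE/MASTER_ADDR for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Each rank contributes its rank index; after all_reduce every rank should hold
    # the sum 0 + 1 = 1 for a two-GPU run.
    t = torch.tensor([float(dist.get_rank())], device=f"cuda:{local_rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()} on cuda:{local_rank} -> all_reduce result {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```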
Search before asking
Question
Hi, I am training on a Windows 10 machine running WSL (Windows Subsystem for Linux), but I keep receiving an illegal memory access error code. Does anyone have successful experience running this in WSL?
I am using two NVIDIA Titan X GPUs.
ERROR CODE
Additional
No response