Run tutorial: RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM #1469
Hello @nguyen14ck, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you. If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available. For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at [email protected].

Requirements
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
@nguyen14ck install Python 3.8 or later with all requirements.txt dependencies, including torch>=1.7.
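As a quick sanity check, the installed versions can be queried directly from torch to confirm the torch>=1.7 requirement is met; a minimal sketch, assuming a standard PyTorch install:

import torch

# report the pieces relevant to this issue: torch, CUDA toolkit and cuDNN versions
print("torch:", torch.__version__)                  # should be >= 1.7 for current YOLOv5
print("CUDA available:", torch.cuda.is_available())
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))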
Thanks, @glenn-jocher.
@nguyen14ck I'm not sure exactly what the problem may be. We've had some problems with Anaconda in the past, so one thing I would recommend is for you to simply create a new virtual Python 3.8 environment (venv), clone the latest repo (code changes daily), and install the requirements there. Other than that it may be an issue with your drivers. You can always try the docker container as well, as it should completely remove all environment problems.
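If it helps, the venv-plus-fresh-clone suggestion can also be scripted from Python; a minimal sketch, assuming git and Python 3.8+ are on PATH (the yolov5-venv directory name is just a placeholder), though running the equivalent shell commands directly works just as well:

import os
import subprocess
import venv
from pathlib import Path

env_dir = Path("yolov5-venv")            # placeholder location for the new environment
venv.create(env_dir, with_pip=True)      # create an isolated Python environment

# fresh clone of the latest code (the repo changes daily)
subprocess.run(["git", "clone", "https://github.com/ultralytics/yolov5"], check=True)

# install the repo requirements with the venv's own pip
pip = env_dir / ("Scripts/pip.exe" if os.name == "nt" else "bin/pip")
subprocess.run([str(pip), "install", "-r", "yolov5/requirements.txt"], check=True)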
Thanks, @glenn-jocher
@nguyen14ck sure. We don't have resources to help people with their local environments, which is the reason we offer the four validated environments. I would recommend you start from one of these:

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I meet the same bug if I use two NVIDIA graphics cards (a GTX 2070 and a GTX 1070 Ti).
If I use only the GTX 2070 or only the GTX 1070 Ti, the program runs normally! My env:
@blakeliu best practice is to only run Multi-GPU with identical cards.
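A small preflight check along these lines can flag a mixed-GPU setup before training; a minimal sketch using only standard torch calls:

import torch

# list every visible CUDA device and warn if the models differ,
# since multi-GPU training is best run on identical cards
names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
for i, name in enumerate(names):
    print(f"cuda:{i} -> {name}")
if len(set(names)) > 1:
    print("Warning: mixed GPU models detected; prefer identical cards or pick one with --device")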
@glenn-jocher Thank you for your advice.
For your information: it seems that it was a bad idea to use different types of GPU. In my case, I used two GPUs:
@tetsu-kikuchi interesting, thanks for the feedback! I've been thinking we should default to --device 0 rather than use all devices by default. Do you think this is a good idea?
@glenn-jocher Thank you for your response. Using multiple GPUs sometimes causes unexpected errors, and with GPU-related errors it is often hard to find out the underlying reason. So I think setting --device 0 as the default would be convenient, especially for beginners (including me).
TODO: Device 0 default rather than all available devices default.
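The intent of that TODO might look roughly like the following; this is a sketch of the idea only (the default_device helper is hypothetical), not the actual YOLOv5 device-selection code:

import torch

def default_device(requested: str = "") -> torch.device:
    # Return the requested device if given (e.g. "0", "0,1", "cpu" from --device),
    # otherwise fall back to cuda:0 when CUDA is available, else CPU.
    if requested:
        if requested == "cpu":
            return torch.device("cpu")
        return torch.device(f"cuda:{requested.split(',')[0]}")  # first listed GPU
    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(default_device())  # with no --device argument this prefers cuda:0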
Additional information: I paste below the error message when I set --device 0 or --device 1. I slightly customized the yolov5 code for my purpose, only for miscellaneous things mainly in utils/dataset.py.
The GPU information:
@tetsu-kikuchi since this error originates in torch you should probably raise your issue in the pytorch repository.
@tetsu-kikuchi also, your YOLOv5 code is very out of date. To update, you can git pull from inside your yolov5 directory or simply re-clone the latest repo.
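For reference, users who load YOLOv5 through PyTorch Hub rather than a local clone can refresh the cached code with a forced reload; a minimal sketch (the yolov5s model name is only an example):

import torch

# force_reload=True re-downloads ultralytics/yolov5 instead of reusing the cached hub copy
model = torch.hub.load("ultralytics/yolov5", "yolov5s", force_reload=True)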
Thanks for your guidance.
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs. Access additional YOLOv5 🚀 resources:
Access additional Ultralytics ⚡ resources:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
TODO removed as the original issue is now resolved. YOLOv5 training now defaults to device 0 if CUDA is available; multiple CUDA devices or CPU can be selected with:
python train.py --device 0,1,2,3
python train.py --device cpu
Issue #185 was closed, so I am opening this one.
🐛 Bug
...
Starting training for 3 epochs...
0%| | 0/8 [00:02<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 490, in
train(hyp, opt, device, tb_writer, wandb)
File "train.py", line 292, in train
scaler.scale(loss).backward()
File "/home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/autograd/init.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
Exception raised from operator() at /opt/conda/conda-bld/pytorch_1595629416375/work/aten/src/ATen/native/cudnn/Conv.cpp:1141 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f9ff06da77d in
To Reproduce (REQUIRED)
Output:
Expected behavior
Fusing layers...
Model Summary: 484 layers, 88922205 parameters, 0 gradients
Scanning labels ../coco/labels/val2017.cache (4952 found, 0 missing, 48 empty, 0 duplicate, for 5000 images): 5000it [00:00, 14785.71it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100% 157/157 [01:30<00:00, 1.74it/s]
all 5e+03 3.63e+04 0.409 0.754 0.672 0.484
Speed: 5.9/2.1/7.9 ms inference/NMS/total per 640x640 image at batch-size 32
Evaluating pycocotools mAP... saving runs/test/exp/yolov5x_predictions.json...
loading annotations into memory...
Done (t=0.43s)
Environment
Additional Information
Setup complete. Using torch 1.6.0 _CudaDeviceProperties(name='Quadro RTX 5000', major=7, minor=5, total_memory=16117MB, multi_processor_count=48)