Error when training solov2_d2

JimXu1989

@刘看山
Hello, I installed CUDA 11.3 and PyTorch 1.10, and I get this error when training solov2_d2:

[04/29 13:43:52 d2.data.common]: Serialized dataset takes 0.71 MiB
WARNING [04/29 13:43:52 d2.solver.build]: SOLVER.STEPS contains values larger than SOLVER.MAX_ITER. These values will be ignored.
/usr/local/lib/python3.8/dist-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/usr/local/lib/python3.8/dist-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
return self._float_to_str(self.smallest_subnormal)
/usr/local/lib/python3.8/dist-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/usr/local/lib/python3.8/dist-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
return self._float_to_str(self.smallest_subnormal)
[04/29 13:43:54 fvcore.common.checkpoint]: [Checkpointer] Loading from detectron2://ImageNetPretrained/MSRA/R-101.pkl ...
[04/29 13:43:54 d2.checkpoint.c2_model_loading]: Renaming Caffe2 weights ......
[04/29 13:43:54 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone.bottom_up:

Names in Model	Names in Checkpoint	Shapes
res2.0.conv1.*	res2_0_branch2a_{bn_*,w}	(64,) (64,) (64,) (64,) (64,64,1,1)
res2.0.conv2.*	res2_0_branch2b_{bn_*,w}	(64,) (64,) (64,) (64,) (64,64,3,3)
res2.0.conv3.*	res2_0_branch2c_{bn_*,w}	(256,) (256,) (256,) (256,) (256,64,1,1)
res2.0.shortcut.*	res2_0_branch1_{bn_*,w}	(256,) (256,) (256,) (256,) (256,64,1,1)
res2.1.conv1.*	res2_1_branch2a_{bn_*,w}	(64,) (64,) (64,) (64,) (64,256,1,1)
res2.1.conv2.*	res2_1_branch2b_{bn_*,w}	(64,) (64,) (64,) (64,) (64,64,3,3)
res2.1.conv3.*	res2_1_branch2c_{bn_*,w}	(256,) (256,) (256,) (256,) (256,64,1,1)
res2.2.conv1.*	res2_2_branch2a_{bn_*,w}	(64,) (64,) (64,) (64,) (64,256,1,1)
res2.2.conv2.*	res2_2_branch2b_{bn_*,w}	(64,) (64,) (64,) (64,) (64,64,3,3)
res2.2.conv3.*	res2_2_branch2c_{bn_*,w}	(256,) (256,) (256,) (256,) (256,64,1,1)
res3.0.conv1.*	res3_0_branch2a_{bn_*,w}	(128,) (128,) (128,) (128,) (128,256,1,1)
res3.0.conv2.*	res3_0_branch2b_{bn_*,w}	(128,) (128,) (128,) (128,) (128,128,3,3)
res3.0.conv3.*	res3_0_branch2c_{bn_*,w}	(512,) (512,) (512,) (512,) (512,128,1,1)
res3.0.shortcut.*	res3_0_branch1_{bn_*,w}	(512,) (512,) (512,) (512,) (512,256,1,1)
res3.1.conv1.*	res3_1_branch2a_{bn_*,w}	(128,) (128,) (128,) (128,) (128,512,1,1)
res3.1.conv2.*	res3_1_branch2b_{bn_*,w}	(128,) (128,) (128,) (128,) (128,128,3,3)
res3.1.conv3.*	res3_1_branch2c_{bn_*,w}	(512,) (512,) (512,) (512,) (512,128,1,1)
res3.2.conv1.*	res3_2_branch2a_{bn_*,w}	(128,) (128,) (128,) (128,) (128,512,1,1)
res3.2.conv2.*	res3_2_branch2b_{bn_*,w}	(128,) (128,) (128,) (128,) (128,128,3,3)
res3.2.conv3.*	res3_2_branch2c_{bn_*,w}	(512,) (512,) (512,) (512,) (512,128,1,1)
res3.3.conv1.*	res3_3_branch2a_{bn_*,w}	(128,) (128,) (128,) (128,) (128,512,1,1)
res3.3.conv2.*	res3_3_branch2b_{bn_*,w}	(128,) (128,) (128,) (128,) (128,128,3,3)
res3.3.conv3.*	res3_3_branch2c_{bn_*,w}	(512,) (512,) (512,) (512,) (512,128,1,1)
res4.0.conv1.*	res4_0_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,512,1,1)
res4.0.conv2.*	res4_0_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.0.conv3.*	res4_0_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.0.shortcut.*	res4_0_branch1_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,512,1,1)
res4.1.conv1.*	res4_1_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.1.conv2.*	res4_1_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.1.conv3.*	res4_1_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.10.conv1.*	res4_10_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.10.conv2.*	res4_10_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.10.conv3.*	res4_10_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.11.conv1.*	res4_11_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.11.conv2.*	res4_11_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.11.conv3.*	res4_11_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.12.conv1.*	res4_12_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.12.conv2.*	res4_12_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.12.conv3.*	res4_12_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.13.conv1.*	res4_13_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.13.conv2.*	res4_13_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.13.conv3.*	res4_13_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.14.conv1.*	res4_14_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.14.conv2.*	res4_14_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.14.conv3.*	res4_14_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.15.conv1.*	res4_15_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.15.conv2.*	res4_15_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.15.conv3.*	res4_15_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.16.conv1.*	res4_16_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.16.conv2.*	res4_16_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.16.conv3.*	res4_16_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.17.conv1.*	res4_17_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.17.conv2.*	res4_17_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.17.conv3.*	res4_17_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.18.conv1.*	res4_18_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.18.conv2.*	res4_18_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.18.conv3.*	res4_18_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.19.conv1.*	res4_19_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.19.conv2.*	res4_19_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.19.conv3.*	res4_19_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.2.conv1.*	res4_2_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.2.conv2.*	res4_2_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.2.conv3.*	res4_2_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.20.conv1.*	res4_20_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.20.conv2.*	res4_20_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.20.conv3.*	res4_20_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.21.conv1.*	res4_21_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.21.conv2.*	res4_21_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.21.conv3.*	res4_21_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.22.conv1.*	res4_22_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.22.conv2.*	res4_22_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.22.conv3.*	res4_22_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.3.conv1.*	res4_3_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.3.conv2.*	res4_3_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.3.conv3.*	res4_3_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.4.conv1.*	res4_4_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.4.conv2.*	res4_4_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.4.conv3.*	res4_4_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.5.conv1.*	res4_5_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.5.conv2.*	res4_5_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.5.conv3.*	res4_5_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.6.conv1.*	res4_6_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.6.conv2.*	res4_6_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.6.conv3.*	res4_6_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.7.conv1.*	res4_7_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.7.conv2.*	res4_7_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.7.conv3.*	res4_7_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.8.conv1.*	res4_8_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.8.conv2.*	res4_8_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.8.conv3.*	res4_8_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.9.conv1.*	res4_9_branch2a_{bn_*,w}	(256,) (256,) (256,) (256,) (256,1024,1,1)
res4.9.conv2.*	res4_9_branch2b_{bn_*,w}	(256,) (256,) (256,) (256,) (256,256,3,3)
res4.9.conv3.*	res4_9_branch2c_{bn_*,w}	(1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res5.0.conv1.*	res5_0_branch2a_{bn_*,w}	(512,) (512,) (512,) (512,) (512,1024,1,1)
res5.0.conv2.*	res5_0_branch2b_{bn_*,w}	(512,) (512,) (512,) (512,) (512,512,3,3)
res5.0.conv3.*	res5_0_branch2c_{bn_*,w}	(2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
res5.0.shortcut.*	res5_0_branch1_{bn_*,w}	(2048,) (2048,) (2048,) (2048,) (2048,1024,1,1)
res5.1.conv1.*	res5_1_branch2a_{bn_*,w}	(512,) (512,) (512,) (512,) (512,2048,1,1)
res5.1.conv2.*	res5_1_branch2b_{bn_*,w}	(512,) (512,) (512,) (512,) (512,512,3,3)
res5.1.conv3.*	res5_1_branch2c_{bn_*,w}	(2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
res5.2.conv1.*	res5_2_branch2a_{bn_*,w}	(512,) (512,) (512,) (512,) (512,2048,1,1)
res5.2.conv2.*	res5_2_branch2b_{bn_*,w}	(512,) (512,) (512,) (512,) (512,512,3,3)
res5.2.conv3.*	res5_2_branch2c_{bn_*,w}	(2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
stem.conv1.norm.*	res_conv1_bn_*	(64,) (64,) (64,) (64,)
stem.conv1.weight	conv1_w	(64, 3, 7, 7)

WARNING [04/29 13:43:54 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral3.{bias, weight}
backbone.fpn_lateral4.{bias, weight}
backbone.fpn_lateral5.{bias, weight}
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
ins_head.cate_pred.{bias, weight}
ins_head.cate_tower.0.weight
ins_head.cate_tower.2.weight
ins_head.kernel_pred.{bias, weight}
ins_head.kernel_tower.0.weight
ins_head.kernel_tower.2.weight
mask_head.conv_pred.0.weight
mask_head.conv_pred.1.{bias, weight}
mask_head.convs_all_levels.0.conv0.0.weight
mask_head.convs_all_levels.1.conv0.0.weight
mask_head.convs_all_levels.2.conv0.0.weight
mask_head.convs_all_levels.2.conv1.0.weight
mask_head.convs_all_levels.3.conv0.0.weight
mask_head.convs_all_levels.3.conv1.0.weight
mask_head.convs_all_levels.3.conv2.0.weight
WARNING [04/29 13:43:54 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
fc1000.{bias, weight}
[04/29 13:43:54 adet.trainer]: Starting training from iteration 0
/usr/local/lib/python3.8/dist-packages/detectron2/structures/image_list.py:88: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
max_size = (max_size + (stride - 1)) // stride * stride
/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:3631: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:3679: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/home/xss/Projects/wood/solov2_d2/adet/modeling/solov2/solov2.py:279: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
(center_w / upsampled_size[1]) // (1. / num_grid))
/home/xss/Projects/wood/solov2_d2/adet/modeling/solov2/solov2.py:281: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
(center_h / upsampled_size[0]) // (1. / num_grid))
/home/xss/Projects/wood/solov2_d2/adet/modeling/solov2/solov2.py:285: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
0, int(((center_h - half_h) / upsampled_size[0]) // (1. / num_grid)))
/home/xss/Projects/wood/solov2_d2/adet/modeling/solov2/solov2.py:287: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
num_grid - 1, int(((center_h + half_h) / upsampled_size[0]) // (1. / num_grid)))
/home/xss/Projects/wood/solov2_d2/adet/modeling/solov2/solov2.py:289: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
0, int(((center_w - half_w) / upsampled_size[1]) // (1. / num_grid)))
/home/xss/Projects/wood/solov2_d2/adet/modeling/solov2/solov2.py:291: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
num_grid - 1, int(((center_w + half_w) / upsampled_size[1]) // (1. / num_grid)))
Traceback (most recent call last):
File "/home/xss/Projects/wood/solov2_d2/tools/train_wood.py", line 250, in <module>
launch(
File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
File "/home/xss/Projects/wood/solov2_d2/tools/train_wood.py", line 244, in main
return trainer.train()
File "/home/xss/Projects/wood/solov2_d2/tools/train_wood.py", line 124, in train
self.train_loop(self.start_iter, self.max_iter)
File "/home/xss/Projects/wood/solov2_d2/tools/train_wood.py", line 113, in train_loop
self.run_step()
File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/train_loop.py", line 285, in run_step
losses.backward()
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 128, 184, 232]], which is output 0 of ReluBackward0, is at version 3; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Process finished with exit code 1
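
A side note on the repeated floordiv warnings in the log: they are UserWarnings, not the RuntimeError that stops training. The small sketch below only illustrates the difference the warning text describes, using the `torch.div(..., rounding_mode=...)` call it recommends. The tensor values are made up for illustration and are not taken from solov2.py.

```python
import torch

# Illustrative values only; one negative and one positive numerator.
a = torch.tensor([-7.0, 7.0])
b = torch.tensor(2.0)

# 'trunc' rounds toward zero, which is what the deprecated `//` did on tensors.
truncated = torch.div(a, b, rounding_mode="trunc")

# 'floor' rounds toward negative infinity, i.e. true floor division.
floored = torch.div(a, b, rounding_mode="floor")

print(truncated)
print(floored)
```

The two modes only disagree for negative quotients, so for the non-negative grid coordinates computed in the warned lines either mode should give the same result.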

SiChuanJay

@JimXu1989 How did you solve this problem?
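
The thread does not record a confirmed fix. For anyone debugging the same message, here is a minimal sketch of the general class of error reported above (a tensor that is "output 0 of ReluBackward0" being modified in place before `backward()`), together with the anomaly detection the hint mentions. The function names are hypothetical and this is not code from solov2_d2; whether the same pattern is the cause in that repository is an assumption to check against the traceback that anomaly detection prints.

```python
import torch

def relu_then_inplace_add(x):
    # Hypothetical helper: ReLU keeps its output for the backward pass ...
    y = torch.relu(x)
    # ... and this in-place add modifies that saved output.
    y += 1.0
    return y.sum()

def relu_then_add(x):
    # Same computation, but the add allocates a new tensor,
    # so the output saved by ReLU is left untouched.
    y = torch.relu(x)
    y = y + 1.0
    return y.sum()

x_bad = torch.randn(8, requires_grad=True)
error_message = None
try:
    # Anomaly detection additionally reports where the failing
    # forward operation was called.
    with torch.autograd.detect_anomaly():
        relu_then_inplace_add(x_bad).backward()
except RuntimeError as err:
    error_message = str(err)

x_ok = torch.randn(8, requires_grad=True)
relu_then_add(x_ok).backward()

print(error_message)
print(x_ok.grad)
```

In a real training script, calling `torch.autograd.set_detect_anomaly(True)` before `trainer.train()` should point to the forward line that produced the modified tensor. Candidates to look for there are `+=` on feature maps and modules built with `inplace=True`. Anomaly detection slows training, so it is best switched off again once the line is found.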