This repository was archived by the owner on Jan 26, 2022. It is now read-only.

@Constannnnnt Constannnnnt commented Jul 6, 2018

Description: I trained the network on 3 GPUs and interrupted the run before the final step, so only the checkpoint was saved, not the config. When I then tried to test the model, it failed unexpectedly at start = subinds[i][0] with the error "list index out of range".

Issue: At line 64, I think gpu_inds = range(NUM_GPUS) is more reasonable than gpu_inds = range(cfg.NUM_GPUS). Let me explain.

After subprocess.py imports the yaml and config file, cfg.NUM_GPUS is 8 instead of 3. (In train_net_step there is a statement that assigns cfg.NUM_GPUS = torch.cuda.device_count(), so training does not crash.) In my case NUM_GPUS = torch.cuda.device_count() = 3, so at line 56 the size of subinds is 3.

I let CUDA see all of my GPUs. Later, at line 64, if gpu_inds = range(cfg.NUM_GPUS) is used, the size of gpu_inds is 8, and the loop then crashes at line 68 because it indexes past the end of subinds, which has only 3 entries. Therefore gpu_inds = range(NUM_GPUS) is more reasonable at line 64.
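To make the mismatch concrete, here is a minimal sketch that reproduces the failure without any GPU. It is not the repository's code: split_indices and plan_subprocesses are hypothetical helpers written for illustration, and only subinds, gpu_inds, and the start = subinds[i][0] line mirror what is described above.

```python
def split_indices(total, num_chunks):
    """Split range(total) into num_chunks contiguous lists of indices (hypothetical helper)."""
    base, extra = divmod(total, num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def plan_subprocesses(total_range_size, num_visible_gpus, num_launch_gpus):
    """Return (gpu_ind, start, end) per worker (hypothetical stand-in for the launch loop).

    num_visible_gpus plays the role of NUM_GPUS (torch.cuda.device_count()).
    num_launch_gpus plays the role of whatever is passed to range() for gpu_inds.
    """
    # One chunk of work per visible GPU.
    subinds = split_indices(total_range_size, num_visible_gpus)
    gpu_inds = range(num_launch_gpus)
    plan = []
    for i, gpu_ind in enumerate(gpu_inds):
        start = subinds[i][0]      # raises IndexError once i >= len(subinds)
        end = subinds[i][-1] + 1
        plan.append((gpu_ind, start, end))
    return plan


# Consistent counts (3 chunks, 3 workers): every worker gets a range.
print(plan_subprocesses(90, 3, 3))

# Mismatched counts (3 chunks, 8 workers): fails on the fourth worker.
try:
    plan_subprocesses(90, 3, 8)
except IndexError as err:
    print("IndexError:", err)
```

With both counts taken from the same source the loop completes. With 3 chunks and 8 workers it fails on the fourth iteration, which matches the error I see.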

Please check whether my solution is correct. Thanks.

ternaus commented Sep 30, 2018

👍
