I am currently trying to make a translation model with Trnasformer model through PyTorch. Since I have 2 GPUs (2080ti x 2) available for training, I want to train the model through multi-gpu. Currently, the gpu is assigned to 0 and 1 respectively. The way I use multi-gpu is to put nn.DataParallel on the model object.
Declared encoder, decoder, and model objects
enc = Encoder(INPUT_DIM, HIDDEN_DIM, ENC_LAYERS, ENC_HEADS, ENC_PF_DIM, ENC_DROPOUT, device)
dec = Decoder(OUTPUT_DIM, HIDDEN_DIM, DEC_LAYERS, DEC_HEADS, DEC_PF_DIM, DEC_DROPOUT, device)
model = nn.DataParallel(Transformer(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device))
Since the above method has been used in various models of pytorch in the past, I thought it would work without any problems.
However, if I go through the actual training in this way, I get the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In [22], line 87
84 for epoch in range(N_EPOCHS):
85 start_time = time.time() # 시작 시간 기록
---> 87 train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
88 valid_loss = evaluate(model, validation_iterator, criterion)
90 end_time = time.time() # 종료 시간 기록
Cell In [18], line 15, in train(model, iterator, optimizer, criterion, clip)
11 optimizer.zero_grad()
13 # 출력 단어의 마지막 인덱스(<eos>)는 제외
14 # 입력을 할 때는 <sos>부터 시작하도록 처리
---> 15 output, _ = model(src, trg[:,:-1])
17 # output: [배치 크기, trg_len - 1, output_dim]
18 # trg: [배치 크기, trg_len]
20 output_dim = output.shape[-1]
File ~/anaconda3/envs/jki_pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
1126 # If we don't have any hooks, we want to skip the rest of the logic in
1127 # this function, and just call forward.
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
...
return forward_call(*input, **kwargs)
File "/tmp/ipykernel_203464/284771533.py", line 31, in forward
src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
As you can see from the above error, the part where the error occurred was line 15 of the train() function. The train function is defined as:
def train(model, iterator, optimizer, criterion, clip):
model.train() # 학습 모드
epoch_loss = 0
for i, batch in enumerate(tqdm(iterator, ascii=True, ncols=100)):
src = batch.src
trg = batch.trg
optimizer.zero_grad()
output, _ = model(src, trg[:,:-1])
output_dim = output.shape[-1]
output = output.contiguous().view(-1, output_dim)
trg = trg[:,1:].contiguous().view(-1)
loss = criterion(output, trg)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
optimizer.step()
epoch_loss += loss.item()
return epoch_loss / len(iterator)
To solve this error, I tried nn.DataParallel in the encoder decoder, but the same error is returned. I also tried the model in the train() function.
Also, I tracked the variables to see where the Tensors are assigned to which cuda, but only cuda, not cuda:0, cuda:1. I’ve tried so many different things and still can’t find this error. If anyone knows please help me
Read more here: Source link