I have a PyTorch-based, 3-layer-deep, transformer-based model that I use for classifying text inputs. I have a fixed input X containing a batch of 10 text inputs. I call the inference function on X 25 times in a row and record the time taken by each call. I see periodic spikes in the inference time. Why is this happening?
Note: I use time.time() to measure the per-call deltas.
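Roughly how the timings are collected (a minimal sketch; self.infer is a placeholder name for my wrapper around the inference code shown below, and X is the fixed batch of 10 texts):

import time

timings = []
for _ in range(25):
    start = time.time()
    logits = self.infer(X)               # placeholder for the inference code shown below
    timings.append(time.time() - start)  # per-call wall-clock delta
print(timings)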
Inference code (the "infer" part in the graphs below):
# move the pre-tokenized numpy batch onto the GPU
input_ids = torch.from_numpy(input_ids).to(self.device)
attention_masks = torch.from_numpy(attention_masks).to(self.device)
# task-id tensor, one entry of -1 per example in the batch
b_input_task = torch.full((input_ids.shape[0],), -1, dtype=torch.int32).to(self.device)
# forward pass without building the autograd graph
with torch.no_grad():
    result = self.opt_model((input_ids, attention_masks, b_input_task))
# copy the logits back to the host as a numpy array
logits = result.detach().cpu().numpy()
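For context, the numpy arrays fed into this snippet are produced upstream by my tokenizer; something along these lines (purely a hypothetical sketch using the Hugging Face tokenizer API, not my actual preprocessing code):

from transformers import AutoTokenizer  # assumption: a Hugging Face tokenizer, not necessarily what I use

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
input_ids = enc["input_ids"]             # numpy int array, shape (batch, seq_len)
attention_masks = enc["attention_mask"]  # numpy array, same shape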
I notice that it is the result = self.opt_model… line whose timing varies non-deterministically.
I already call self.opt_model.eval() in the __init__ of my class, which is constructed once at the very beginning. I am running this on a V100 GPU.
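For what it's worth, here is a minimal sketch of how I could time just that forward call in isolation, synchronizing the GPU before and after so the wall-clock delta covers the full kernel execution (torch.cuda.synchronize() is standard PyTorch; the variable names are the ones from my snippet above):

torch.cuda.synchronize(self.device)  # wait for any previously queued GPU work
start = time.time()
with torch.no_grad():
    result = self.opt_model((input_ids, attention_masks, b_input_task))
torch.cuda.synchronize(self.device)  # make sure the forward pass has actually finished
forward_time = time.time() - start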
Graph with batch size 10 (the same 10 inputs, inferred repeatedly):
Graph with batch size 1 (a single input, inferred repeatedly):
If I add a 5-second delay between the repeated inference calls, the timings change and come close to what I expect (the delayed loop is sketched below):
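The delayed variant only adds a sleep between iterations (same assumed placeholder names as in the first sketch):

for _ in range(25):
    start = time.time()
    logits = self.infer(X)
    timings.append(time.time() - start)
    time.sleep(5)  # 5-second gap before the next inference call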
What is causing this behaviour?