SageMaker PyTorch Endpoint: NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND) | AWS re:Post

Hi, I wanted to raise awareness of this issue (please redirect me if this is not the right place). I created a SageMaker endpoint and sent an image to it, which causes the error shown below. The CloudWatch log, also attached below, indicates that a function is missing in the pynvml library. I added a requirements.txt that installs nvgpu and pynvml, but the log showed that both packages were already installed. The only relevant discussion I could find is github.com/pytorch/serve/issues/1813. For completeness, I checked the logs and the TorchServe version is 0.7.1. The last activity on that GitHub issue was last year, so I am curious whether anyone has found a solution. I appreciate any help!
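To confirm which versions of these packages the container actually has, something like the following could be called from inference.py and its result printed to the log. This is only a sketch and not part of my original setup: the helper name installed_versions is hypothetical, and the package list is my guess at what is relevant here.

```python
from importlib import metadata


def installed_versions(packages=("pynvml", "nvgpu", "torchserve", "torch")):
    """Map each package name to its installed version, or None if it is absent.

    Hypothetical helper; the default package list is an assumption.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions
```

Calling print(installed_versions()) at import time in inference.py should make the versions show up in the endpoint's CloudWatch log stream, which makes it easier to compare against the versions discussed in the GitHub issue.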

I created the endpoint in SageMaker as follows:

from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=model_bucket,  # S3 URI of the model archive
    role=role,  # IAM execution role used by SageMaker
    entry_point="inference.py",
    source_dir="code",
    py_version="py39",
    framework_version="1.13",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)

I then call the endpoint to predict:

# Load and encode the image
import base64

with open('zebra.jpg', 'rb') as img:
    image = img.read()

image_base64 = base64.b64encode(image).decode('utf-8')

response = predictor.predict(image_base64, initial_args={'ContentType': 'application/x-image'})

The specific error message I receive is the following, which directs me to CloudWatch:

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary and could not load the entire response body.

CloudWatch Error Message
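For anyone who wants to pull the same log lines without the console, here is a minimal sketch using boto3. It assumes the endpoint logs are in the standard /aws/sagemaker/Endpoints/<endpoint-name> log group and that the simple filter pattern "NVMLError" matches the relevant lines. The function names and the endpoint name "my-endpoint" are placeholders, not values from my deployment.

```python
def endpoint_log_group(endpoint_name):
    """Return the CloudWatch log group SageMaker uses for an endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def find_nvml_errors(logs_client, endpoint_name, pattern="NVMLError"):
    """Collect log messages matching `pattern` from the endpoint's log group.

    `logs_client` is expected to be a boto3 CloudWatch Logs client.
    The default pattern is an assumption about how the error is logged.
    """
    messages = []
    kwargs = {
        "logGroupName": endpoint_log_group(endpoint_name),
        "filterPattern": pattern,
    }
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        # More results are available; request the next page.
        kwargs["nextToken"] = token


# Example usage (requires AWS credentials; "my-endpoint" is a placeholder):
#   import boto3
#   for line in find_nvml_errors(boto3.client("logs"), "my-endpoint"):
#       print(line)
```

Passing the client in as an argument keeps the function easy to try with a stub before pointing it at a real account.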
