TorchServe throws 400 `DownloadArchiveException` when the process does not have write access to the model store folder

Hello.
I recently installed an EKS cluster with TorchServe following the tutorial at github.com/pytorch/serve/tree/master/kubernetes/EKS, but I am having trouble uploading a model.

When I try to upload a model via either of the following commands:

curl -X POST  "http://$HOST:8081/models?url=http%3A//54.190.129.247%3A8222/model_ubuntu_2dd0aac04a22d6a0.mar"

curl -X POST  "http://$HOST:8081/models?url=http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"
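
For reference, the percent-encoded form of the url parameter used in the first command can be generated instead of typed by hand. A minimal sketch, assuming python3 is available on the machine running curl (MAR_URL is just an illustrative variable name):

    MAR_URL="http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"
    # Percent-encode the whole value, including ':' and '/', before putting it in the query string
    ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$MAR_URL")
    curl -X POST "http://$HOST:8081/models?url=${ENCODED}"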

I am getting the following error:

    {
      "code": 400,
      "type": "DownloadArchiveException",
      "message": "Failed to download archive from: http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"
    }

However, http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar is a valid URL.
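
One thing worth ruling out is whether that URL is reachable from inside the pod itself, since the pod's network path can differ from the workstation's. A minimal check, assuming curl is available inside the pytorch/torchserve container:

    # Fetch only the response headers for the .mar file from inside the running pod
    kubectl exec -n default torchserve-6d4d5c8c89-zmnp9 -- \
      curl -sSI http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar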

kubectl describe pod -n default torchserve-6d4d5c8c89-zmnp9:


Name:         torchserve-6d4d5c8c89-zmnp9
Namespace:    default
Priority:     0
Node:         ip-192-168-57-45.us-west-2.compute.internal/192.168.57.45
Start Time:   Thu, 26 Aug 2021 13:13:21 -0700
Labels:       app=torchserve
              pod-template-hash=6d4d5c8c89
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.38.125
IPs:
  IP:           192.168.38.125
Controlled By:  ReplicaSet/torchserve-6d4d5c8c89
Containers:
  torchserve:
    Container ID:  docker://a64f5ef418c569249c1c05fe3056d808c2e22b79c203aed05017580bea132cc0
    Image:         pytorch/torchserve:latest
    Image ID:      docker-pullable://pytorch/torchserve@sha256:3c290c60cb89bca38fbf1d6a36ea99554b3dbb9d32cb89ed434828c5b3fd2c73
    Ports:         8080/TCP, 8081/TCP, 8082/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      torchserve
      --start
      --model-store
      /home/model-server/shared/model-store/
      --ts-config
      /home/model-server/shared/config/config.properties
    State:          Running
      Started:      Thu, 26 Aug 2021 13:13:22 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          4Gi
      nvidia.com/gpu:  0
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  0
    Environment:       <none>
    Mounts:
      /home/model-server/shared/ from persistent-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z8vb9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  model-store-claim
    ReadOnly:   false
  default-token-z8vb9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-z8vb9
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  17m   default-scheduler  Successfully assigned default/torchserve-6d4d5c8c89-zmnp9 to ip-192-168-57-45.us-west-2.compute.internal
  Normal  Pulled     17m   kubelet            Container image "pytorch/torchserve:latest" already present on machine
  Normal  Created    17m   kubelet            Created container torchserve
  Normal  Started    17m   kubelet            Started container torchserve

config.properties:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
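
Given the diagnosis in the title (the TorchServe process has no write access to the model store folder, which is where it saves downloaded archives), a quick way to confirm is to test write access from inside the container. A minimal sketch, assuming a shell is available in the image and using the model store path reported in the startup log below:

    # Check ownership/permissions of the PVC-backed model store
    kubectl exec -n default torchserve-6d4d5c8c89-zmnp9 -- \
      ls -ld /home/model-server/shared/model-store/
    # Try to create a file as the user the TorchServe process runs as
    kubectl exec -n default torchserve-6d4d5c8c89-zmnp9 -- \
      touch /home/model-server/shared/model-store/.write_test

If the touch fails with "Permission denied", the directory on the persistent volume is owned by a different UID than the TorchServe process; adjusting ownership (for example with an initContainer that chowns the mount, or a securityContext fsGroup on the pod) is the usual remedy.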

Output of access.log:

2021-08-30 01:28:19,091 [INFO ] epollEventLoopGroup-3-12 ACCESS_LOG - /192.168.24.51:2472 "POST /models?url=http://34.219.222.97:8221/model_ubuntu_09888c953c68c1fa.mar%26model_name=aivanou HTTP/1.1" 400 6
2021-08-30 01:28:19,091 [INFO ] epollEventLoopGroup-3-12 TS_METRICS - Requests4XX.Count:1|#Level:Host|#hostname:torchserve-69494c8469-8f8z8,timestamp:null
2021-08-30 01:28:20,568 [INFO ] epollEventLoopGroup-3-13 ACCESS_LOG - /192.168.32.146:61380 "POST /models?url=http://34.219.222.97:8221/model_ubuntu_09888c953c68c1fa.mar&model_name=aivanou HTTP/1.1" 400 7
2021-08-30 01:28:20,568 [INFO ] epollEventLoopGroup-3-13 TS_METRICS - Requests4XX.Count:1|#Level:Host|#hostname:torchserve-69494c8469-8f8z8,timestamp:null
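
Note that in the first entry the & before model_name was percent-encoded (%26), so model_name was treated as part of the url value, while the second entry passed it as a separate parameter; both variants still returned 400, which points to a server-side cause rather than the request format. The underlying exception is usually visible as a stack trace in the TorchServe log inside the pod. A quick way to pull it, assuming the default ts_log.log file name under the Log dir reported at startup:

    # Container stdout/stderr
    kubectl logs -n default torchserve-6d4d5c8c89-zmnp9
    # Full server log written to the log dir (default file name ts_log.log)
    kubectl exec -n default torchserve-6d4d5c8c89-zmnp9 -- \
      tail -n 100 /home/model-server/logs/ts_log.log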

ts log output:

2021-08-30 03:00:50,425 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2021-08-30 03:00:50,609 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.4.2
TS Home: /usr/local/lib/python3.6/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 2048 M
Python executable: /usr/bin/python3
Config file: /home/model-server/shared/config/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/model-server/shared/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/model-server/shared/model-store
Model config: N/A
2021-08-30 03:00:50,618 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2021-08-30 03:00:50,660 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-08-30 03:00:50,740 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-08-30 03:00:50,740 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2021-08-30 03:00:50,742 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2021-08-30 03:00:50,742 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-08-30 03:00:50,743 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
2021-08-30 03:03:28,587 [DEBUG] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2021-08-30 03:03:28,588 [DEBUG] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2021-08-30 03:03:28,588 [INFO ] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2021-08-30 03:06:44,068 [DEBUG] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1 for model tiny_image_net_aivanou_8df333374e4d115f
2021-08-30 03:06:44,069 [DEBUG] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1 for model tiny_image_net_aivanou_8df333374e4d115f
2021-08-30 03:06:44,069 [INFO ] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelManager - Model tiny_image_net_aivanou_8df333374e4d115f loaded.
