Spawn timeouts on autoscaling – Zero to JupyterHub on Kubernetes

Hey folks, my team is using the JupyterHub Helm chart (v0.11.1) with the userScheduler enabled, and we're seeing spawn timeouts caused by pods waiting for a large image to finish pulling. It only seems to happen when two notebooks are spawned in rapid succession and the first spawn triggers a scale-out event. The sequence appears to be: the first spawn evicts a placeholder pod, which forces a scale-out, and the evicted placeholder gets scheduled onto the new node. Before that placeholder can pull the image onto the new node, the second spawn happens and evicts the same placeholder pod, so the second user pod lands on the new node, which hasn't finished pulling the singleuser server image.
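For reference, the scheduling-related part of our config looks roughly like this. The values below are illustrative rather than our exact setup, and the keys are the ones documented in the chart's values schema:

```yaml
# Illustrative values.yaml excerpt for the Z2JH Helm chart (v0.11.x).
# Replica count and image name are placeholders, not our real settings.
scheduling:
  userScheduler:
    enabled: true        # pack user pods onto the most-utilized nodes
  podPriority:
    enabled: true        # lets real user pods preempt (evict) placeholder pods
  userPlaceholder:
    enabled: true
    replicas: 2          # placeholders that reserve headroom and trigger scale-out
singleuser:
  image:
    name: our-registry/large-singleuser-image   # hypothetical, large image
    tag: latest
```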

Instead, I would expect the second pod to evict an "older" (i.e., less recently evicted) placeholder pod, so that it is scheduled on a node that pulled the singleuser image long ago. Has anyone else run into this problem before? Is it fixed in a newer version of the chart?
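In case it's useful, these are the knobs we're considering as a stopgap while we figure out the eviction ordering. Again, a sketch with illustrative values, assuming the chart keys behave as documented:

```yaml
# Possible mitigations (sketch, not a confirmed fix):
singleuser:
  startTimeout: 600      # allow spawns to survive a slow image pull (default is 300s)
prePuller:
  continuous:
    enabled: true        # DaemonSet that pulls the singleuser image onto new nodes as they join
scheduling:
  userPlaceholder:
    replicas: 4          # keep more warm capacity so back-to-back spawns land on ready nodes
```

Extra placeholders and the continuous pre-puller only shrink the window where a node exists without the image; they don't change which placeholder gets evicted, which is the behavior I'm asking about.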
