Debugging failing deployments

Quick introduction to deployments

  • Valohai deployments are primarily used for online inference

  • All deployments are served from a Kubernetes cluster

  • Each deployment can contain multiple versions, and each version can contain multiple HTTP endpoints

  • For more details, see Core Concepts: Deployments

The most straightforward way of debugging is using the endpoint logs. If the error was in the code runtime, the logs should show a stack trace or an error message but this depends on the programming language and libraries used.

You can also print additional details yourself to keep track how the code is progressing.

But sometimes logging doesn’t contain enough information about the surrounding infrastructure.

Valohai deployment endpoints report their internal state under the Cluster Status tab. These more advanced details can give hints why the endpoint might be failing, for example if the previous container state reads OOMKilled, the endpoint was taking more memory than it requested and it was killed by the Kubernetes cluster.

It’s recommended to increase your endpoint replica count on the version endpoint listing if you want to make your endpoint less prone to become unavailable. This effectively means there will be more web servers running and one of them crashing doesn’t make the whole endpoint unresponsive.

Cluster status a handy way to see more technical details of the endpoint.