Troubleshoot Endpoints

Diagnose and fix issues with deployment endpoints using logs, cluster status, and local testing.

Test locally first

Before deploying to Valohai, run your endpoint locally to catch issues early.

Benefits:

  • Immediate feedback on code errors

  • Easier debugging with local tools

  • Faster iteration without waiting for builds

Test your FastAPI endpoint:

pip install -r requirements-deployment.txt
uvicorn predict:app --reload

Visit http://localhost:8000/docs to test your endpoints interactively.

Check endpoint logs

The most direct way to debug runtime issues:

  1. Open your deployment

  2. Select the failing version

  3. Click the Log tab

What to look for:

  • Python stack traces

  • Import errors

  • Model loading failures

  • Request/response errors

Add custom logging

Enhance debugging by logging key events in your code:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.post("/predict")
# `app` (your FastAPI instance) and `model` are defined elsewhere in predict.py
def predict(data: dict):
    logger.info(f"Received prediction request with {len(data)} features")
    
    try:
        result = model.predict(data)
        logger.info(f"Prediction successful: {result}")
        return result
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        raise

These logs appear in the endpoint logs, making it easier to trace execution flow.

Check cluster status

For infrastructure-level issues, check the Cluster Status tab:

  1. Navigate to your deployment version

  2. Click Cluster Status

  3. Review pod status and events

Common issues:

OOMKilled (Out of Memory): Your endpoint consumed more memory than allocated, so Kubernetes terminated it.

Solution: Increase memory allocation in valohai.yaml:

- endpoint:
    name: predict
    memory_limit: 2048  # Increase from default

ImagePullBackOff: Kubernetes can't pull your Docker image.

Solution: Verify the base image exists and is accessible.

CrashLoopBackOff: Your endpoint starts but immediately crashes.

Solution: Check logs for startup errors (missing files, import failures).
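To see how close your endpoint gets to its memory limit before an OOMKilled, you can log peak memory usage from inside the process. A small standard-library sketch (note that ru_maxrss units are platform-dependent):

```python
import logging
import resource

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_peak_memory():
    # ru_maxrss is kilobytes on Linux, bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logger.info(f"Peak resident memory (ru_maxrss): {peak}")
    return peak
```

Calling this after loading your model shows whether the model alone already approaches the memory allocated in valohai.yaml.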

Common deployment issues

Syntax errors in Python code:

  • Test locally before deploying

  • Check build logs for syntax errors during image creation

Missing dependencies:

  • Verify all packages are in requirements-deployment.txt

  • Pin versions to avoid surprises: tensorflow==2.5.1

Model file not found:

  • Confirm the file path in valohai.yaml matches your code

  • Check that you selected model files when creating the version

Uvicorn not found:

  • Install it in requirements-deployment.txt

  • Update server-command to use the installed path: ~/.local/bin/uvicorn
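Taken together, a pinned requirements-deployment.txt might look like this (versions are illustrative; pin the versions you actually tested with):

```text
fastapi==0.104.1
uvicorn==0.24.0
tensorflow==2.5.1
```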

Deployment stuck in "Pending"

A deployment version shows "Pending" until it has been fully deployed and is ready to accept requests.

What "Pending" means: Valohai is building your Docker image, deploying it to Kubernetes, or waiting for health checks to pass. This usually takes 2-5 minutes.

If it's stuck for more than 10 minutes, check:

  1. Endpoint logs for runtime errors:

    • Python syntax errors

    • Missing dependencies

    • Model loading failures

  2. Cluster Status for infrastructure issues:

    • OOMKilled (out of memory)

    • ImagePullBackOff (can't pull Docker image)

    • CrashLoopBackOff (endpoint crashes on startup)

  3. Build logs (if available):

    • Dependency installation failures

    • Base image not found

Common fixes:

  • Increase memory allocation if you see OOMKilled

  • Verify all packages are in requirements-deployment.txt

  • Confirm model files aren't too large for allocated resources

  • Test your endpoint locally before deploying
