Troubleshoot Endpoints

Diagnose and fix issues with deployment endpoints using logs, cluster status, and local testing.

Test locally first

Before deploying to Valohai, run your endpoint locally to catch issues early.

Benefits:

  • Immediate feedback on code errors

  • Easier debugging with local tools

  • Faster iteration without waiting for builds

Test your FastAPI endpoint:

pip install -r requirements-deployment.txt
uvicorn predict:app --reload

Visit http://localhost:8000/docs to test your endpoints interactively.

Check endpoint logs

The most direct way to debug runtime issues:

  1. Open your deployment

  2. Select the failing version

  3. Click the Log tab

What to look for:

  • Python stack traces

  • Import errors

  • Model loading failures

  • Request/response errors

Add custom logging

Enhance debugging by logging key events in your code:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.post("/predict")
# `app` (your FastAPI instance) and `model` are defined elsewhere in predict.py
def predict(data: dict):
    logger.info(f"Received prediction request with {len(data)} features")
    
    try:
        result = model.predict(data)
        logger.info(f"Prediction successful: {result}")
        return result
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        raise

These logs appear in the endpoint logs, making it easier to trace execution flow.

Check cluster status

For infrastructure-level issues, check the Cluster Status tab:

  1. Navigate to your deployment version

  2. Click Cluster Status

  3. Review pod status and events

Common issues:

OOMKilled (Out of Memory): Your endpoint consumed more memory than allocated, so Kubernetes terminated it.

Solution: Increase memory allocation in valohai.yaml:

- endpoint:
    name: predict
    memory_limit: 2048  # Increase from default

ImagePullBackOff: Kubernetes can't pull your Docker image.

Solution: Verify the base image exists and is accessible.

CrashLoopBackOff: Your endpoint starts but immediately crashes.

Solution: Check logs for startup errors (missing files, import failures).
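To see how close your endpoint gets to its memory limit before an OOMKilled, you can log peak memory usage from inside the process. A small standard-library sketch (note that ru_maxrss units are platform-dependent):

```python
import logging
import resource

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_peak_memory():
    # ru_maxrss is kilobytes on Linux, bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logger.info(f"Peak resident memory (ru_maxrss): {peak}")
    return peak
```

Calling this after loading your model shows whether the model alone already approaches the memory allocated in valohai.yaml.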

Common deployment issues

Syntax errors in Python code:

  • Test locally before deploying

  • Check build logs for syntax errors during image creation

Missing dependencies:

  • Verify all packages are in requirements-deployment.txt

  • Pin versions to avoid surprises: tensorflow==2.5.1

Model file not found:

  • Confirm the file path in valohai.yaml matches your code

  • Check that you selected model files when creating the version

Uvicorn not found:

  • Install it in requirements-deployment.txt

  • Update server-command to use the installed path: ~/.local/bin/uvicorn
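Taken together, a pinned requirements-deployment.txt might look like this (versions are illustrative; pin the versions you actually tested with):

```text
fastapi==0.104.1
uvicorn==0.24.0
tensorflow==2.5.1
```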

Deployment stuck in "Pending"

A deployment version shows "Pending" until it has been fully deployed and is ready to accept requests.

What "Pending" means: Valohai is building your Docker image, deploying it to Kubernetes, or waiting for health checks to pass. This usually takes 2-5 minutes.

If it's stuck for more than 10 minutes, check:

  1. Endpoint logs for runtime errors:

    • Python syntax errors

    • Missing dependencies

    • Model loading failures

  2. Cluster Status for infrastructure issues:

    • OOMKilled (out of memory)

    • ImagePullBackOff (can't pull Docker image)

    • CrashLoopBackOff (endpoint crashes on startup)

  3. Build logs (if available):

    • Dependency installation failures

    • Base image not found

Common fixes:

  • Increase memory allocation if you see OOMKilled

  • Verify all packages are in requirements-deployment.txt

  • Confirm model files aren't too large for allocated resources

  • Test your endpoint locally before deploying
