Troubleshoot Endpoints
Diagnose and fix issues with deployment endpoints using logs, cluster status, and local testing.
Test locally first
Before deploying to Valohai, run your endpoint locally to catch issues early.
Benefits:
Immediate feedback on code errors
Easier debugging with local tools
Faster iteration without waiting for builds
Test your FastAPI endpoint:
```shell
pip install -r requirements-deployment.txt
uvicorn predict:app --reload
```

Visit http://localhost:8000/docs to test your endpoints interactively.
Check endpoint logs
The most direct way to debug runtime issues:
Open your deployment
Select the failing version
Click the Log tab
What to look for:
Python stack traces
Import errors
Model loading failures
Request/response errors
Add custom logging
Enhance debugging by logging key events in your code:
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.post("/predict")
def predict(data: dict):
    logger.info(f"Received prediction request with {len(data)} features")
    try:
        result = model.predict(data)
        logger.info(f"Prediction successful: {result}")
        return result
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        raise
```

These logs appear in the endpoint logs, making it easier to trace execution flow.
Check cluster status
For infrastructure-level issues, check the Cluster Status tab:
Navigate to your deployment version
Click Cluster Status
Review pod status and events
Common issues:
OOMKilled (Out of Memory): Your endpoint consumed more memory than allocated, and Kubernetes terminated it.
Solution: Increase memory allocation in valohai.yaml:

```yaml
- endpoint:
    name: predict
    memory_limit: 2048  # Increase from default
```

ImagePullBackOff: Kubernetes can't pull your Docker image.
Solution: Verify the base image exists and is accessible.
CrashLoopBackOff: Your endpoint starts but immediately crashes.
Solution: Check logs for startup errors (missing files, import failures).
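CrashLoopBackOff failures are often plain import errors. As a stdlib-only sketch, you can import your endpoint module locally the same way the server would at startup, so the failure surfaces immediately with a full traceback (the module name to check is yours to fill in):

```python
import importlib
import traceback

def check_startup(module_name: str) -> bool:
    """Import the endpoint module as the ASGI server would at startup.

    Returns True on success; on failure, prints the same traceback
    you would otherwise have to hunt for in the endpoint logs.
    """
    try:
        importlib.import_module(module_name)
        return True
    except Exception:
        traceback.print_exc()
        return False

# e.g. run check_startup("predict") before deploying
```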
Common deployment issues
Syntax errors in Python code:
Test locally before deploying
Check build logs for syntax errors during image creation
Missing dependencies:
Verify all packages are in requirements-deployment.txt
Pin versions to avoid surprises:

```
tensorflow==2.5.1
```
Model file not found:
Confirm the file path in valohai.yaml matches your code
Check that you selected model files when creating the version
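To fail fast with a readable error instead of a bare exception deep in a stack trace, you can check the model path explicitly at startup. A sketch (the function name is illustrative; use whatever path your valohai.yaml declares):

```python
import os

def load_model_bytes(path: str) -> bytes:
    """Read the model file, failing with a message that points at valohai.yaml."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"Model file not found at {path!r}; check that the path matches "
            "the files declared in valohai.yaml and that the model files "
            "were selected when creating the deployment version"
        )
    with open(path, "rb") as f:
        return f.read()
```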
Uvicorn not found:
Install it in requirements-deployment.txt
Update server-command to use the installed path: ~/.local/bin/uvicorn
Deployment stuck in "Pending"
A deployment version shows "Pending" until it is fully deployed and ready to accept requests.
What "Pending" means: Valohai is building your Docker image, deploying to Kubernetes, or waiting for health checks to pass. This usually takes 2-5 minutes.
If it's stuck for more than 10 minutes, check:
Endpoint logs for runtime errors:
Python syntax errors
Missing dependencies
Model loading failures
Cluster Status for infrastructure issues:
OOMKilled (out of memory)
ImagePullBackOff (can't pull Docker image)
CrashLoopBackOff (endpoint crashes on startup)
Build logs (if available):
Dependency installation failures
Base image not found
Common fixes:
Increase memory allocation if you see OOMKilled
Verify all packages are in requirements-deployment.txt
Confirm model files aren't too large for allocated resources
Test your endpoint locally before deploying
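The waiting step above can be automated with a simple readiness poll. A sketch, assuming you supply a probe callable that hits your endpoint's health URL and returns True when it responds:

```python
import time

def wait_until_ready(probe, timeout_s=600, interval_s=10) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `probe` is any zero-argument callable, e.g. one that sends an HTTP
    request to the deployment URL and returns True on a 200 response.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

If the poll times out, fall back to the checklist above: endpoint logs first, then Cluster Status.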