# Troubleshoot Endpoints

Diagnose and fix issues with deployment endpoints using logs, cluster status, and local testing.

### Test locally first

Before deploying to Valohai, run your endpoint locally to catch issues early.

**Benefits:**

* Immediate feedback on code errors
* Easier debugging with local tools
* Faster iteration without waiting for builds

**Test your FastAPI endpoint:**

```shell
pip install -r requirements-deployment.txt
uvicorn predict:app --reload
```

Visit `http://localhost:8000/docs` to test your endpoints interactively.

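Once the server is running, you can also exercise the endpoint from a script. A minimal sketch using only the standard library; the `/predict` route and the payload shape are placeholders you should adapt to your own app:

```python
import json
import urllib.error
import urllib.request


def call_endpoint(url, payload, timeout=5):
    """POST a JSON payload to the endpoint; return the decoded response, or None on failure."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except OSError as e:  # URLError subclasses OSError; covers refused connections and timeouts
        print(f"Request failed: {e}")
        return None


if __name__ == "__main__":
    # Placeholder payload; match it to the schema your endpoint expects.
    print(call_endpoint("http://localhost:8000/predict", {"feature_a": 1.0}))
```

Scripted calls like this make it easy to replay the exact request that fails in production.
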
### Check endpoint logs

The most direct way to debug runtime issues:

1. Open your deployment
2. Select the failing version
3. Click the **Log** tab

**What to look for:**

* Python stack traces
* Import errors
* Model loading failures
* Request/response errors

### Add custom logging

Enhance debugging by logging key events in your code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@app.post("/predict")
def predict(data: dict):
    logger.info(f"Received prediction request with {len(data)} features")

    try:
        result = model.predict(data)
        logger.info(f"Prediction successful: {result}")
        return result
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        raise
```

These logs appear in the endpoint logs, making it easier to trace execution flow.

### Check cluster status

For infrastructure-level issues, check the **Cluster Status** tab:

1. Navigate to your deployment version
2. Click **Cluster Status**
3. Review pod status and events

**Common issues:**

**OOMKilled (Out of Memory)**

Your endpoint consumed more memory than allocated, and Kubernetes terminated it.

**Solution:** Increase memory allocation in `valohai.yaml`:

```yaml
- endpoint:
    name: predict
    memory_limit: 2048  # Increase from default
```

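To pick a sensible `memory_limit`, it helps to know roughly how much memory loading your model takes. A rough sketch using `tracemalloc` from the standard library; note that it only sees allocations made through Python, so frameworks that allocate natively (e.g. TensorFlow) will under-report, and the result should be treated as a lower bound:

```python
import tracemalloc


def peak_load_mb(load_fn):
    """Return (loaded_object, approximate peak MB allocated while load_fn ran)."""
    tracemalloc.start()
    obj = load_fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return obj, peak / (1024 * 1024)


if __name__ == "__main__":
    # Stand-in loader; replace with your real one, e.g. lambda: joblib.load("model.pkl").
    _, mb = peak_load_mb(lambda: bytearray(50 * 1024 * 1024))
    print(f"Peak allocation during load: {mb:.1f} MB")
```

Add headroom on top of the measured figure for request handling and framework overhead.
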
**ImagePullBackOff**

Kubernetes can't pull your Docker image.

**Solution:** Verify the base image exists and is accessible.

**CrashLoopBackOff**

Your endpoint starts but immediately crashes.

**Solution:** Check logs for startup errors (missing files, import failures).

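CrashLoopBackOff is frequently an import-time failure: the module crashes before uvicorn can start serving. A quick sketch to surface such errors before deploying; `predict` here matches the module in `uvicorn predict:app` and should be adjusted to your entrypoint:

```python
import importlib
import traceback


def check_import(module_name):
    """Try to import a module; return None on success, or the full traceback on failure."""
    try:
        importlib.import_module(module_name)
        return None
    except Exception:
        return traceback.format_exc()


if __name__ == "__main__":
    error = check_import("predict")  # adjust to your endpoint module
    print("Import OK" if error is None else error)
```
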
### Common deployment issues

**Syntax errors in Python code:**

* Test locally before deploying
* Check build logs for syntax errors during image creation

**Missing dependencies:**

* Verify all packages are in `requirements-deployment.txt`
* Pin versions to avoid surprises: `tensorflow==2.5.1`

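For example, a pinned `requirements-deployment.txt` might look like this (illustrative package versions; pin whatever your endpoint actually imports):

```
fastapi==0.110.0
uvicorn==0.29.0
scikit-learn==1.4.2
```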
**Model file not found:**

* Confirm the file path in `valohai.yaml` matches your code
* Check that you selected model files when creating the version

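Failing fast with an explicit message makes this mistake easy to spot in the logs. A sketch, assuming a hypothetical `model.pkl` path; the loader call is a stand-in for whatever you actually use (e.g. `joblib.load`):

```python
from pathlib import Path

MODEL_PATH = Path("model.pkl")  # hypothetical; must match the path in valohai.yaml


def load_model(path):
    """Raise a clear error at startup instead of a bare FileNotFoundError at request time."""
    if not path.exists():
        raise RuntimeError(
            f"Model file not found at {path.resolve()}. Check the file path in "
            "valohai.yaml and that the file was selected when creating the version."
        )
    return path.read_bytes()  # stand-in for your real loader
```
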
**Uvicorn not found:**

* Install it in `requirements-deployment.txt`
* Update `server-command` to use the installed path: `~/.local/bin/uvicorn`

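If your base image doesn't ship uvicorn on the `PATH`, the endpoint definition can point at the installed location explicitly. A hedged sketch; the `image` and `port` values are placeholders to adapt to your setup:

```yaml
- endpoint:
    name: predict
    image: python:3.9    # placeholder base image
    port: 8000           # must match the port uvicorn binds to
    server-command: ~/.local/bin/uvicorn predict:app --host 0.0.0.0 --port 8000
```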
### Deployment stuck in "Pending"

A new deployment version shows as "Pending" until it has been fully deployed and is ready to accept requests.

**What "Pending" means:** Valohai is building your Docker image, deploying to Kubernetes, or waiting for health checks to pass. This usually takes 2-5 minutes.

**If it's stuck for more than 10 minutes, check:**

1. **Endpoint logs** for runtime errors:
   * Python syntax errors
   * Missing dependencies
   * Model loading failures
2. **Cluster Status** for infrastructure issues:
   * OOMKilled (out of memory)
   * ImagePullBackOff (can't pull Docker image)
   * CrashLoopBackOff (endpoint crashes on startup)
3. **Build logs** (if available):
   * Dependency installation failures
   * Base image not found

**Common fixes:**

* Increase memory allocation if you see OOMKilled
* Verify all packages are in `requirements-deployment.txt`
* Confirm model files aren't too large for allocated resources
* Test your endpoint locally before deploying
