# Dynamic Inputs

Skip the wait. Start processing terabytes of data while it's still downloading.

Instead of downloading your entire dataset before execution starts, on-demand inputs let you stream files during runtime. Perfect for scenarios like processing a 500GB dataset where you only need to analyze the first 10% to determine next steps.

> **Prerequisites:** Understand [data concepts](/data.md) and [getting started with inputs](/migration-strategy/migrate-job-inputs.md) first.

## When to Use On-Demand Inputs

**Use on-demand inputs when:**

* Your datasets are hundreds of GBs or larger
* You need to start processing immediately without waiting for downloads
* You only need partial data from large input sets
* You want to stream data analysis in real-time

**Skip on-demand inputs for:**

* Small datasets (under 1GB) where download time isn't a concern
* Workflows that need all data available before starting

## Configuration

Mark inputs with `download: on-demand` to enable streaming:

```yaml
- step:
    name: stream-processing
    image: python:3.12
    command:
      - python ./process_stream.py
    inputs:
      # Download of this input will be delayed until the data is explicitly requested
      - name: large_dataset
        download: on-demand
        default: dataset://big-data/latest

      # This input will be downloaded before the execution starts
      - name: config
        default: datum://9122f6fc-2aff-5366-6abd-ffbd7461f1aa
```

Your execution starts immediately!

\
Inputs marked as `on-demand` are not yet downloaded, but the metadata describing them is included in the `/valohai/config/inputs.json` file (as well as `/valohai/config/inputs.yaml`). \\

These inputs (marker as `on-demand`) will be download only once (if ever) execution explicitly request them.

## Quick Option: valohai-utils

If you're using Python and want automatic download handling, valohai-utils provides a simple interface:

```python
import valohai

# Downloads all files in the input automatically
for filepath in valohai.inputs("large_dataset").paths():
    process_file(filepath)
```

**Limitations:**

* Downloads every file in the input (no selective downloading)
* No custom retry logic
* Requires Python environment

## Advanced option: Manual Download API

For maximum control over what gets downloaded and when, use the input request API directly.

> 💡 Use the manual API for production workflows with large datasets where you need selective downloading.

### 1. Get API Configuration

Read `/valohai/config/api.json` to find your execution's unique API endpoint:

```json
{
  "input_request": {
    "method": "POST",
    "url": "https://app.valohai.com/request-input-data/?execution={unique-execution-id}",
    "headers": {
      "Authorization": "Execution-Token {unique-execution-token}"
    }
  }
}
```

This token is valid only during your execution's lifetime.

### 2. Find Input IDs

Get input metadata from `/valohai/config/inputs.json`:

```json5
{
  "large_dataset": { // <- input name
    "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
    "files": [ // <- list of files this input resolves to
	{
	  "datum_id": "019a17bd-e035-e6a6-b76e-4f4ea313b46a",
	  "download_intent": "on-demand",
	  "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
	  "metadata": [ // <- custom properties, tags and dataset versions this datum belongs to
	    {
	      "index": 4,
	      "prefix": "linking",
	      "valohai.dataset-versions": ["dataset://big-data/latest"]
	    }
	  ],
       	  "name": "dogs/000201.jpg",
	  "path": "/valohai/inputs/large_dataset/dogs/000201.jpg",
	  "size": 1828388,
	  "storage_uri": "https://{store-url}/{bucket-name}/{path-in-bucket}?{aws-signing-specific-properties}",
	  "uri": "datum://019a17bd-e035-e6a6-b76e-4f4ea313b46a"
	},
    // ... <- one object of this kind for each file input resolves to
    ]
  }
}
```

### 3. Request Download URLs

Make a POST request to get pre-signed download URLs. Filter by input ID to get only what you need:

```shell
curl -X POST \
  "https://app.valohai.com/request-input-data/?execution={execution-id}&input={input-id}" \
  -H "Authorization: Execution-Token {unique-execution-token}"
```

**Response:**

```json
[
  {
    "name": "large_dataset",
    "files": [
      {
        "filename": "dogs/000201.jpg",
        "original_uri": "datum://019a17bd-e035-e6a6-b76e-4f4ea313b46a",
        "url": "https://{store-url}/{bucket-name}/{path-in-bucket}?{aws-signing-specific-properties}",
        "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
        "metadata": {
          "...": "..."
        },
        "download_intent": "on-demand"
      }
    ]
  }
]
```

> :bulb: Note that each file description in `/valohai/config/inputs.json` already has a signed URL assigned as `storage_uri` field. You can use this URI to download the file, but keep in mind that these URIs will expire after the set period (configured on the bucket/store level) and will have to be renewed.

### 4. Download Files Selectively

Use the `url` field to download only the files you need:

```python
import requests, json

def get_header():
    with open("/valohai/config/api.json", "r") as f:
        return json.load(f)["input_request"]["headers"]

def get_input_id(input_name: str):
    with open("/valohai/config/inputs.json", "r") as f:
        return json.load(f)[input_name]["input_id"]

def get_url():
    with open("/valohai/config/api.json", "r") as f:
        return json.load(f)["input_request"]["url"]

headers = get_header()
input_id = get_input_id("large_dataset")

url = f"{get_url()}&input={input_id}"

response = requests.post(url, headers=headers)
## TODO handle failure in post request
response_data = response.json()

# Download specific files based on your processing logic
for file in response_data[0]["files"]:
    if should_process_file(file["filename"]):
        download_file(file["url"], file["filename"])
```

You control exactly what gets downloaded and when.

> :bulb: Pre-signed URLs will eventually expire (exact duration depends on the store configuration). If per-file processing takes a lot of time, it could happen that URLs for still non-processed files will expire. If you start getting such errors when trying to download file, you will have to request pre-signed url again (by using the same URL).

### Authentication Methods

**Pre-signed URLs (Default):** Download URLs work immediately without additional authentication. Most secure and convenient for most use cases.

**Machine Roles:** For security-conscious environments, configure Valohai to use IAM Instance Roles, Service Accounts, or other machine-based authentication instead of pre-signed URLs. Contact support to configure this option.

> :warning: If you are expecting to have more than \~100k input files, **Machine Roles** will be required method of authentication.

### Error Handling

Implement your own retry logic for download failures:

```python
import time
import requests


def download_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, stream=True)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                raise e
            time.sleep(2**attempt)  # Exponential backoff
```

Network interruptions and temporary failures are common with large file downloads.

## File Lifecycle

Valohai will take care of all the data downloaded using it's internal mechanism, this includes:

* Keeping downloaded data in cache too speed up successive executions
* Removing all unused data
* Performing necessary clearing if machine is running out of disk space

Limitation of the manual approach of downloading inputs would be that you would have to do the above yourselves.

:warning: **Things worth keeping in mind with manual inputs download**:

* Take care of the available disk space and, if necessary, remove unused files
* Files you manually download will be located in the execution container. If you would like to persist them for the next execution, consider [mounting an external directory](/data/data-nfs.md) from the host machine or shared network storage.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.valohai.com/data/data-versioning/dynamic-inputs.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
