Dynamic Inputs

Configure and use on-demand inputs to stream large datasets during execution without waiting for full downloads.

Skip the wait. Start processing terabytes of data while it's still downloading.

Instead of downloading your entire dataset before execution starts, on-demand inputs let you stream files during runtime. Perfect for scenarios like processing a 500GB dataset where you only need to analyze the first 10% to determine next steps.

Prerequisites: familiarize yourself with data concepts and getting started with inputs first.

When to Use On-Demand Inputs

Use on-demand inputs when:

  • Your datasets are hundreds of GBs or larger

  • You need to start processing immediately without waiting for downloads

  • You only need partial data from large input sets

  • You want to stream data analysis in real-time

Skip on-demand inputs for:

  • Small datasets (under 1GB) where download time isn't a concern

  • Workflows that need all data available before starting

Configuration

Mark inputs with download: on-demand to enable streaming:

- step:
    name: stream-processing
    image: python:3.12
    command:
      - python ./process_stream.py
    inputs:
      # Download of this input will be delayed until the data is explicitly requested 
      - name: large_dataset
        download: on-demand
        default: dataset://big-data/latest
        
      # This input will be downloaded before the execution starts
      - name: config
        default: datum://9122f6fc-2aff-5366-6abd-ffbd7461f1aa        

Your execution starts immediately!

Inputs marked as on-demand are not downloaded up front, but the metadata describing them is included in the /valohai/config/inputs.json file (as well as /valohai/config/inputs.yaml).

These inputs are downloaded only if (and when) the execution explicitly requests them.
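For example, here is a minimal sketch (assuming the large_dataset input from the configuration above) that lists the files an on-demand input resolves to, without downloading anything:

import json

# Read the input metadata that Valohai writes for every execution
with open("/valohai/config/inputs.json") as f:
    inputs = json.load(f)

# Nothing is downloaded here; this only inspects the metadata
for file in inputs["large_dataset"]["files"]:
    print(file["name"], file["size"], file["download_intent"])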

Quick Option: valohai-utils

If you're using Python and want automatic download handling, valohai-utils provides a simple interface:

import valohai

# Downloads all files in the input automatically
for filepath in valohai.inputs("large_dataset").paths():
    process_file(filepath)

Limitations:

  • Downloads every file in the input (no selective downloading)

  • No custom retry logic

  • Requires Python environment

Advanced Option: Manual Download API

For maximum control over what gets downloaded and when, use the input request API directly.

💡 Use the manual API for production workflows with large datasets where you need selective downloading.

1. Get API Configuration

Read /valohai/config/api.json to find your execution's unique API endpoint:

{
  "input_request": {
    "method": "POST",
    "url": "https://app.valohai.com/request-input-data/?execution={unique-execution-id}",
    "headers": {
      "Authorization": "Execution-Token {unique-execution-token}"
    }
  }
}

This token is valid only during your execution's lifetime.

2. Find Input IDs

Get input metadata from /valohai/config/inputs.json:

{
  "large_dataset": { // <- input name
    "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
    "files": [ // <- list of files this input resolves to
      {
        "datum_id": "019a17bd-e035-e6a6-b76e-4f4ea313b46a",
        "download_intent": "on-demand",
        "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
        "metadata": [ // <- custom properties, tags, and dataset versions this datum belongs to
          {
            "index": 4,
            "prefix": "linking",
            "valohai.dataset-versions": ["dataset://big-data/latest"]
          }
        ],
        "name": "dogs/000201.jpg",
        "path": "/valohai/inputs/large_dataset/dogs/000201.jpg",
        "size": 1828388,
        "storage_uri": "https://{store-url}/{bucket-name}/{path-in-bucket}?{aws-signing-specific-properties}",
        "uri": "datum://019a17bd-e035-e6a6-b76e-4f4ea313b46a"
      }
      // ... <- one object like this for each file the input resolves to
    ]
  }
}

3. Request Download URLs

Make a POST request to get pre-signed download URLs. Filter by input ID to get only what you need:

curl -X POST \
  "https://app.valohai.com/request-input-data/?execution={execution-id}&input={input-id}" \
  -H "Authorization: Execution-Token {unique-execution-token}"

Response:

[
  {
    "name": "large_dataset",
    "files": [
      {
        "filename": "dogs/000201.jpg",
        "original_uri": "datum://019a17bd-e035-e6a6-b76e-4f4ea313b46a",
        "url": "https://{store-url}/{bucket-name}/{path-in-bucket}?{aws-signing-specific-properties}",
        "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
        "metadata": {"...": "..."},
        "download_intent": "on-demand"
      }
    ]
  }
]

💡 Note that each file description in /valohai/config/inputs.json already includes a pre-signed URL in its storage_uri field. You can use it to download the file directly, but keep in mind that these URLs expire after a set period (configured at the bucket/store level) and must then be renewed.
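As a minimal sketch (assuming the large_dataset input and file layout shown above), downloading a file straight from its storage_uri could look like this:

import json
import os

import requests

with open("/valohai/config/inputs.json") as f:
    inputs = json.load(f)

# Download the first file of the input directly via its pre-signed storage_uri
file = inputs["large_dataset"]["files"][0]
response = requests.get(file["storage_uri"], stream=True)
response.raise_for_status()  # an expired URL typically surfaces as a 4xx error here

# Write the file to the path where Valohai would normally place it
os.makedirs(os.path.dirname(file["path"]), exist_ok=True)
with open(file["path"], "wb") as out:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        out.write(chunk)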

4. Download Files Selectively

Use the url field to download only the files you need:

import json

import requests

def get_headers():
    with open("/valohai/config/api.json") as f:
        return json.load(f)["input_request"]["headers"]

def get_input_id(input_name: str):
    with open("/valohai/config/inputs.json") as f:
        return json.load(f)[input_name]["input_id"]

def get_url():
    with open("/valohai/config/api.json") as f:
        return json.load(f)["input_request"]["url"]

headers = get_headers()
input_id = get_input_id("large_dataset")

url = f"{get_url()}&input={input_id}"

response = requests.post(url, headers=headers)
response.raise_for_status()  # fail fast if the input request was rejected
response_data = response.json()

# Download specific files based on your processing logic;
# should_process_file and download_file are your own helpers
for file in response_data[0]["files"]:
    if should_process_file(file["filename"]):
        download_file(file["url"], file["filename"])

You control exactly what gets downloaded and when.

💡 Pre-signed URLs eventually expire (the exact duration depends on the store configuration). If per-file processing takes a long time, URLs for files you haven't processed yet may expire before you reach them. If downloads start failing for that reason, request fresh pre-signed URLs by repeating the same POST request.
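A minimal sketch of that renewal pattern, reusing the url and headers built in the example above (save_file is a hypothetical helper that writes the response stream to disk):

import requests

def request_fresh_urls():
    # Repeating the same POST request returns newly signed URLs
    response = requests.post(url, headers=headers)
    response.raise_for_status()
    return {f["filename"]: f["url"] for f in response.json()[0]["files"]}

urls = request_fresh_urls()
for filename in list(urls):
    download = requests.get(urls[filename], stream=True)
    if download.status_code in (400, 403):  # the pre-signed URL has likely expired
        urls = request_fresh_urls()  # renew all URLs and retry once
        download = requests.get(urls[filename], stream=True)
    download.raise_for_status()
    save_file(download, filename)  # hypothetical: stream the body to disk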

Authentication Methods

Pre-signed URLs (Default): Download URLs work immediately without additional authentication. Convenient and secure for most use cases.

Machine Roles: For security-conscious environments, configure Valohai to use IAM Instance Roles, Service Accounts, or other machine-based authentication instead of pre-signed URLs. Contact support to configure this option.

⚠️ If you expect to have more than ~100k input files, Machine Roles are the required authentication method.

Error Handling

Implement your own retry logic for download failures:

import time
import requests

def download_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, stream=True)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                raise e
            time.sleep(2 ** attempt)  # Exponential backoff

Network interruptions and temporary failures are common with large file downloads.
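For example, a short usage sketch that combines the retry helper with chunked writes to disk (the URL placeholders and destination path are illustrative):

response = download_with_retry("https://{store-url}/{bucket-name}/{path-in-bucket}")
with open("/valohai/inputs/large_dataset/dogs/000201.jpg", "wb") as out:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        out.write(chunk)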

File Lifecycle

Valohai takes care of all data downloaded through its internal mechanisms. This includes:

  • Keeping downloaded data in a cache to speed up successive executions

  • Removing all unused data

  • Clearing space when the machine is running low on disk

The limitation of the manual download approach is that you have to handle all of the above yourself.

⚠️ Things worth keeping in mind with manual input downloads:

  • Keep an eye on available disk space and, if necessary, remove files you no longer need (see the sketch below)

  • Files you download manually live inside the execution container. If you want to persist them for the next execution, consider mounting an external directory from the host machine or shared network storage.
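A minimal sketch of that kind of housekeeping, where downloaded_files and process_file stand in for your own bookkeeping and processing logic, and the 5 GB threshold is illustrative:

import os
import shutil

def free_space_gb(path="/valohai/inputs"):
    # Remaining disk space on the volume that holds the downloaded inputs
    return shutil.disk_usage(path).free / 1024**3

for filepath in downloaded_files:
    process_file(filepath)
    os.remove(filepath)  # delete each file as soon as it has been processed
    if free_space_gb() < 5:
        print("Less than 5 GB free; clean up before downloading more files")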
