Dynamic Inputs
Configure and use on-demand inputs to stream large datasets during execution without waiting for full downloads.
Skip the wait. Start processing terabytes of data while it's still downloading.
Instead of downloading your entire dataset before execution starts, on-demand inputs let you stream files during runtime. Perfect for scenarios like processing a 500GB dataset where you only need to analyze the first 10% to determine next steps.
Prerequisites: Understand data concepts and how to get started with inputs first.
When to Use On-Demand Inputs
Use on-demand inputs when:
Your datasets are hundreds of GBs or larger
You need to start processing immediately without waiting for downloads
You only need partial data from large input sets
You want to stream data analysis in real-time
Skip on-demand inputs for:
Small datasets (under 1GB) where download time isn't a concern
Workflows that need all data available before starting
Configuration
Mark inputs with download: on-demand to enable streaming:
- step:
    name: stream-processing
    image: python:3.12
    command:
      - python ./process_stream.py
    inputs:
      # Download of this input will be delayed until the data is explicitly requested
      - name: large_dataset
        download: on-demand
        default: dataset://big-data/latest
      # This input will be downloaded before the execution starts
      - name: config
        default: datum://9122f6fc-2aff-5366-6abd-ffbd7461f1aa
Your execution starts immediately!
Inputs marked as on-demand are not downloaded up front, but the metadata describing them is included in the /valohai/config/inputs.json file (as well as /valohai/config/inputs.yaml).
These on-demand inputs are downloaded only once (if ever) the execution explicitly requests them.
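For instance, a minimal sketch that reads this metadata to see which files are still pending an explicit request (the field names follow the inputs.json example in the manual API section below):
import json

# Inspect input metadata without triggering any downloads
with open("/valohai/config/inputs.json") as f:
    inputs = json.load(f)

for name, info in inputs.items():
    for file in info["files"]:
        # download_intent is "on-demand" for files that are not fetched up front
        print(name, file["name"], file.get("download_intent"))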
Quick Option: valohai-utils
If you're using Python and want automatic download handling, valohai-utils provides a simple interface:
import valohai

# Downloads all files in the input automatically
for filepath in valohai.inputs("large_dataset").paths():
    process_file(filepath)
Limitations:
Downloads every file in the input (no selective downloading)
No custom retry logic
Requires Python environment
Advanced Option: Manual Download API
For maximum control over what gets downloaded and when, use the input request API directly.
💡 Use the manual API for production workflows with large datasets where you need selective downloading.
1. Get API Configuration
Read /valohai/config/api.json to find your execution's unique API endpoint:
{
  "input_request": {
    "method": "POST",
    "url": "https://app.valohai.com/request-input-data/?execution={unique-execution-id}",
    "headers": {
      "Authorization": "Execution-Token {unique-execution-token}"
    }
  }
}
This token is valid only during your execution's lifetime.
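Inside the execution you can pick these values up with a few lines of Python; a minimal sketch using only the fields shown above:
import json

# Load the per-execution input request endpoint and auth header
with open("/valohai/config/api.json") as f:
    input_request = json.load(f)["input_request"]

url = input_request["url"]          # already includes ?execution={id}
headers = input_request["headers"]  # carries the Execution-Token authorization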
2. Find Input IDs
Get input metadata from /valohai/config/inputs.json:
{
  "large_dataset": {                  // <- input name
    "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
    "files": [                        // <- list of files this input resolves to
      {
        "datum_id": "019a17bd-e035-e6a6-b76e-4f4ea313b46a",
        "download_intent": "on-demand",
        "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
        "metadata": [                 // <- custom properties, tags and dataset versions this datum belongs to
          {
            "index": 4,
            "prefix": "linking",
            "valohai.dataset-versions": ["dataset://big-data/latest"]
          }
        ],
        "name": "dogs/000201.jpg",
        "path": "/valohai/inputs/large_dataset/dogs/000201.jpg",
        "size": 1828388,
        "storage_uri": "https://{store-url}/{bucket-name}/{path-in-bucket}?{aws-signing-specific-properties}",
        "uri": "datum://019a17bd-e035-e6a6-b76e-4f4ea313b46a"
      }
      // ... <- one object of this kind for each file the input resolves to
    ]
  }
}
3. Request Download URLs
Make a POST request to get pre-signed download URLs. Filter by input ID to get only what you need:
curl -X POST \
  "https://app.valohai.com/request-input-data/?execution={execution-id}&input={input-id}" \
  -H "Authorization: Execution-Token {unique-execution-token}"
Response:
[
  {
    "name": "large_dataset",
    "files": [
      {
        "filename": "dogs/000201.jpg",
        "original_uri": "datum://019a17bd-e035-e6a6-b76e-4f4ea313b46a",
        "url": "https://{store-url}/{bucket-name}/{path-in-bucket}?{aws-signing-specific-properties}",
        "input_id": "5c83f6fb-1cb2-4fc3-bda6-2cbd7461f16f",
        "metadata": {"...": "..."},
        "download_intent": "on-demand"
      }
    ]
  }
]
💡 Note that each file description in /valohai/config/inputs.json already has a signed URL assigned as the storage_uri field. You can use this URL to download the file, but keep in mind that these URLs expire after a set period (configured at the bucket/store level) and then have to be renewed.
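In other words, while the signature is still fresh you can skip the request endpoint entirely; a minimal sketch that downloads the first file of the large_dataset input straight from its storage_uri:
import json
import requests

with open("/valohai/config/inputs.json") as f:
    file_info = json.load(f)["large_dataset"]["files"][0]

# Fetch directly from the signed URL, as long as it has not expired yet
response = requests.get(file_info["storage_uri"])
response.raise_for_status()
data = response.content  # raw file bytes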
4. Download Files Selectively
Use the url field to download only the files you need:
import json

import requests

def get_headers():
    with open("/valohai/config/api.json", "r") as f:
        return json.load(f)["input_request"]["headers"]

def get_input_id(input_name: str):
    with open("/valohai/config/inputs.json", "r") as f:
        return json.load(f)[input_name]["input_id"]

def get_url():
    with open("/valohai/config/api.json", "r") as f:
        return json.load(f)["input_request"]["url"]

headers = get_headers()
input_id = get_input_id("large_dataset")
url = f"{get_url()}&input={input_id}"

response = requests.post(url, headers=headers)
response.raise_for_status()  # fail early if the input request was rejected
response_data = response.json()

# Download specific files based on your processing logic
for file in response_data[0]["files"]:
    if should_process_file(file["filename"]):
        download_file(file["url"], file["filename"])
You control exactly what gets downloaded and when.
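should_process_file and download_file above are placeholders for your own logic. A minimal download_file sketch, streaming the response into the conventional input directory (the destination directory is an assumption; adjust it to your layout):
import os
import requests

def download_file(url: str, filename: str, dest_dir: str = "/valohai/inputs/large_dataset"):
    """Stream one file from its pre-signed URL to disk."""
    target = os.path.join(dest_dir, filename)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(target, "wb") as out:
            for chunk in response.iter_content(chunk_size=1 << 20):
                out.write(chunk)
    return target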
💡 Pre-signed URLs eventually expire (the exact duration depends on the store configuration). If per-file processing takes a long time, the URLs for files you have not yet processed may expire before you reach them. If you start getting such errors when downloading a file, request fresh pre-signed URLs again using the same endpoint.
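One way to handle expiry, sketched under the assumption that an expired signature surfaces as an HTTP 403: re-request the URLs for the input and retry the download once.
import requests

def fetch_fresh_urls(request_url: str, headers: dict) -> dict:
    """Re-request pre-signed URLs; returns a filename -> url mapping."""
    response = requests.post(request_url, headers=headers)
    response.raise_for_status()
    return {f["filename"]: f["url"] for f in response.json()[0]["files"]}

def get_with_renewal(url: str, filename: str, request_url: str, headers: dict):
    response = requests.get(url)
    if response.status_code == 403:  # assumed: expired signature; renew and retry once
        url = fetch_fresh_urls(request_url, headers)[filename]
        response = requests.get(url)
    response.raise_for_status()
    return response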
Authentication Methods
Pre-signed URLs (Default): Download URLs work immediately without additional authentication. Secure and convenient for most use cases.
Machine Roles: For security-conscious environments, configure Valohai to use IAM Instance Roles, Service Accounts, or other machine-based authentication instead of pre-signed URLs. Contact support to configure this option.
⚠️ If you expect to have more than ~100k input files, Machine Roles are the required authentication method.
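If your environment is configured for machine roles on AWS, downloads go through the cloud SDK instead of pre-signed URLs. A minimal, illustrative sketch with boto3 (the bucket and key are assumptions; derive them from your file's storage location):
import boto3

# With an IAM instance role, boto3 picks up credentials automatically
s3 = boto3.client("s3")

# Illustrative bucket/key; in practice derive these from the file's storage location
s3.download_file(
    "my-data-bucket",
    "dogs/000201.jpg",
    "/valohai/inputs/large_dataset/dogs/000201.jpg",
)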
Error Handling
Implement your own retry logic for download failures:
import time
import requests
def download_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, stream=True)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                raise e
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
Network interruptions and temporary failures are common with large file downloads.
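For example, combined with the selective download loop from step 4 (the URL and path here are illustrative):
# Stream the retried response to disk in chunks
response = download_with_retry("https://{store-url}/{bucket-name}/{path-in-bucket}?...")
with open("/valohai/inputs/large_dataset/dogs/000201.jpg", "wb") as out:
    for chunk in response.iter_content(chunk_size=1 << 20):
        out.write(chunk)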
File Lifecycle
Valohai takes care of all the data it downloads through its internal mechanism. This includes:
Keeping downloaded data in a cache to speed up successive executions
Removing unused data
Clearing space when the machine is running out of disk
The limitation of the manual download approach is that you have to do all of the above yourself.
⚠️ Things worth keeping in mind with manual input downloads:
Keep an eye on the available disk space and, if necessary, remove unused files (see the sketch after this list)
Files you download manually live inside the execution container. If you want to persist them for the next execution, consider mounting an external directory from the host machine or shared network storage.
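For the disk space concern, a minimal sketch using only the standard library (the threshold and the removed file are illustrative):
import os
import shutil

def free_gb(path: str = "/valohai/inputs") -> float:
    """Free space on the volume holding the given path, in gigabytes."""
    return shutil.disk_usage(path).free / 1024**3

# Before the next download, make room if needed
if free_gb() < 10:  # illustrative threshold
    os.remove("/valohai/inputs/large_dataset/dogs/000201.jpg")  # a file you no longer need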