Load Files in Jobs
Gather the necessary files from various stores without writing a custom download implementation for each.
Valohai inputs are data files, or collections of files, that are accessible during an execution and fetched from your cloud storage or public sources (HTTP/HTTPS).
Supported sources include AWS S3, Azure Storage, GCP Cloud Storage, Oracle buckets, on-prem S3, or any public links.
Valohai simplifies data handling:
- Manages authentication with your cloud storage.
- Handles downloading, unpacking, and caching.
- Eliminates the need to manage keys or authentication in your code.
Define an input
In your valohai.yaml, each step can have one or more inputs, each resolving to one or more files.
You can set default values for inputs and override them whenever you create an execution. For instance, you can change the set of images for batch inference.
```yaml
- step:
    name: image_processor
    environment: ec2-instance
    image: python:3.10
    command:
      - python process_image.py
    inputs:
      ## Results in one file
      ## e.g. /valohai/inputs/image/<image-name>
      - name: image
        default: datum://01234567-89ab-cdef-0123-456789abcdef
      ## Results in a group of files with the .jpg extension
      ## e.g.
      ##   /valohai/inputs/image_set/image_1.jpg
      ##   /valohai/inputs/image_set/image_2.jpg
      ##   ...
      - name: image_set
        default: s3://mybucket/unprocessed/*.jpg
      ## As is, will result in no files being downloaded.
      ## Later on you can override this input and make it resolve to one or more files.
      - name: additional_data
        optional: true
```
Select single file
To select a single file as an input, you can use:
Object store URL
Amazon S3:
s3://{bucket}/{key}
Azure Blob Storage:
azure://{account_name}/{container_name}/{blob_name}
Google Cloud Storage:
gs://{bucket}/{key}
OpenStack Swift:
swift://{project}/{container}/{key}
Datum URI
datum://01234567-89ab-cdef-0123-456789abcdef
Any public http/https URL
https://somewebsite.com/some_image
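As a quick illustration of the schemes listed above, the following standard-library sketch checks whether a URI uses one of the supported single-file sources. The helper name and the validation logic are hypothetical, not part of any Valohai library:

```python
from urllib.parse import urlparse

# Schemes Valohai accepts for single-file inputs, per the list above.
SUPPORTED_SCHEMES = {"s3", "azure", "gs", "swift", "datum", "http", "https"}

def is_supported_input_uri(uri: str) -> bool:
    """Return True if the URI uses one of the supported input schemes.

    Illustrative helper only, not part of the Valohai API.
    """
    parsed = urlparse(uri)
    return parsed.scheme in SUPPORTED_SCHEMES and bool(parsed.netloc)

print(is_supported_input_uri("s3://mybucket/unprocessed/image.jpg"))  # True
print(is_supported_input_uri("ftp://example.com/file.txt"))           # False
```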
Select collection of files
Wildcards with object store URLs
- s3://my-bucket/dataset/images/*.jpg for all .jpg (JPEG) files
- s3://my-bucket/dataset/image-sets/**.jpg for recursing subdirectories for all .jpg (JPEG) files

For example, the expression above will match:
s3://my-bucket/dataset/image-sets/cats/big/cat-1.jpg
s3://my-bucket/dataset/image-sets/cats/small/cat-9.jpg
s3://my-bucket/dataset/image-sets/dog/dog-5.jpg
...
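The difference between * and ** can be modeled with a small regex translation: * stays within one directory level, while ** crosses directory separators. This is a simplified sketch of the matching rules, not Valohai's actual matcher:

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a storage wildcard into a regex.

    Simplified model: '**' matches across '/' separators, '*' does not.
    Illustrative only, not Valohai's implementation.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i : i + 2] == "**":
            out.append(".*")      # recurse into subdirectories
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # stay within one directory level
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

flat = wildcard_to_regex("s3://my-bucket/dataset/images/*.jpg")
deep = wildcard_to_regex("s3://my-bucket/dataset/image-sets/**.jpg")

print(bool(flat.match("s3://my-bucket/dataset/images/cat-1.jpg")))               # True
print(bool(flat.match("s3://my-bucket/dataset/images/cats/cat-1.jpg")))          # False
print(bool(deep.match("s3://my-bucket/dataset/image-sets/cats/big/cat-1.jpg")))  # True
```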
💡 Parameter interpolation
You can also interpolate execution parameters into input URIs:
s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg would replace {parameter:user-id} with the value of the parameter user-id during an execution.
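The substitution described above can be modeled in a few lines. This regex-based sketch is illustrative, not Valohai's internal interpolation code:

```python
import re

def interpolate_parameters(uri: str, parameters: dict) -> str:
    """Replace {parameter:name} tokens in a URI with execution parameter values.

    Illustrative sketch of the interpolation behavior, not Valohai's code.
    """
    return re.sub(
        r"\{parameter:([^}]+)\}",
        lambda m: str(parameters[m.group(1)]),
        uri,
    )

uri = "s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg"
print(interpolate_parameters(uri, {"user-id": "42"}))
# s3://my-bucket/dataset/images/42/*.jpeg
```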
Datasets
Since datasets are just pointers to collections of files, using a dataset version URL as an input will result in every file included in that version being downloaded.
An example of a dataset version URL: dataset://boats/semi-processed
💡 Latest version
You can always use latest as a version, for any dataset, to point to the latest dataset version, e.g.:
dataset://boats/latest
Datum queries
Using the UI you can select the files to be downloaded by executing a simple query on datum properties.

The query defined in the example above will match all datums that:
- have the index property greater than 10, and
- have the processed_date property assigned (regardless of its value), and
- have the type property equal to "dog".

The Limit field determines the maximum number of files that will be returned by the query (it is optional).
⚠️ Not reproducible
Be aware that the results of these queries are calculated each time you create an execution. If new files with matching properties are added between executions, the query results will differ: the more recent execution will include the newly added files as well.
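The query semantics can be sketched with plain Python over a list of datum property dicts. The property names come from the example above; the filtering code itself is illustrative, not the Valohai query engine:

```python
def run_datum_query(datums, limit=None):
    """Match datums with index > 10, a processed_date set, and type == 'dog'.

    Mirrors the example query above; illustrative only.
    """
    matches = [
        d for d in datums
        if d.get("index", 0) > 10
        and d.get("processed_date") is not None
        and d.get("type") == "dog"
    ]
    return matches[:limit] if limit is not None else matches

datums = [
    {"index": 12, "processed_date": "2024-01-01", "type": "dog"},  # matches
    {"index": 5,  "processed_date": "2024-01-01", "type": "dog"},  # index too low
    {"index": 20, "type": "dog"},                                  # no processed_date
    {"index": 30, "processed_date": "2024-02-01", "type": "cat"},  # wrong type
]
print(len(run_datum_query(datums)))  # 1
```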
Directory structure
Without any additional configuration, the examples above will result in all files being downloaded into a single directory under /valohai/inputs/<input-name>/. Of the original path (where a file is stored in the store, e.g. s3://bucket-name/dir/subdir/file.jpg), only the file name (basename) is kept.
To preserve the directory structure found in the store, use the keep-directories input property.
keep-directories can take the following values:
- none: (default) all files are downloaded to /valohai/inputs/<input-name>/
- full: keeps the full path from the storage root. For example, s3://special-bucket/foo/bar/**.jpg will end up as /valohai/inputs/<input-name>/foo/bar/dataset1/a.jpg
- suffix: keeps the suffix from the "wildcard root". For example, with s3://special-bucket/foo/bar/*, the special-bucket/foo/bar/ part would be removed, but any relative path after it would be kept, and you might end up with /valohai/inputs/<input-name>/dataset1/a.jpg
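The three modes can be illustrated with a small path-mapping sketch. The helper is hypothetical (not Valohai's downloader), using the bucket paths from the examples above:

```python
import posixpath

def local_path(input_name, storage_key, mode, wildcard_root=""):
    """Compute where a file lands under /valohai/inputs/<input-name>/.

    storage_key: object key inside the bucket, e.g. "foo/bar/dataset1/a.jpg"
    wildcard_root: the fixed prefix before the wildcard, e.g. "foo/bar/"
    Illustrative sketch of keep-directories, not Valohai's implementation.
    """
    base = f"/valohai/inputs/{input_name}"
    if mode == "none":
        return posixpath.join(base, posixpath.basename(storage_key))
    if mode == "full":
        return posixpath.join(base, storage_key)
    if mode == "suffix":
        return posixpath.join(base, storage_key[len(wildcard_root):])
    raise ValueError(f"unknown mode: {mode}")

key = "foo/bar/dataset1/a.jpg"
print(local_path("myinput", key, "none"))                # /valohai/inputs/myinput/a.jpg
print(local_path("myinput", key, "full"))                # /valohai/inputs/myinput/foo/bar/dataset1/a.jpg
print(local_path("myinput", key, "suffix", "foo/bar/"))  # /valohai/inputs/myinput/dataset1/a.jpg
```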
Example:
```yaml
- step:
    name: image_processor
    environment: ec2-instance
    image: python:3.10
    command:
      - python process_image.py
    inputs:
      - name: image_set
        keep-directories: suffix
        default: s3://mybucket/unprocessed/*
```
Access files
Download location
All requested data is downloaded to the local directory /valohai/inputs on the machine used for the execution. Whether that is a cloud instance or a physical on-prem machine, the behavior is the same.
Each input has its own directory structure within this location. From the example above, three directories will be created:
- /valohai/inputs/image/
- /valohai/inputs/image_set/
- /valohai/inputs/additional_data/ (only if a value is provided when the execution is created; otherwise the directory is not created, as the input is marked optional)
Using valohai-utils
The valohai-utils library provides a convenient way of traversing the downloaded files just by specifying the input name, e.g.:
```python
import valohai

input_name = "training_data"
for file_path in valohai.inputs(input_name).paths():
    with open(file_path, "r") as f:
        content = f.read()
        # Do some processing
```
Using system environment variables
To avoid the use of additional libraries, the VH_INPUTS_DIR environment variable is provided in every execution environment, pointing to /valohai/inputs. You can then use this path like so:
```python
import os

input_name = "training_data"
inputs_dir = os.getenv("VH_INPUTS_DIR", default="./inputs")
inputs_path = os.path.join(inputs_dir, input_name)
for file_name in os.listdir(inputs_path):
    # os.listdir returns bare file names, so join them back to the input path
    file_path = os.path.join(inputs_path, file_name)
    with open(file_path, "r") as f:
        content = f.read()
        # Do some processing
```
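Note that os.listdir only sees the top level of the input directory. When keep-directories preserves subdirectories, a recursive walk handles both flat and nested layouts; this sketch uses the same VH_INPUTS_DIR convention as above (the helper itself is hypothetical):

```python
import os

def iter_input_files(input_name):
    """Yield full paths of all files under an input directory, recursively.

    Illustrative helper using the VH_INPUTS_DIR convention described above.
    """
    inputs_dir = os.getenv("VH_INPUTS_DIR", default="./inputs")
    root = os.path.join(inputs_dir, input_name)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

for file_path in iter_input_files("training_data"):
    with open(file_path, "r") as f:
        content = f.read()
        # Do some processing
```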