Load Files in Jobs

Gather the files you need from various stores without writing a custom download implementation for each one.

Valohai inputs are data files, or collections of files, that are accessible during an execution and fetched from your cloud storage or public sources (HTTP/HTTPS).

Supported sources include AWS S3, Azure Blob Storage, Google Cloud Storage, Oracle Cloud buckets, on-premises S3-compatible storage, and any public link.

Valohai simplifies data handling:

  • Manages authentication with your cloud storage.

  • Handles downloading, unpacking and caching.

  • Eliminates the need to manage keys or authentication in your code.

Define an input

In your valohai.yaml, each step can have one or more inputs, and each input resolves to one or more files.

You can set default values for inputs, which can be overridden every time you create an execution. For instance, you can change the set of images for batch inference.

- step: 
    name: image_processor
    environment: ec2-instance
    image: python:3.10
    command:
      - python process_image.py
    inputs: 
      ## Results in one file
      ## e.g. /valohai/inputs/image/<image-name>
      - name: image
        default: datum://01234567-89ab-cdef-0123-456789abcdef
        
      ## Results in a group of files with .jpg extension 
      ## e.g.
      ## /valohai/inputs/image_set/image_1.jpg
      ## /valohai/inputs/image_set/image_2.jpg
      ## ...
      - name: image_set
        default: s3://mybucket/unprocessed/*.jpg
        
      ## As is, this will result in no files being downloaded
      ## Later on, you can override this input so it resolves to one or more files
      - name: additional_data
        optional: true
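
At runtime, each input's files appear under /valohai/inputs/<input-name>/ (see Access files below). As a minimal sketch, assuming the step above, process_image.py could locate its inputs with nothing but the standard library (the processing itself is a placeholder):

import os

## Single file: the only file in the input's directory
image_dir = "/valohai/inputs/image"
image_path = os.path.join(image_dir, os.listdir(image_dir)[0])

## Collection: every downloaded .jpg file
set_dir = "/valohai/inputs/image_set"
for file_name in os.listdir(set_dir):
    print("Processing", os.path.join(set_dir, file_name))  ## placeholder for real processing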

Select a single file

To select a single file as an input, you can use:

  • Object store URL

    • Amazon S3: s3://{bucket}/{key}

    • Azure Blob Storage: azure://{account_name}/{container_name}/{blob_name}

    • Google Storage: gs://{bucket}/{key}

    • OpenStack Swift: swift://{project}/{container}/{key}

  • Datum URI

    • datum://01234567-89ab-cdef-0123-456789abcdef

  • Any public http/https URL

    • https://somewebsite.com/some_image

Select a collection of files

Wildcards with object store URLs

  • s3://my-bucket/dataset/images/*.jpg for all .jpg (JPEG) files

  • s3://my-bucket/dataset/image-sets/**.jpg for recursing into subdirectories to match all .jpg (JPEG) files

    For example, the expression above will match:

    • s3://my-bucket/dataset/image-sets/cats/big/cat-1.jpg

    • s3://my-bucket/dataset/image-sets/cats/small/cat-9.jpg

    • s3://my-bucket/dataset/image-sets/dog/dog-5.jpg

    • ...

💡 Parameter interpolation

You can also interpolate execution parameters into input URIs:

s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg would replace {parameter:user-id} with the value of the parameter user-id during an execution.

Datasets

Since datasets are essentially pointers to collections of files, using a dataset version URL as an input will result in every file included in that version being downloaded.

An example of a dataset version URL: dataset://boats/semi-processed

💡 Latest version

You can always use latest as the version, for any dataset, to point to its most recent version.

e.g. dataset://boats/latest

Datum queries

Using the UI, you can select the files to be downloaded by running a simple query on datum properties. For example, a query could match all datums that have an index property greater than 10, have a processed_date property assigned (regardless of its value), and have a type property equal to "dog".

An optional limit field determines the maximum number of files returned by the query.

⚠️ Non reproducible

Be aware that the results of these queries are recalculated each time you create an execution. If new files with matching properties are added between executions, the results will not be the same: the more recent execution will include the newly added files as well!

Directory structure

Without any additional configuration, the examples above will result in all files being downloaded into a single directory under /valohai/inputs/<input-name>/. Only the file name (basename) is kept from each file's original path in the store (e.g. s3://bucket-name/dir/subdir/file.jpg becomes file.jpg).

To preserve the directory structure found in the store, you can use the keep-directories input property.

keep-directories accepts the following values:

  • none: (default) all files are downloaded to /valohai/inputs/<input-name>/

  • full: keeps the full path from the storage root. For example, s3://special-bucket/foo/bar/**.jpg will end up as /valohai/inputs/<input-name>/foo/bar/dataset1/a.jpg

  • suffix: keeps the suffix from the “wildcard root”. For example, with s3://special-bucket/foo/bar/*, the special-bucket/foo/bar/ prefix is removed, but any relative path after it is kept, so you might end up with /valohai/inputs/<input-name>/dataset1/a.jpg

Example:

- step: 
    name: image_processor
    environment: ec2-instance
    image: python:3.10
    command:
      - python process_image.py
    inputs: 
      - name: image_set
        keep-directories: suffix 
        default: s3://mybucket/unprocessed/*
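
With keep-directories: suffix in effect, the files under /valohai/inputs/image_set/ retain their relative paths, so a recursive traversal fits better than listing a flat directory. A minimal sketch using only the standard library (the nested directory names are illustrative):

import os

input_dir = "/valohai/inputs/image_set"

## Walk the preserved directory tree, e.g. dataset1/a.jpg, dataset2/b.jpg, ...
for root, _dirs, files in os.walk(input_dir):
    for file_name in files:
        file_path = os.path.join(root, file_name)
        print("Found", file_path)  ## placeholder for real processing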

Access files

Download location

All requested data is downloaded to the local directory /valohai/inputs on the machine running the execution. Whether that is a cloud instance or a physical on-premises machine, the behavior is the same.

Each input gets its own directory within this location. From the first example above, up to three directories will be created:

  • /valohai/inputs/image/

  • /valohai/inputs/image_set/

  • /valohai/inputs/additional_data/ only if a value is provided when the execution is created; otherwise the directory is not created, since the input is marked as optional (see the sketch below)
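
Because an optional input may resolve to no files at all, it is safest to check that the input's directory exists before reading from it. A minimal sketch, using the additional_data input from the first example:

import os

extra_dir = "/valohai/inputs/additional_data"

## The directory exists only if a value was provided for the optional input
if os.path.isdir(extra_dir):
    print("Additional data:", os.listdir(extra_dir))
else:
    print("No additional data provided")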

Using valohai-utils

The valohai-utils library provides a convenient way to traverse the downloaded files by specifying just the input name, e.g.:

import valohai

input_name = "training_data"

for file_path in valohai.inputs(input_name).paths():
    with open(file_path, "r") as f: 
        content = f.read()
        ## Do some processing
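
For a single-file input, valohai-utils also offers a shortcut that returns the path of the first (and in that case only) downloaded file, e.g. for the image input from the first example:

import valohai

## Path of the single file of the "image" input
image_path = valohai.inputs("image").path()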

Using system environment variables

To avoid additional libraries, the VH_INPUTS_DIR environment variable is provided in every execution environment, pointing to /valohai/inputs. You can then use this path as follows:

import os

input_name = "training_data"
inputs_dir = os.getenv("VH_INPUTS_DIR", default="./inputs")

inputs_path = os.path.join(inputs_dir, input_name)
for file_name in os.listdir(inputs_path):
    file_path = os.path.join(inputs_path, file_name)
    with open(file_path, "r") as f:
        content = f.read()
        ## Do some processing
