To get some data for your step, you need inputs. Inputs are basically the required files that are expected to be there when the code is executed.
With valohai-utils, you define the inputs in the call to the prepare method. Feed the default_inputs argument with a key/value dictionary of the inputs.
- key is the name of the input in the configuration YAML and in the Valohai UI.
- value defines the default uri of the input and also the type.
Defining inputs with valohai-utils solves these problems:
- Handling different input variations (single file, multiple files, archived files)
- Uncompressing archived files on-demand
- Filtering a subset of the input
- Downloading the inputs for local runs
- Parsing the command-line overrides
- Managing the duplicate definitions between Python & YAML
train.py
Here we define a step train, with an input training-images.
import valohai
inputs = {
"training-images": "https://example.com/images.zip",
}
valohai.prepare(
step="train",
default_inputs=inputs,
)
This key/value pair…
inputs = {
"training-images": "https://example.com/images.zip",
}
…will be transpiled into the following YAML
inputs:
- name: training-images
default: https://example.com/images.zip
optional: false
Empty default value signals an optional input.
python
inputs = {
"extra-images": "",
}
yaml
inputs:
- name: extra-images
optional: true
Accessing input files
Once you have defined an input using the prepare method, you can access the files by referring to the input name.
In Valohai, an input is not always A single file. It can be multiple URIs. And it doesn’t end there. Each of those URIs may actually represent multiple files on multiple folders. And some of those files may actually be zip archives with multiple files and folders in them!
In other words, handling a Valohai input robustly is not as simple as it sounds. Luckily valohai-utils handles most of this complexity for you.
Use the .path()
, .paths()
, .stream()
, .streams()
methods to access files of a single input.
Single file
If you’re expecting a single file in your inputs, you can simply use .path()
.
import json
import valohai
inputs = {
"my-config": "",
}
valohai.prepare(
step="train",
default_inputs=inputs,
)
with open(valohai.inputs("my-config").path()) as f:
data = json.load(f)
Alternatively, you can also use .stream()
data = json.load(valohai.inputs("my-config").stream())
Even when you are always expecting a single file, your colleagues might still accidentally feed your input with several files!
In that case, .path()
or .stream()
returns the first file it encounters, which can be brittle.
To be more explicit about the input, you can do this:
with open(valohai.inputs("my-config").path("*.json")) as f:
data = json.load(f)
Or to be fully explicit
with open(valohai.inputs("my-config").path("config.json")) as f:
data = json.load(f)
Multiple files
When handling an input with multiple files, you want to use .paths() or .streams()
import valohai
inputs = {
"images": "https://example.com/images.zip",
}
valohai.prepare(
step="train",
default_inputs=inputs,
)
for image_path in valohai.inputs("images").paths():
# Do something per image
The beauty of .paths()
or .streams()
is that the code above will handle all of these different input scenarios:
- Single
my-image.jpg
- Multiple images
my-image1.jpg
,my-image2.jpg
,myimage-3.jpg
my-images.zip
containing multiple images- Multiple archives
my-images1.zip
,my-images2.zip
,my-images3.zip
- A hybrid mix of all the above
There is no longer a need to write a separate handler for each scenario, as valohai-utils
is taking care of everything. All you need to do is iterate over the paths of an input.
Archives
Archive files are automatically uncompressed under the hood when you are using .paths()
and friends. Currently, supported archive types are tar
and zip
.
It is worth pointing out that the archives are not prematurely uncompressed to the disk.
The library is smart and uncompresses files on-demand. When you iterate over the contents of a huge archive, each file is uncompressed one by one and the potential errors are raised immediately.
Sometimes you might want to specifically handle or uncompress the archives yourself, though.
In that case, you can set the process_archives=false
which signals valohai-utils
to not automatically uncompress the contents of archives, but return paths to the actual archive files instead.
for image_path in valohai.inputs("zipped_images").paths():
print(image_path) # image1.jpg, image2.jpg, image3.jpeg...
for archive_path in valohai.inputs("zipped_images").paths(process_archives=false):
print(archive_path) # images.zip
Filtering
When you have multiple files in multiple folders as input, you sometimes need only a subset.
All the four methods path()
, stream()
, paths()
, streams()
support a wildcard filter.
Here are some examples of how to use the filter:
valohai.inputs("images").paths()
valohai.inputs("images").paths("*.jpg")
valohai.inputs("images").paths("dog_*.jpg")
valohai.inputs("images").paths("training-set/*.jpg")
valohai.inputs("images").paths("images/**/dogs/*.jpg")
Downloading
When you run your code remotely as an execution in the Valohai platform, all the downloading of the inputs is done by the platform.
When you run your code locally, the platform is not there to help. Instead, valohai-utils downloads the files from the input URIs for you.
The files are placed in the automatically generated .valohai/inputs/{step_name}/{input_name}
subfolder.
When the code is re-executed, the library doesn’t try to download the files again but uses the cached ones from the disk. You can force the re-downloading by simply deleting the folder from the disk.
You can also create the .valohai/inputs/{step_name}/{input_name}
folder manually and place some files in it, if you just want to use local files as input instead of downloading from a URI.
Another alternative is to temporarily use a local default for an input:
inputs = {
"images": "/tmp/images.zip",
}
You can also override the default with the command line. See the next section.
Overriding input URIs
Inputs defined by the prepare often have a default value.
There are two ways to override the default (or empty) value:
Command-line parameter (local)
Example (local):
python train.py --images=/tmp/images.zip
python train.py --images=https://alternative.com/images.zip
Valohai UI or CLI (remote)
Example (remote):
vh yaml step train.py
vh exec run -a train --images=https://alternative.com/images.zip