Download input data

Note

This tutorial is a part of our Valohai Fundamentals series.

By defining inputs you can easily download data from a public address or your private cloud storage.

In this section you will learn:

  • How to define Valohai inputs

  • How to change inputs between executions both in the CLI and in the UI

A short introduction to inputs

  • Valohai inputs can be from a public location (HTTP/HTTPS) or from your private cloud storage (AWS S3, Azure Storage, GCP Cloud Storage, OpenStack Swift)

  • The input values you define in your code are default values. You can replace any defined input file(s) when creating an execution from the UI, command line, or API.

  • All inputs are downloaded to /valohai/inputs/<input-name>/ and are available there throughout the execution (see the sketch after this list)

  • An input value can be a single file (e.g. myimages.tar.gz), or you can use a wildcard to download multiple files from private cloud storage (e.g. s3://mybucket/images/*.jpeg)

  • You can interpolate parameter values into input URIs with the syntax s3://mybucket/images/{parameter:myparam}/**.jpeg. This is particularly useful in tasks, where you can easily run the same execution against multiple datasets.

  • Valohai inputs are cached on the virtual machine.

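Because each input is simply a directory of downloaded files, you can inspect it with nothing but the standard library. Below is a minimal sketch, assuming an input named dataset (the input name and the print format are illustrative only):

import os
from pathlib import Path

# Every input is downloaded to /valohai/inputs/<input-name>/ before your code runs.
# 'dataset' is an example input name; substitute the name you defined.
input_dir = Path('/valohai/inputs/dataset')

# List every file Valohai downloaded for this input, whether it came from
# a single URL or a wildcard such as s3://mybucket/images/*.jpeg.
for file_path in sorted(input_dir.iterdir()):
    print(file_path.name, file_path.stat().st_size, 'bytes')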
Update train.py to add inputs:

  • Create a dictionary of your inputs and their default values, and pass it to valohai.prepare

  • Update the file path (mnist_file_path) so it points to the Valohai input; in the code below it is renamed input_path.

You should also remove the mnist.npz from your local machine.

import numpy as np
import tensorflow as tf
import valohai


valohai.prepare(
    step='train-model',
    image='tensorflow/tensorflow:2.6.0',
    default_inputs={
        'dataset': 'https://valohaidemo.blob.core.windows.net/mnist/mnist.npz'
    },
    default_parameters={
        'learning_rate': 0.001,
        'epochs': 5,
    },
)

input_path = valohai.inputs('dataset').path()
with np.load(input_path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']

x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])

optimizer = tf.keras.optimizers.Adam(learning_rate=valohai.parameters('learning_rate').value)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer,
              loss=loss_fn,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=valohai.parameters('epochs').value)

model.evaluate(x_test, y_test, verbose=2)

output_path = valohai.outputs().path('model.h5')
model.save(output_path)

Run in Valohai

Update your valohai.yaml config file with the vh yaml step command. This will generate an inputs section in your step, as sketched below.

vh yaml step train.py

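If you want to verify the result, the generated step should look roughly like the sketch below. The command lines here are illustrative and depend on how you run your script; the parameter and input definitions come from the valohai.prepare call above.

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command:
      - pip install valohai-utils
      - python ./train.py {parameters}
    parameters:
      - name: learning_rate
        default: 0.001
        type: float
      - name: epochs
        default: 5
        type: integer
    inputs:
      - name: dataset
        default: https://valohaidemo.blob.core.windows.net/mnist/mnist.npz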
Finally, run a new Valohai execution.

vh exec run train-model --adhoc

Rerun an execution with different input data

  • Open your project on app.valohai.com

  • Open the latest execution

  • Click Copy

  • Scroll down to the Inputs section and remove the current input.

  • You can now either pass in a new URI or select an input from the Data list (for example, if you’ve uploaded a file)

  • Click Create execution

Tip

You can also run a new execution with a different input value from the command line:

vh exec run train-model --adhoc --dataset=https://myurl.com/differentfile.npz