Download input data

Note

This tutorial is a part of our Valohai fundamentals series.

By defining inputs you can easily download data from a public address or your private cloud storage.

In this section you will learn:

  • How to define Valohai inputs

  • How to change inputs between executions both in the CLI and in the UI

A short introduction to inputs

  • Valohai downloads data files from your private cloud storage, for example AWS S3, Azure Storage, or GCP Cloud Storage, or from a public source (HTTP/HTTPS).

  • Valohai handles both authentication with your cloud storage and downloading, uploading, and caching data.

    • This means that you don’t need to manage keys, handle authentication, or use tools like boto3, gsutil, or BlobClient in your code. Instead, you can always treat the data as local files.

  • All Valohai machines have a local directory /valohai/inputs/ where all your datasets are downloaded. Each input has its own subdirectory, for example /valohai/inputs/images/ and /valohai/inputs/model/.

  • Each step in your valohai.yaml can contain one or multiple input definitions and each input can contain one or multiple files. For example, in a batch inference step you could have a trained model file and a set of images you want to run the inference on.

  • Each input in valohai.yaml can have a default value. These values can be overridden any time you run a new execution, for example to change the set of images you want to run batch inference on.
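The points above can be sketched without any Valohai library at all: inputs simply appear as local files under a per-input directory. A minimal illustration (the "dataset" input name is the one defined later in this tutorial; VH_INPUTS_DIR is the environment variable Valohai sets on its machines, hedged here with a fallback for running outside Valohai):

```python
import os

# Valohai downloads each input into /valohai/inputs/<input-name>/.
# VH_INPUTS_DIR points at that root on Valohai machines; the fallback
# below is only for illustration outside Valohai.
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '/valohai/inputs')

def input_file_path(input_name: str, file_name: str) -> str:
    """Build the local path where a Valohai input file is downloaded."""
    return os.path.join(VH_INPUTS_DIR, input_name, file_name)

# Treat the data as local files -- no boto3, gsutil, or BlobClient needed.
print(input_file_path('dataset', 'mnist.npz'))
```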

Let’s start by defining the inputs for our train-model step.

Update valohai.yaml to define new inputs:

- step:
    name: train-model
    command:
      - pip install -r requirements.txt
      - python train.py {parameters}
    image: tensorflow/tensorflow:2.6.0
    parameters:
      - name: epoch
        type: integer
        default: 5
      - name: learning_rate
        type: float
        default: 0.001
    inputs:
      - name: dataset
        default: https://valohaidemo.blob.core.windows.net/mnist/mnist.npz
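As mentioned above, a single step can define several inputs, and an input can hold several files. A hypothetical batch-inference step might look like the sketch below (the step name, script, input names, and URLs are illustrative only, not part of this tutorial's project):

```yaml
- step:
    name: batch-inference
    image: tensorflow/tensorflow:2.6.0
    command:
      - python predict.py
    inputs:
      - name: model
        default: https://myurl.com/model.h5
      - name: images
        default:
          - https://myurl.com/image1.png
          - https://myurl.com/image2.png
```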

Update train.py to read the MNIST file from the Valohai inputs directory instead of a local path.

You should also remove the mnist.npz from your local machine.

import numpy as np
import tensorflow as tf
import valohai

# Read the dataset from the Valohai inputs directory instead of a local file.
input_path = valohai.inputs('dataset').path()
with np.load(input_path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']

# Scale pixel values to the [0, 1] range.
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])

optimizer = tf.keras.optimizers.Adam(learning_rate=valohai.parameters('learning_rate').value)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer,
              loss=loss_fn,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=valohai.parameters('epoch').value)

model.evaluate(x_test, y_test, verbose=2)

# Save the trained model to the Valohai outputs directory so it gets uploaded.
output_path = valohai.outputs().path('model.h5')
model.save(output_path)
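Outputs mirror the same idea as inputs: valohai.outputs().path('model.h5') resolves to a file path under the local outputs directory (/valohai/outputs/ by default), and Valohai uploads whatever is written there once the execution finishes. A stdlib-only sketch of that path logic (VH_OUTPUTS_DIR is the environment variable Valohai sets on its machines; the fallback is only for illustration):

```python
import os

# Everything written under /valohai/outputs/ is uploaded to your
# configured data store when the execution ends.
VH_OUTPUTS_DIR = os.getenv('VH_OUTPUTS_DIR', '/valohai/outputs')

def output_file_path(file_name: str) -> str:
    """Build the local path where an output file should be saved."""
    return os.path.join(VH_OUTPUTS_DIR, file_name)

print(output_file_path('model.h5'))
```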

Run in Valohai

Finally, run a new Valohai execution:

vh exec run train-model --adhoc

Rerun an execution with different input data

  • Open your project on app.valohai.com

  • Open the latest execution

  • Click Copy

  • Scroll down to the Inputs section and remove the current input.

  • You can now either pass in a new URI or select an input from the Data list (for example, if you’ve uploaded a file)

  • Click Create execution

Tip

You can also run a new execution with a different input value from the command line. Note that the option name matches the input name defined in valohai.yaml (here dataset):

vh exec run train-model --adhoc --dataset=https://myurl.com/differentfile.npz