Recap on datums
In Valohai, datums are individual files, each identified by a unique identifier, typically in the format datum://<datum-id>. These identifiers play a crucial role in defining inputs for other executions and can also be aliased to simplify access.
Datums are excellent for single files. However, they become less convenient when you’re dealing with collections of files. Managing and updating inputs for changed data can become quite complex.
To address these challenges, Valohai provides a feature called “Datasets.” Datasets are designed to simplify the handling and tracking of groups of files. You can use datasets as inputs in your executions, version them, create aliases, and effortlessly keep an eye on any changes made to the dataset. This feature is particularly useful when you’re working with multiple related files and need a more organized approach.
Dataset as an input
Much like datum:// links, you can use dataset:// URLs as inputs in your Valohai execution. However, instead of just one file, the entire set of files within the dataset version will be made available under the specified input.
Here’s an example of how to use datasets as inputs in your valohai.yaml:
- step:
    name: my-execution
    image: my-docker-image
    command: python main.py
    inputs:
      - name: my-dataset-input
        description: Input from dataset
        default: dataset://dataset-name/latest
Of course, you can also utilize this address in the URL field within the Valohai web app.
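Inside the execution, the files of the dataset version can then be read from the input's directory. The sketch below assumes the usual Valohai layout, where the files of an input are placed under /valohai/inputs/<input-name>/. The helper name list_input_files is hypothetical, and the root path is a parameter so the function can also be tried outside Valohai.

```python
from pathlib import Path


def list_input_files(input_name, inputs_root="/valohai/inputs"):
    """Return sorted paths of all files found under one input's directory.

    Assumption: the files of an input are placed under
    <inputs_root>/<input_name>/. The helper name is illustrative only.
    """
    input_dir = Path(inputs_root) / input_name
    if not input_dir.is_dir():
        # The input does not exist (for example when running outside Valohai).
        return []
    # rglob also picks up files in subdirectories, if there are any.
    return sorted(p for p in input_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    for path in list_input_files("my-dataset-input"):
        print(path)
```

Because the function returns every file under the input, the same code keeps working when a newer dataset version adds or removes files.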
Create a dataset
You’ll first need to create a dataset, and then a dataset version. The dataset version determines which files are part of the dataset in that specific version.
It’s not possible to edit dataset versions after creation. However, it is possible to use an existing dataset version as the base for a new one. Just click on the three dots in the Dataset Versions table and choose “Create new version from this version”. You can both add and remove files from the new version.
Web app
Dataset
To create a dataset, follow the steps below.
- Go to the Data tab of your project
- Select the Dataset tab
- Click on the Create dataset button
- Choose a Name and Owner for the dataset
  - To share the dataset with your team, mark your organization as the owner.
- Click on the Create button
Dataset version
To add data to your dataset, you will need to create a new version.
- Click on the dataset name
- Click on the Create new version button
- Choose one or more datums to add to the dataset.
  - You can filter the list by, for example, filename, tags, or data store.
- Click on the Add or Add Selected button
- Give a name to the dataset version
- Click on the Save new version button
You can freely add datums to and remove them from the dataset version until you save it.
Programmatically
You can also create a new dataset version from your execution outputs by saving an additional JSON file, *.metadata.json, for each of your execution’s output files.
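import json
# Assign the output file to a dataset version.
metadata = {
    "valohai.dataset-versions": ["dataset://<dataset-name>/<dataset-version-name>"]
}
# `model` stands for your own trained model object; save it as an execution output.
save_path = '/valohai/outputs/model.pkl'
model.save(save_path)
# Write the sidecar next to the output file, named <output-file>.metadata.json.
metadata_path = '/valohai/outputs/model.pkl.metadata.json'
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)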
If the dataset, defined here as <dataset-name>, does not exist, a new one will be created.
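When an execution produces several output files, each one needs its own sidecar. The following is a minimal sketch of how that could be wrapped in helpers. The function names write_dataset_sidecar and tag_all_outputs are hypothetical; only the metadata key and the <output-file>.metadata.json naming are taken from the example above.

```python
import json
from pathlib import Path


def write_dataset_sidecar(output_file, dataset_version_uri):
    """Write <output_file>.metadata.json assigning the file to a dataset version.

    Hypothetical helper; returns the path of the sidecar it wrote.
    """
    output_file = Path(output_file)
    sidecar = output_file.with_name(output_file.name + ".metadata.json")
    metadata = {"valohai.dataset-versions": [dataset_version_uri]}
    with open(sidecar, "w") as outfile:
        json.dump(metadata, outfile)
    return sidecar


def tag_all_outputs(outputs_dir, dataset_version_uri):
    """Write a sidecar for every regular file in outputs_dir, skipping sidecars."""
    written = []
    # sorted() materializes the listing before any new sidecar files appear.
    for path in sorted(Path(outputs_dir).iterdir()):
        if path.is_file() and not path.name.endswith(".metadata.json"):
            written.append(write_dataset_sidecar(path, dataset_version_uri))
    return written
```

In an execution this could be called once all outputs are saved, for example tag_all_outputs("/valohai/outputs", "dataset://<dataset-name>/<dataset-version-name>").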
Just as in the UI, you can also create a new version based on an existing one programmatically, using the .metadata.json sidecar file.
Add files to a dataset version
To add files to an existing dataset, either use the web app or the programmatic approach:
import json
# Create a new dataset version based on an existing one.
metadata = {
    "valohai.dataset-versions": [{
        "uri": "dataset://<dataset-name>/<new-dataset-version-name>",
        "from": "dataset://<dataset-name>/<original-dataset-version-name>",
        "start_fresh": False,
        "exclude": ["exclude1.csv", "exclude2.csv"]
    }]
}
# `model` stands for your own trained model object; save it as an execution output.
save_path = '/valohai/outputs/model.pkl'
model.save(save_path)
# Write the sidecar next to the output file, named <output-file>.metadata.json.
metadata_path = '/valohai/outputs/model.pkl.metadata.json'
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)
Provide the filenames of the datums to be excluded as a list in exclude.
- start_fresh: false means that all the datums of the original dataset version, except for those listed in exclude, will be included in the new dataset version.
- start_fresh: true will exclude all the files of the original dataset version. The advantage is that you can then set the original version as the previous version of the new one without having to include any of its datums.
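The effect of from, start_fresh and exclude on the contents of the new version can be pictured with a small model. This is only an illustration of the rules described above, not Valohai's implementation: the function name is hypothetical, files are reduced to plain filenames, and it is assumed that exclude applies only to files inherited from the original version.

```python
def resolve_version_files(base_files, new_files, start_fresh=False, exclude=()):
    """Model which filenames end up in the new dataset version.

    base_files: filenames in the version referenced by 'from'
    new_files:  filenames of the outputs that carry the sidecar metadata
    Assumption: 'exclude' only filters the files inherited from base_files.
    """
    if start_fresh:
        # Nothing is carried over from the original version.
        inherited = set()
    else:
        # Everything is carried over except the excluded filenames.
        inherited = set(base_files) - set(exclude)
    return sorted(inherited | set(new_files))


base = ["train.csv", "exclude1.csv", "exclude2.csv"]
print(resolve_version_files(base, ["model.pkl"],
                            exclude=["exclude1.csv", "exclude2.csv"]))
# prints ['model.pkl', 'train.csv']
print(resolve_version_files(base, ["model.pkl"], start_fresh=True))
# prints ['model.pkl']
```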
API
You can create a new dataset version by sending a POST request to https://app.valohai.com/api/v0/dataset-versions/. The dataset, version name, and files to include are defined in the request body.
{
    "name": "<version-name>",
    "dataset": "<dataset-UUID>",
    "files": [
        {"datum": "<datum-UUID>"},
        {"datum": "<datum-UUID>"},
        {"datum": "<datum-UUID>"}
    ]
}
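As a rough sketch, the request could be assembled in Python with the standard library as follows. The function name build_dataset_version_request is hypothetical, and the Authorization header with a Token prefix is an assumption about how the API token is passed, so check it against the API documentation. The function only builds the request and does not send it.

```python
import json
import urllib.request

API_URL = "https://app.valohai.com/api/v0/dataset-versions/"


def build_dataset_version_request(token, dataset_id, version_name, datum_ids):
    """Build (but do not send) the POST request for a new dataset version."""
    payload = {
        "name": version_name,
        "dataset": dataset_id,
        "files": [{"datum": datum_id} for datum_id in datum_ids],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: token-based authentication header.
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send it, pass the returned object to urllib.request.urlopen and read the JSON response. Keeping the construction separate from the network call makes the payload easy to inspect first.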