Recap on datums
In Valohai, datums are individual files, each identified by a unique identifier, typically in the format datum://<datum-id>. These identifiers are used to define inputs for other executions and can also be aliased to simplify access.
Datums are excellent for single files. However, they become less convenient when you’re dealing with collections of files. Managing and updating inputs for changed data can become quite complex.
To address these challenges, Valohai provides a feature called “Datasets.” Datasets are designed to simplify the handling and tracking of groups of files. You can use datasets as inputs in your executions, version them, create aliases, and effortlessly keep an eye on any changes made to the dataset. This feature is particularly useful when you’re working with multiple related files and need a more organized approach.
Dataset as an input
Much like datum:// links, you can use dataset:// URLs as inputs in your Valohai execution. However, instead of just one file, the entire set of files within the dataset version will be made available under the specified input.
Here’s an example of how to use datasets as inputs in your valohai.yaml:
- step:
    name: my-execution
    image: my-docker-image
    command: python main.py
    inputs:
      - name: my-dataset-input
        description: Input from dataset
        default: dataset://dataset-name/latest
You can also enter the same dataset:// address in the URL field of an input within the Valohai web app.
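Inside the execution, the files of a dataset input can then be read from the input's directory. The following is a minimal sketch, assuming the usual Valohai convention that input files are placed under /valohai/inputs/<input-name>/; the helper name list_input_files is illustrative and not part of any Valohai library.

```python
from pathlib import Path


def list_input_files(input_dir):
    """Return all files found under an input directory, sorted by path."""
    root = Path(input_dir)
    if not root.is_dir():
        # Nothing to list, e.g. when running outside of Valohai
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Assumption: the input is named "my-dataset-input", as in the YAML example
    for path in list_input_files("/valohai/inputs/my-dataset-input"):
        print(path)
```

Listing the directory instead of hard-coding file names keeps the code working when a newer dataset version adds or removes files.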
Create a dataset
You’ll first need to create a dataset, and then a dataset version. The dataset version determines which files are part of the dataset in that specific version.
It’s not possible to edit dataset versions after creation. However, it is possible to use an existing dataset version as the base for a new one. Just click on the three dots in the Dataset Versions table and choose “Create new version from this version”. You can both add and remove files from the new version.
Web app
Dataset
To create a dataset, follow the steps below.
- Go to the Data tab of your project
- Select the Dataset tab
- Click on the Create dataset button
- Choose a Name and Owner for the dataset
- To share the dataset with your team, mark your organization as the owner.
- Click on the Create button
Dataset version
To add data to your dataset, you will need to create a new version.
- Click on the dataset name
- Click on the Create new version button
- Choose one or more datums to add to the dataset.
- You can filter the list by, for example, filename, tags or data store.
- Click on the Add or Add Selected button
- Give a name to the dataset version
- Click on the Save new version button
You can freely add datums to and remove them from the dataset version until you save it.
Programmatically
You can also create a new dataset version from your execution outputs by saving an additional metadata file alongside them. The file is in JSONL format, with one line of JSON for each of your execution’s output files that should be part of the dataset version.
import json

# Map each output file to the dataset version(s) it should belong to.
# The data files themselves must also be saved to /valohai/outputs/.
metadata = {
    "data_file_1.csv": {
        "valohai.dataset-versions": ["dataset://<dataset-name>/<dataset-version-name>"]
    },
    "data_file_2.csv": {
        "valohai.dataset-versions": ["dataset://<dataset-name>/<dataset-version-name>"]
    },
}

# Save dataset metadata, one JSON object per line
metadata_path = "/valohai/outputs/valohai.dataset.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
If the dataset, defined here as <dataset-name>, does not exist, a new one will be created.
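When many output files belong to the same dataset version, the metadata lines can be generated instead of written out by hand. This is a minimal sketch that produces the same line format as the example above; the function names and the example dataset URI in the usage note are hypothetical.

```python
import json


def dataset_metadata_lines(file_names, dataset_version_uri):
    """Build one JSONL line per output file, tagging each with the dataset version."""
    lines = []
    for file_name in file_names:
        entry = {
            "file": file_name,
            "metadata": {"valohai.dataset-versions": [dataset_version_uri]},
        }
        lines.append(json.dumps(entry))
    return lines


def write_dataset_metadata(metadata_path, file_names, dataset_version_uri):
    """Write the metadata file with one JSON object per line."""
    with open(metadata_path, "w") as outfile:
        for line in dataset_metadata_lines(file_names, dataset_version_uri):
            outfile.write(line + "\n")
```

For example, write_dataset_metadata("/valohai/outputs/valohai.dataset.metadata.jsonl", ["data_file_1.csv", "data_file_2.csv"], "dataset://my-dataset/v1") would write two lines, one per file.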
Add files to a dataset version
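To add files to an existing dataset version, either use the web app or the programmatic approach. Because saved versions cannot be edited, the programmatic approach creates a new version (uri) that is based on the original one (from):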
import json

metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [{
            "uri": "dataset://<dataset-name>/<new-dataset-version-name>",
            "from": "dataset://<dataset-name>/<original-dataset-version-name>",
            "start_fresh": False,
            "exclude": ["exclude1.csv", "exclude2.csv"]
        }]
    }
}

# Save the new file to the outputs directory.
# `model` is assumed to be your own trained model object with a save() method.
save_path = "/valohai/outputs/model.pkl"
model.save(save_path)

# Save dataset metadata, one JSON object per line
metadata_path = "/valohai/outputs/valohai.dataset.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
Provide the filenames of the datums to be excluded as a list in exclude. The start_fresh flag controls what is carried over from the original version:
- start_fresh: False means that all the datums from the original version, except for those listed in exclude, will be included in the new dataset version.
- start_fresh: True will exclude all the files from the original version. The advantage is that you can then set the original version as the previous version of the new one without having to include any of its datums.
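The effect of start_fresh and exclude described above can be illustrated with a small local model. This is only a sketch of the documented semantics for reasoning about what a new version will contain, not platform code; the function name expected_files is hypothetical.

```python
def expected_files(base_files, new_files, exclude=(), start_fresh=False):
    """Return the file names expected in the new dataset version.

    base_files: files in the original ("from") version
    new_files: files added by the current execution
    """
    if start_fresh:
        # Nothing is carried over from the original version
        kept = set()
    else:
        # Everything is carried over except the excluded files
        kept = set(base_files) - set(exclude)
    return sorted(kept | set(new_files))


if __name__ == "__main__":
    base = ["train.csv", "exclude1.csv", "exclude2.csv"]
    print(expected_files(base, ["model.pkl"], exclude=["exclude1.csv", "exclude2.csv"]))
    print(expected_files(base, ["model.pkl"], start_fresh=True))
```

With start_fresh set to False, the first call keeps train.csv and adds model.pkl; with start_fresh set to True, the second call contains only model.pkl.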
Legacy Approach of Creating a Dataset
In the legacy approach, dataset metadata was stored in separate files for each data file.
For example:
import json

metadata = {
    "valohai.dataset-versions": ["dataset://<dataset-name>/<dataset-version-name>"]
}

save_path = '/valohai/outputs/data_file.csv'
# Save your data file here

# The metadata file is named after the data file it describes
metadata_path = '/valohai/outputs/data_file.csv.metadata.json'
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)
This approach still works, so if you have it in your project, you can continue using it. However, the JSONL-based method is recommended for consolidating metadata for multiple files into a single file. For more details on managing metadata for multiple files, see Data > Files > Save additional context.
API
You can create a new dataset version by sending a POST request to https://app.valohai.com/api/v0/dataset-versions/. The dataset, version name and files to include are defined in the request body.
{
    "name": "<version-name>",
    "dataset": "<dataset-UUID>",
    "files": [
        {"datum": "<datum-UUID>"},
        {"datum": "<datum-UUID>"},
        {"datum": "<datum-UUID>"}
    ]
}
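The request above could be prepared in Python roughly as follows. This is a sketch using only the standard library; the function name is hypothetical, and the token-based Authorization header is an assumption about how your API token is passed, so verify it against your own API setup before use.

```python
import json
import urllib.request

API_URL = "https://app.valohai.com/api/v0/dataset-versions/"


def build_dataset_version_request(token, version_name, dataset_id, datum_ids):
    """Prepare (but do not send) the POST request that creates a dataset version."""
    payload = {
        "name": version_name,
        "dataset": dataset_id,
        "files": [{"datum": datum_id} for datum_id in datum_ids],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: token-based authentication header
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen inside a with block and inspect the response status. Building the request separately from sending it makes the payload easy to check first.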
Troubleshooting Failed Dataset Version Creation
Problem:
When trying to programmatically create a dataset version from execution outputs, the execution completes without errors, but the dataset does not appear as expected.
How to Resolve:
A common cause of this problem is an incorrectly formatted dataset version name.
Steps to Fix:
- Check Dataset Version Naming: Make sure the name of the dataset version is correctly formatted according to the platform’s guidelines. Errors in naming can prevent the dataset version from being created properly.
- Review Execution Alerts: Look for detailed information about why the dataset version was not created under the Alerts tab of the execution. This area will list any specific errors or warnings that occurred during the dataset creation.
- Need More Help? If you need further assistance, feel free to drop a message to support@valohai.com.
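Before digging into the Alerts tab, it can help to sanity-check the metadata file locally. The sketch below only verifies the basic structure used in the examples on this page (valid JSON per line, file and metadata keys, and URIs shaped like dataset://<dataset-name>/<dataset-version-name>). It does not implement the platform’s actual naming rules, and the function names are hypothetical.

```python
import json


def is_dataset_uri(uri):
    """Check for the shape dataset://<dataset-name>/<dataset-version-name>."""
    prefix = "dataset://"
    if not isinstance(uri, str) or not uri.startswith(prefix):
        return False
    parts = uri[len(prefix):].split("/")
    return len(parts) == 2 and all(parts)


def check_metadata_lines(lines):
    """Return a list of human-readable problems found in JSONL metadata lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        metadata = entry.get("metadata") if isinstance(entry, dict) else None
        if not isinstance(metadata, dict) or "file" not in entry:
            problems.append(f"line {number}: expected 'file' and 'metadata' keys")
            continue
        for version in metadata.get("valohai.dataset-versions", []):
            # Entries are either plain URI strings or dicts with a "uri" key
            uri = version.get("uri") if isinstance(version, dict) else version
            if not is_dataset_uri(uri):
                problems.append(f"line {number}: unexpected dataset URI {uri!r}")
    return problems
```

Calling check_metadata_lines(open(metadata_path)) returns an empty list when no structural problems are found.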