Why can’t I upload files larger than 5GB?
Valohai requires an additional role in your AWS S3 Data Store configuration to be able to upload large files from executions.
Follow the how-to guide in our documentation to configure the role in AWS, and then update your Data Store configuration in Valohai to include the multipart role.
Why am I getting Network Error when uploading a file?
You can upload files that are less than 5GB through Valohai’s data upload tab. Large files should be uploaded directly using your cloud provider’s tools.
A common reason for a network error is that you haven’t defined the CORS policy for your cloud object storage. Check our documentation for details on how to configure your object data store.
How can I find the latest production version of my model file?
We suggest using an alias to keep track of the latest version of a specific file.
Data aliases are aliases to Valohai datum URLs, enabling you to update the data being used without any changes to code. You can create a data alias in your project and use it as an input for your executions.
Create an alias in the web app
- Open your project
- Open the Data tab
- Open the Aliases tab
- Select Create a new datum alias
- Give the alias a name (for example dataset-a-latest) and select which file the alias should point to
- Now you can use datum://dataset-a-latest as an input for your execution. Valohai will resolve it to the file that is currently marked for that alias and run the job with that specific file.
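If you define your steps with valohai-utils, a minimal sketch of using the alias as an input could look like the snippet below. The step name train and input name model are placeholders, not part of the example above.
import valohai

# Declare a step whose input resolves through the data alias.
# Valohai resolves datum://dataset-a-latest to the current file at execution start.
valohai.prepare(
    step="train",
    default_inputs={
        "model": "datum://dataset-a-latest",
    },
)

# Inside the execution, read the resolved local copy of the file.
model_path = valohai.inputs("model").path()
print(f"Using model file at {model_path}")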
Create or update an alias programmatically
You can create or update an alias to point to a new file whenever saving a file to Valohai outputs.
In addition to saving the file, you’ll need to create and save a JSON file that tells Valohai to update the alias to this new file.
The name of the JSON file is always yourfile.ext.metadata.json. If you're saving a file called dataset.csv, then the JSON file needs to be called dataset.csv.metadata.json.
import valohai
import json
import pandas
# Some sample data
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
df = pandas.DataFrame(data)
save_path = valohai.outputs().path('dataset.csv')
df.to_csv(save_path) # save your dataframe as a CSV file
# Create a sidecar file for Valohai
# Tell Valohai to update the `dataset-a-latest` alias to point to this file
metadata = {
    "valohai.alias": "dataset-a-latest",  # creates or updates a Valohai data alias to point to this output file
}
metadata_path = valohai.outputs().path('dataset.csv.metadata.json')
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)
Where can I find a link to a file that was created in a Valohai execution?
You can download metadata as a CSV or JSON file from one or multiple executions directly from the metadata tab.
The Download raw data button is visible on the Metadata tab. You can find it on:
- A single execution
- Multiple executions, by selecting the executions in the Executions tab and clicking compare.
- A Task
You can also download metadata using the Valohai APIs:
- A single execution: https://app.valohai.com/api/v0/executions/{id}/metadata/
- Download metadata from a single execution as a CSV file: https://app.valohai.com/api/v0/executions/multi_download_metadata_csv/
- From multiple executions: https://app.valohai.com/api/v0/executions/multi_metadata/
- Send a POST request with the execution IDs:
{"ids":["01730928-f2c3-b962-3d71-addd235c2f09","0173dcc6-5c12-9cbb-8713-71d75e654a6e"]}
Can I use output files across different projects?
Yes. The datum URI works across projects, as long as the projects have access to the same cloud data store.
403: Forbidden when trying to download inputs to an execution
Valohai uses the authentication credentials specified for your project to download data from your cloud object storage.
This can be set either on the organization level by the org admins or for a single project under the project settings.
What’s the difference between datum:// and the cloud storage link?
- Once a file is uploaded to Valohai, or saved through an execution, you'll receive a datum:// link for that file.
- You'll also get a link to your own data store if you've configured a new default data store in your project.
We suggest using the datum:// link when you're in the same project. This way Valohai will keep track of how the file is being used and knows where the data originated from.
How can I use a whole folder as an input for an execution?
Valohai inputs can be a single file or multiple files. You can either list the files you want for your input or use a wildcard to download all the files.
- step:
    name: train
    image: tensorflow/tensorflow:2.6.0
    command: python train.py
    inputs:
      - name: model
        default:
          - s3://mybucket/models/model.pb
      - name: images
        default:
          - s3://mybucket/images/*.jpg
        keep-directories: suffix
The keep-directories setting defines what folder structure Valohai should use in the inputs folder.
- none (default): All files are downloaded to /valohai/inputs/myinput.
- full: Keeps the full path from the storage root. For example, s3://special-bucket/foo/bar/**.jpg could end up as /valohai/inputs/myinput/foo/bar/dataset1/a.jpg.
- suffix: Keeps the suffix from the "wildcard root." For example, with s3://special-bucket/foo/bar/*, the special-bucket/foo/bar/ prefix would be removed, but any relative path after it would be kept, so you might end up with /valohai/inputs/myinput/dataset1/a.jpg.
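If you use valohai-utils in your code, one way to iterate over all the files downloaded for a wildcard input is sketched below; the input name images matches the example step above.
import valohai

# Iterate over every file downloaded for the "images" input,
# regardless of how many files the wildcard matched.
for image_path in valohai.inputs("images").paths():
    print(f"Found input file: {image_path}")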
How can I view outputs mid-execution?
You can use live outputs to upload files mid-execution.
Normally, executions upload all files stored in /valohai/outputs at the end of the execution, even if the code crashes or is stopped. But in some long-running workloads, you might want to save checkpoints or other artifacts mid-execution.
When Valohai detects that a file under /valohai/outputs is marked as read-only, it will remove that file from the directory and upload it right away. All programming languages and shells have a way to mark files read-only, e.g., os.chmod in Python.
A simple example:
echo hello >> /valohai/outputs/greeting.txt
chmod 0444 /valohai/outputs/greeting.txt
sleep 30
echo bye >> /valohai/outputs/farewell.txt
# => Generates 2 files:
# - 'greeting.txt' with 'hello', uploaded right away
# - 'farewell.txt' with 'bye', uploaded after the execution finishes
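A rough Python equivalent of the shell example above, using os.chmod to mark the file read-only:
import os
import stat

# Write a checkpoint-style file to the outputs directory.
output_path = "/valohai/outputs/greeting.txt"
with open(output_path, "w") as f:
    f.write("hello\n")

# Marking the file read-only (0444) tells Valohai to upload it right away,
# instead of waiting for the execution to finish.
os.chmod(output_path, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)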
How can I delete output files?
You can purge files in Valohai to delete them from your cloud storage.
Option 1) Delete files from a project
- Open the Data tab in the project
- Click on the … at the end of each row and click Purge to delete the file.
Option 2) Delete files from an execution’s outputs
- Open an execution and go to the outputs tab
- Click on Purge all outputs to remove all files, or delete files one by one from the … menu in the table below
Option 3) Delete an execution and all of its files
- Click on the checkbox next to one or multiple executions in the executions table in your project
- Click on Delete and check the Purge all outputs too checkbox to delete all the files generated from these executions
Error: “No connection adapters were found for”
This error message appears when you run a Python script that uses valohai-utils locally on your machine. valohai-utils can't download the files because they are in your private storage behind authentication.
When running locally, you will already have a folder called .valohai/inputs with subfolders named after each step that you tried to run.
You need to download or move the files into the correct local subfolder on your machine so valohai-utils can access them.
The files need to be renamed in the following way:
- Datum: for a datum:// default URL, the filename should be the datum UUID with no file extension, for example 0181663c-4305-2cf0-298e-eb670f5c06dc.
- S3: for an S3 URL (s3://mybucket/images/data.tar.gz), you should have a file called data.tar.gz.
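As a rough illustration of placing such a file, the sketch below copies a downloaded file into the local inputs folder. The layout .valohai/inputs/&lt;step name&gt;/&lt;input name&gt;/ and the names train and images are assumptions here; match them to the folders valohai-utils created for your own steps.
import shutil
from pathlib import Path

# Assumed layout: .valohai/inputs/<step name>/<input name>/
local_input_dir = Path(".valohai/inputs/train/images")
local_input_dir.mkdir(parents=True, exist_ok=True)

# For an S3 input such as s3://mybucket/images/data.tar.gz,
# the local copy keeps the original filename.
shutil.copy(Path("~/Downloads/data.tar.gz").expanduser(), local_input_dir / "data.tar.gz")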
Unable to fetch files from S3 using boto3 and instance profile
In most cases, we recommend using Valohai inputs for any files you need during your executions. That way, the information will get tracked, and anyone with access to your project can rerun the execution, as long as the files still exist in your data store.
That being said, sometimes you might need to get access to an S3 bucket from inside the execution. When using boto3, you will either need to provide the credentials (recommended to use secret environment variables) or you can allow the access based on the machine instance profile.
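For reference, a minimal boto3 sketch that reads the credentials from (secret) environment variables could look like this; the bucket name, object key, and target path are placeholders.
import os

import boto3

# boto3 also picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
# environment automatically; passing them explicitly just makes the dependency visible.
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Placeholder bucket, key, and local target path.
s3.download_file("mybucket", "images/data.tar.gz", "/tmp/data.tar.gz")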
When using the instance profile, you might find that the items in the bucket are encrypted with a custom AWS KMS key that the instance profile doesn't have access to. This will result in the following error:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: The ciphertext refers to a customer master key that does not exist, does not exist in this region, or you are not allowed to access.
Follow these instructions from AWS knowledge center to give the IAM Role permissions to the KMS Key used.
You can verify which KMS key is being used by:
- Open the AWS Console’s S3 page.
- Open your bucket and navigate to one of the files you’re trying to download.
- Click the checkbox next to one of the folders/files and click “Edit server-side encryption” to review which key is being used there.
It looks something like this: Encryption key ARN arn:aws:kms:eu-west-1:ACCOUNT:key/111aa2bb-333c-4d44-5555-a111bb2c33dd.
How to Detect Data Freshness for Outputs
To detect data freshness for outputs, you have a few options. Here are some recommendations:
- Logging File Metadata: One way to detect data freshness is by logging the time as metadata using the .metadata.json sidecar file. Whenever an output is generated, include the timestamp as metadata. Later, when you need to check the freshness of the data, you can retrieve the metadata and compare it with the current time (a sketch of this approach appears at the end of this answer).
- File Modification/Creation Time: Another approach is to check the file modification or creation time. This method requires you to load the files into an execution and inspect their timestamps. By comparing the file timestamps with the desired freshness criteria, you can determine if the data is up to date.
- Other Methods: There may be other ways to detect data freshness depending on your specific use case. For example, you could use APIs like the DatumRetrieve API endpoint to query information about the data and determine its freshness.
It’s important to note that Valohai does not have a built-in feature specifically for detecting the latest files or data freshness for outputs. However, you can implement one of the above methods to achieve this functionality.
Remember to adjust the freshness threshold according to your specific requirements. For example, you could create a daily triggered execution that checks a project's outputs, or the latest dataset files, through the API. If they are older than X days, the execution would trigger a new pipeline or execution to process the data.
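As a sketch of the first option above, you could record the creation time in the .metadata.json sidecar file when saving an output. The key name generated_at is an example of a custom property, not a reserved Valohai key.
import json
from datetime import datetime, timezone

import valohai

# Save the actual output file (here just a placeholder text file).
output_path = valohai.outputs().path('report.txt')
with open(output_path, 'w') as f:
    f.write('example output\n')

# Sidecar metadata: record when the file was produced so a later job
# can compare this timestamp against the current time.
metadata = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
}
metadata_path = valohai.outputs().path('report.txt.metadata.json')
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)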