Import Existing Cloud Files
Import files that already exist in your cloud storage into Valohai's data catalog. This creates datum:// links for external files without moving or re-uploading data.
When to Use This
Import existing cloud files when you need to:
Track legacy data — Bring pre-existing datasets into Valohai's lineage system
Use external datasets — Access data uploaded by other teams or processes
Avoid re-uploading — Create datum links for large files already in your cloud storage
Migrate to Valohai — Import historical data when adopting Valohai for existing projects
Share team data — Make files uploaded directly to cloud storage discoverable in Valohai
What This Does
"Adopting" files creates Valohai datum records pointing to your existing cloud storage files.
💡 Important: Files remain in your cloud storage. Nothing is copied or moved. Valohai creates tracking metadata so you can use these files like any other Valohai data.
After importing, you can:
Use files as inputs in executions (see the sketch after this list)
View and search files in the Valohai UI
Add tags, aliases, and properties
Track lineage and usage
Create datasets from imported files
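For example, once a file is imported you can reference its datum ID as an execution input. Below is a minimal sketch using the valohai-utils helper library; the step name, input name, and datum ID are placeholder values:

import valohai

# Declare a step whose default input points at an imported file.
# Replace the datum:// URL with the ID returned when you adopted the file.
valohai.prepare(
    step="train",
    default_inputs={
        "training-data": "datum://017a515f-30a4-d0f1-d37a-53ffc38e90c7",
    },
)

# Inside the execution, resolve the local path of the downloaded input.
csv_path = valohai.inputs("training-data").path()
print(f"Training data available at {csv_path}")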
Requirements
Before importing files:
Files Must Be Accessible
The files must exist in a data store that's already configured in your Valohai project. Valohai uses the store's credentials to verify file existence.
Learn more: Configure Data Stores
You Need Project Permissions
You must have permission to add data to the project.
Files Must Use Correct URL Format
Use your cloud provider's native URL format:
AWS S3:
s3://bucket-name/path/to/file.ext
Google Cloud Storage:
gs://bucket-name/path/to/file.ext
Azure Blob Storage:
azure://account/container/path/to/file.ext
OpenStack Swift:
swift://project/container/path/to/file.ext
Import via Web UI
Open your project
Navigate to the Data tab
Click the Adopt tab
Select the Destination store from the dropdown menu
Enter file URLs to import (one per line)
Click Adopt selected files
Valohai verifies each file exists and creates datum records.
⚠️ Performance warning: Importing many files (100+) via the web UI can be slow and may be interrupted by network issues or browser timeouts. For bulk imports, use the API with retry logic instead.
Example input:
s3://my-bucket/datasets/training-data-v1.csv
s3://my-bucket/datasets/training-data-v2.csv
s3://my-bucket/models/baseline-model.pkl
Import via API
Use the Valohai API for bulk imports or automated workflows.
Prerequisites
Get your API token from Valohai account settings
Get your project ID from Project → Settings in the Valohai UI
Get your data store ID:
import os
import requests

# List the data stores you have access to and note the "id" of the store
# that contains the files you want to import.
response = requests.get(
    'https://app.valohai.com/api/v0/stores/',
    headers={'Authorization': 'Token ' + os.getenv('VH_TOKEN')}
)
stores = response.json()
# Find your store ID in the response
Import Single File
import os
import requests

store_id = "your-store-id-here"
project_id = "your-project-id-here"

payload = {
    "url": "s3://my-bucket/datasets/training-data.csv",
    "root_path": "s3://my-bucket/",
    "project": project_id
}

response = requests.post(
    f'https://app.valohai.com/api/v0/stores/{store_id}/adopt/',
    json=payload,
    headers={
        'Authorization': 'Token ' + os.getenv('VH_TOKEN'),
        'Content-Type': 'application/json'
    }
)

# Handle response
if response.ok:
    result = response.json()
    if result.get('ok'):
        print(f"Success! Datum ID: {result['created']}")
    else:
        print(f"Error: {result['message']}")
else:
    print(f"HTTP Error: {response.status_code}")
    print(response.text)
Import Multiple Files with Error Handling
import os
import time

import requests

token = os.getenv('VH_TOKEN')
store_id = "your-store-id-here"
project_id = "your-project-id-here"

files_to_import = [
    "s3://my-bucket/datasets/train.csv",
    "s3://my-bucket/datasets/validation.csv",
    "s3://my-bucket/models/baseline.pkl"
]
root_path = "s3://my-bucket/"

def adopt_files(urls, root_path, store_id, project_id, token: str, retries: int = 3):
    """Adopt a batch of cloud files into Valohai, retrying on network errors."""
    headers = {"Authorization": f"Token {token}"}
    payload = {
        "urls": urls,
        "root_path": root_path,
        "project": project_id
    }
    endpoint = f"https://app.valohai.com/api/v0/stores/{store_id}/adopt/"
    for attempt in range(retries):
        try:
            res = requests.post(endpoint, json=payload, headers=headers, timeout=30)
            result = res.json()
            if res.ok and result.get("ok"):
                return {"success": True, "datums": list(result["created"].values())}
            return {
                "success": False,
                "error": result.get("message", res.text),
                "code": result.get("code", res.status_code),
            }
        except requests.exceptions.RequestException as e:
            if attempt < retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Retry {attempt + 1}/{retries} after {wait_time}s...")
                time.sleep(wait_time)
            else:
                return {"success": False, "error": f"Request failed: {e}", "code": -1}

adopt_result = adopt_files(
    urls=files_to_import,
    root_path=root_path,
    store_id=store_id,
    project_id=project_id,
    token=token,
)

if adopt_result['success']:
    print(f"✓ Success: {adopt_result['datums']}")
else:
    print(f"✗ Error: {adopt_result['error']}")
The root_path Parameter
This parameter looks similar to the upload path used when uploading files manually, but it works in the opposite direction.
root_path is an optional common prefix that is stripped from every URL to produce the path under which each file is imported.
Example:
urls = [
    "s3://some-bucket/dir_1/dir_2/file_1.txt",
    "s3://some-bucket/dir_1/dir_2/file_2.txt",
    "s3://some-bucket/dir_1/dir_3/file_2.txt"
]
┌─ root_path = s3://some-bucket
│ └─→ dir_1/dir_2/file_1.txt
│ └─→ dir_1/dir_2/file_2.txt
│ └─→ dir_1/dir_3/file_2.txt
│
┌─ root_path = s3://some-bucket/dir_1
│ └─→ dir_2/file_1.txt
│ └─→ dir_2/file_2.txt
│ └─→ dir_3/file_2.txt
│
┌─ root_path = s3://some-bucket/dir_1/dir_2
│ └─→ Error: {'success': False, 'message': 'All adopted URLs must start with the given root path', 'code': 'invalid_root_path'}
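To make the mapping concrete, here is a small local sketch that mimics the prefix stripping shown above; it is only an illustration, not an API call:

def import_path(url: str, root_path: str) -> str:
    # Mimic how root_path maps an adopted URL to its import path.
    prefix = root_path.rstrip("/") + "/"
    if not url.startswith(prefix):
        raise ValueError("All adopted URLs must start with the given root path")
    return url[len(prefix):]

print(import_path("s3://some-bucket/dir_1/dir_2/file_1.txt", "s3://some-bucket/dir_1"))
# -> dir_2/file_1.txt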
API Responses
Success
File imported successfully:
{
"ok": true,
"created": {
"s3://my-bucket/my-file.txt": "017a515f-30a4-d0f1-d37a-53ffc38e90c7"
}
}
What to do: Save the datum ID (017a515f-...) to use as datum://017a515f-... in your pipelines.
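For example, this small sketch turns the created map from a successful adopt response (as in the single-file example above) into ready-to-use datum:// URLs:

result = response.json()
datum_urls = {
    source_url: f"datum://{datum_id}"
    for source_url, datum_id in result["created"].items()
}
print(datum_urls)
# {'s3://my-bucket/my-file.txt': 'datum://017a515f-30a4-d0f1-d37a-53ffc38e90c7'}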
Already Exists
File was previously imported:
{
"message": "s3://my-bucket/my-file.txt already exists in my-bucket",
"code": "adopt_already_exists"
}
What to do:
The file is already tracked in Valohai
Find it in the Data → Browse tab
No action needed unless you want to update metadata
Not Found
File doesn't exist in cloud storage:
{
"message": "Not found in my-bucket: 's3://my-bucket/my-file.txt'",
"code": "adoptable_file_not_found"
}
What to do:
Verify the file URL is correct (check spelling, path, bucket name)
Ensure the file exists in your cloud storage
Check that Valohai has access to the bucket/container
Verify you selected the correct data store
Common Issues & Fixes
File Not Found Error
Symptom: adoptable_file_not_found error during import
Causes & Fixes:
Typo in file URL → Double-check bucket name, path, and filename (case-sensitive)
File doesn't exist → Verify file exists in your cloud storage console
Wrong data store selected → Ensure you selected the correct destination store
Wrong cloud region → Check that the data store is configured for the correct region
File in different bucket → Verify the bucket name matches your data store configuration
Permission Denied
Symptom: Import fails with access or permission error
Causes & Fixes:
Data store credentials invalid → Verify data store configuration and credentials
Bucket policy blocks access → Check cloud storage IAM/permissions allow Valohai to read files
File is private/encrypted → Ensure Valohai's service account has read access
Cross-region access issues → Verify data store configuration matches file location
Already Exists
Symptom: File shows as already existing in Valohai
Causes & Fixes:
File was previously imported → Find it in Data → Browse tab (not an error)
Trying to import duplicate → Use the existing datum ID instead of re-importing
To find existing datum:
Go to Data → Browse
Search by filename
Copy the datum ID
Bulk Import Interrupted
Symptom: Web UI import stops partway through large file list
Causes & Fixes:
Browser timeout → Use API with retry logic for bulk imports (>50 files)
Network interruption → Import in smaller batches via UI
Too many files → Use the API script with error handling (see above), or split the import into batches as sketched below
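Building on the adopt_files helper above, one way to split a large import into smaller batches (the batch size of 50 is only an example):

def adopt_in_batches(urls, root_path, store_id, project_id, token, batch_size=50):
    # Adopt files in fixed-size batches so one failure does not abort the whole import.
    results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        outcome = adopt_files(batch, root_path, store_id, project_id, token)
        results.append(outcome)
        if not outcome["success"]:
            print(f"Batch starting at index {start} failed: {outcome['error']}")
    return results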
Best Practices
Organize Before Importing
Plan your import strategy:
Group related files
Use consistent naming
Document file sources
Tag immediately after import
Use API for Bulk Operations
Web UI: good for fewer than ~50 files
API: recommended for 50+ files or automated imports
Add Metadata Immediately
After importing, add context:
metadata = {
    "valohai.tags": ["imported-2024-01", "legacy-data"],
    "valohai.alias": "training-data-v1",
    "source": "s3-legacy-bucket",
    "import_date": "2024-01-15",
    "original_owner": "data-team"
}
Create Aliases for Key Files
Make frequently-used imports easy to reference:
legacy-training-data → datum://abc123...
baseline-model → datum://def456...
validation-set-fixed → datum://ghi789...
Verify After Import
Check that files are accessible:
Find imported file in Data → Browse
Create a test execution using the file as input
Verify the file downloads and opens correctly (a scripted check is sketched below)
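As a scripted alternative to clicking through the UI, the sketch below lists datums in the project and checks for the imported filename. It assumes the datum listing endpoint /api/v0/data/ with a project filter and a paginated results payload; adjust to your API version if needed:

import os
import requests

project_id = "your-project-id-here"
expected_name = "training-data-v1.csv"  # filename you expect to find

# Assumption: GET /api/v0/data/ lists datums and accepts a project filter.
response = requests.get(
    "https://app.valohai.com/api/v0/data/",
    params={"project": project_id, "limit": 100},
    headers={"Authorization": "Token " + os.getenv("VH_TOKEN")},
)
response.raise_for_status()
names = [datum.get("name") for datum in response.json().get("results", [])]
print("Found!" if expected_name in names else "Not found; check the import.")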
Related Pages
Load Data in Jobs — Use imported files as execution inputs
Add Context to Your Data Files — Tag and organize imported files
Configure Data Stores — Set up cloud storage access
Upload Files via Web UI — Alternative for small files
Next Steps
Import a test file and verify it appears in the Data tab
Add tags and aliases to your imported files
Create an execution using an imported file as input
Set up automated imports using the API for new files