# Azure Files

Mount an Azure Files file share to access shared network storage directly from Valohai executions.

***

### Overview

Azure Files provides managed SMB storage that you can mount in Valohai executions. Use Azure Files to:

* Access large datasets without downloading
* Share preprocessed data across multiple executions
* Cache intermediate results on fast shared storage
* Process data in place and save versioned outputs

> ⚠️ **Important:** Files in file share mounts are NOT versioned by Valohai. Always save final results to `/valohai/outputs/` for reproducibility.

***

### Prerequisites <a href="#prerequisites" id="prerequisites"></a>

Before mounting Azure Files storage in Valohai:

1. **Existing Azure Files file share** — Use an existing file share or create a new one in your Azure storage account
2. **Network access** — Make sure the Azure Files storage is accessible from your worker instances
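Azure Files serves SMB traffic over TCP port 445, so a quick connectivity probe from a worker can confirm the network-access prerequisite before you configure any mounts. A minimal sketch; the host name below is a placeholder for your own storage account endpoint:

```python
import socket

def can_reach_smb(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the SMB endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoint; substitute your own storage account name
if not can_reach_smb("storage-account-name.file.core.windows.net"):
    print("Port 445 is blocked or the endpoint is unreachable from this network")
```

Many corporate and ISP networks block outbound port 445, which is a common reason an otherwise correct mount configuration fails.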

***

### Setup: Configure Azure Files Access

#### Step 1: Get the Azure Files storage information

1. Navigate to the **Azure storage account** that contains the file share
2. On the file share's **Overview** page, click **Connect** and open the **Linux** tab
3. Take note of the following values:
   1. File share connection string, e.g. `//storage-account-name.file.core.windows.net/file-share-name`
   2. `username` — should match the storage account name
   3. `password` — a storage account access key

#### Step 2: Store the file share information as environment variables

1. In Valohai, navigate to your project's **Settings → Environment Variables**
2. Add the connection string, `username`, and `password` as environment variables (the examples below use the names `FILE_SHARE`, `USERNAME`, and `PASSWORD`). Make sure to mark at least the `password` as a secret!

<figure><img src="https://4109720758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Ff3mjTRQNkASbnMbJqzJ2%2Fuploads%2FezTZyFBHevtgHNxTjSHw%2Fimage.png?alt=media&#x26;token=1dbd86b9-0612-4a5f-9cbd-49d6faaa8538" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
You can also save the values into [an environment variable group](https://docs.valohai.com/user-and-organization-management/getting-started/environment-variables) to easily share them between several projects in Valohai.
{% endhint %}

***

### Mount an Azure Files File Share in an Execution <a href="#mount-filestore-in-execution" id="mount-filestore-in-execution"></a>

#### **Mount Configuration**

**valohai.yaml:**

```yaml
- step:
    name: process-with-fileshare
    image: python:3.9
    command:
      - python process_data.py
    mounts:
      - destination: /mnt/fileshare-data
        source: ${FILE_SHARE}
        type: smb
        options:
          username: ${USERNAME}
          password: ${PASSWORD}
        readonly: true
```

**Parameters:**

* `destination` — Mount point inside container (e.g., `/mnt/fileshare-data`)
* `source` — File share connection string (format: `//storage-account-name.file.core.windows.net/file-share-name`), passed here from the `FILE_SHARE` environment variable
* `type` — Depends on your file share type, either `smb` or `cifs`
* `username` — Username for the file share, passed here from the `USERNAME` environment variable
* `password` — Password for the file share, passed here from the `PASSWORD` environment variable
* `readonly` — `true` (recommended) or `false`

You can hardcode the connection string, username, and password, but this is not recommended for security reasons, especially for the password.

***

### Complete Workflow Example

#### Mount → Process → Save Pattern

**Scenario:** Process large video dataset stored on file share, extract features, save to Valohai outputs.

**valohai.yaml:**

```yaml
- step:
    name: extract-video-features
    image: python:3.9
    command:
      - pip install opencv-python numpy pandas
      - python extract_features.py
    mounts:
      - destination: /mnt/raw-videos
        source: ${FILE_SHARE}
        type: smb
        options:
          username: ${USERNAME}
          password: ${PASSWORD}
        readonly: true
    environment-variables:
      - name: SAMPLE_RATE
        default: "1"  # Process every Nth frame
```

**extract\_features.py:**

```python
import os
import cv2
import numpy as np
import pandas as pd
import json

# Configuration
FILESHARE_PATH = '/mnt/raw-videos/'
OUTPUT_PATH = '/valohai/outputs/'
SAMPLE_RATE = int(os.getenv('SAMPLE_RATE', '1'))

# 1. Scan file share for videos (NOT versioned)
print(f"Scanning file share mount: {FILESHARE_PATH}")
video_files = [f for f in os.listdir(FILESHARE_PATH) if f.endswith(('.mp4', '.avi'))]
print(f"Found {len(video_files)} videos on file share")

# 2. Extract features from videos
features_list = []

for i, filename in enumerate(video_files):
    video_path = os.path.join(FILESHARE_PATH, filename)

    # Read video from file share
    cap = cv2.VideoCapture(video_path)

    frame_count = 0
    video_features = []

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Sample frames based on SAMPLE_RATE
        if frame_count % SAMPLE_RATE == 0:
            # Extract features (e.g., histogram)
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
            hist = hist.flatten()
            video_features.append(hist)

        frame_count += 1

    cap.release()

    # Aggregate features for this video
    if video_features:
        avg_features = np.mean(video_features, axis=0)

        features_list.append({
            'video_filename': filename,
            'frame_count': frame_count,
            'sampled_frames': len(video_features),
            'features': avg_features.tolist()
        })

    if (i + 1) % 10 == 0:
        print(f"Processed {i + 1}/{len(video_files)} videos...")

# 3. Save features to Valohai outputs (VERSIONED ✅)
print(f"\nSaving {len(features_list)} video features...")

# Save as CSV for analysis
df = pd.DataFrame([
    {
        'video_filename': item['video_filename'],
        'frame_count': item['frame_count'],
        'sampled_frames': item['sampled_frames']
    }
    for item in features_list
])
df.to_csv(os.path.join(OUTPUT_PATH, 'video_metadata.csv'), index=False)

# Save features as numpy array
features_array = np.array([item['features'] for item in features_list])
np.save(os.path.join(OUTPUT_PATH, 'video_features.npy'), features_array)

# 4. Create dataset version
dataset_metadata = {
    "video_metadata.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }],
        "valohai.tags": ["video-analysis", "features", "extracted"]
    },
    "video_features.npy": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }]
    }
}

metadata_path = os.path.join(OUTPUT_PATH, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_meta in dataset_metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

print(f"\nFeature extraction complete:")
print(f"  - Processed {len(features_list)} videos")
print(f"  - Extracted features shape: {features_array.shape}")
print(f"  - Created dataset: dataset://video-features/batch-001")
```

**Result:**

* ✅ Raw videos accessed from file share (no download time for 100GB+ videos)
* ✅ Extracted features saved to `/valohai/outputs/` (versioned)
* ✅ Dataset created for reproducible model training
* ✅ Can train on `dataset://video-features/batch-001` anytime
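A follow-up training step can then consume the versioned dataset through Valohai's inputs instead of the mount. A sketch, assuming the dataset created above; the step and input names are illustrative:

```yaml
- step:
    name: train-on-features
    image: python:3.9
    command:
      - python train.py
    inputs:
      - name: features
        default: dataset://video-features/batch-001
```

Inputs are downloaded to `/valohai/inputs/<input-name>/`, so `train.py` reads the exact snapshot regardless of later changes on the file share.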

***

### Best Practices

#### Use Readonly for Input Data

```yaml
# ✅ Good: Readonly prevents accidental modifications
mounts:
  - destination: /mnt/training-data
    readonly: true
```

```yaml
# ⚠️ Avoid: Writeable unless necessary
mounts:
  - destination: /mnt/training-data
    readonly: false
```
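When a mount is supposed to be read-only, a runtime probe can catch a misconfigured mount early instead of letting a job silently modify shared data. A minimal sketch; the mount path is illustrative:

```python
import os
import uuid

def is_writable(path: str) -> bool:
    """Try to create and delete a throwaway file under `path`."""
    probe = os.path.join(path, f".write-probe-{uuid.uuid4().hex}")
    try:
        with open(probe, "w") as f:
            f.write("probe")
        os.remove(probe)
        return True
    except OSError:
        return False

# Illustrative path: warn if a supposedly read-only mount accepts writes
if is_writable("/mnt/training-data"):
    print("WARNING: /mnt/training-data is writable; expected readonly: true")
```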

***

#### Always Version Final Results

```python
# ❌ Bad: Only use file share, nothing versioned
results = process_data('/mnt/fileshare-data/')
results.save('/mnt/fileshare-output/results.pkl')  # NOT versioned

# ✅ Good: File share for input, outputs for results
results = process_data('/mnt/fileshare-data/')
results.save('/valohai/outputs/results.pkl')  # VERSIONED
```

***

#### Organize Your File Share Structure

```
/share1/
├── raw-data/
│   ├── images/
│   ├── videos/
│   └── text/
├── cache/
│   └── preprocessing/
├── team-workspace/
│   ├── experiments/
│   └── notebooks/
└── temp/
```

Clear organization makes mounting and access control easier.
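With a layout like this, a step can often mount just the subtree it needs rather than the whole share, since SMB/CIFS sources generally accept a subdirectory path. A sketch under that assumption, with illustrative paths:

```yaml
mounts:
  - destination: /mnt/raw-videos
    source: ${FILE_SHARE}/raw-data/videos
    type: smb
    options:
      username: ${USERNAME}
      password: ${PASSWORD}
    readonly: true
```

If your environment rejects subdirectory sources, mount the share root and point your code at the subdirectory instead.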

***

#### Handle Mount Errors

```python
import os
import sys

FILESHARE_PATH = '/mnt/fileshare-data/'

# Verify mount is accessible
if not os.path.exists(FILESHARE_PATH):
    print(f"ERROR: File share mount {FILESHARE_PATH} not accessible")
    print("Possible causes:")
    print("  - Wrong connection string in mount configuration")
    print("  - Wrong password or username in mount configuration")
    print("  - Network connectivity issue")
    sys.exit(1)

# Verify expected data exists
expected_dir = os.path.join(FILESHARE_PATH, 'datasets')
if not os.path.exists(expected_dir):
    print(f"WARNING: Expected directory not found: {expected_dir}")
    print(f"Available: {os.listdir(FILESHARE_PATH)}")

print(f"File share mount verified: {FILESHARE_PATH}")
```

***

### Maintaining Reproducibility

> ⚠️ **Critical:** Azure Files file share data can change between executions. Always save processed results to `/valohai/outputs/` for versioning.

**The problem:**

```python
# Today: Process data from file share
data = load_from_fileshare('/mnt/fileshare-data/')
model = train(data)

# Next week: Someone updates file share data
# Retraining gives different results
# Can't reproduce original model
```

**The solution:**

```python
# Load from file share (current state)
data = load_from_fileshare('/mnt/fileshare-data/')

# Save snapshot to versioned outputs
data.to_csv('/valohai/outputs/training_snapshot.csv')

# Create dataset version
metadata = {
    "training_snapshot.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://training-data/2024-01-15"
        }]
    }
}

# Train on versioned snapshot in next execution
# Can reproduce exactly anytime
```

**See:** [Access Network Storage](https://docs.valohai.com/data/data-nfs) for complete patterns.

***

### Related Pages

* [Access Network Storage](https://docs.valohai.com/data/data-nfs) — Overview and when to use NFS
* [AWS Elastic File System](https://docs.valohai.com/data/data-nfs/aws-efs) — AWS equivalent
* [Google Cloud Filestore](https://docs.valohai.com/data/data-nfs/google-filestore) — GCP equivalent
* [On-Premises NFS](https://docs.valohai.com/data/data-nfs/onprem-nfs) — Mount on-prem storage
* [Load Data in Jobs](https://docs.valohai.com/data/data-versioning/load-files-in-jobs) — Alternative: Valohai's versioned inputs

***

### Next Steps

* Set up an Azure Files file share in your Azure resource group (or use an existing one)
* Get file share information for connecting to it
* Create test execution mounting the file share
* Build [pipeline](https://docs.valohai.com/pipelines): mount → process → [save to outputs](https://docs.valohai.com/data/data-versioning/save-files-from-jobs)
* Monitor file share performance in Azure console
