# Azure Files

Mount an Azure Files file share to access shared network storage directly from Valohai executions.

***

### Overview

Azure Files provides managed SMB storage that you can mount in Valohai executions. Use Azure Files to:

* Access large datasets without downloading
* Share preprocessed data across multiple executions
* Cache intermediate results on fast shared storage
* Process data in place and save versioned outputs

> ⚠️ **Important:** Files in file share mounts are NOT versioned by Valohai. Always save final results to `/valohai/outputs/` for reproducibility.

***

### Prerequisites <a href="#prerequisites" id="prerequisites"></a>

Before mounting Azure Files storage in Valohai:

1. **Existing Azure Files file share** — Use an existing file share or create a new one in your Azure storage account
2. **Network access** — Make sure the Azure Files storage is accessible from your worker instances
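Azure Files serves SMB traffic over TCP port 445, so a quick connectivity probe from a worker can confirm the network-access prerequisite before you configure any mounts. A minimal sketch; the host name below is a placeholder for your own storage account endpoint:

```python
import socket

def can_reach_smb(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the SMB endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoint; substitute your own storage account name
if not can_reach_smb("storage-account-name.file.core.windows.net"):
    print("Port 445 is blocked or the endpoint is unreachable from this network")
```

Many corporate and ISP networks block outbound port 445, which is a common reason an otherwise correct mount configuration fails.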

***

### Setup: Configure Azure Files Access

#### Step 1: Get the Azure Files storage information

1. Navigate to the **Azure storage account** that contains the file share
2. On the file share's **Overview** page, click **Connect** and open the **Linux** tab
3. Take note of the following values:
   1. File share connection string, e.g. `//storage-account-name.file.core.windows.net/file-share-name`
   2. `username` — should match the storage account name
   3. `password` — a storage account access key

#### Step 2: Store the file share information as environment variables

1. In Valohai, navigate to your project's **Settings → Environment Variables**
2. Add the connection string, `username`, and `password` as environment variables (the examples below use the names `FILE_SHARE`, `USERNAME`, and `PASSWORD`). Make sure to mark at least the `password` as a secret!

<figure><img src="https://4109720758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Ff3mjTRQNkASbnMbJqzJ2%2Fuploads%2FezTZyFBHevtgHNxTjSHw%2Fimage.png?alt=media&#x26;token=1dbd86b9-0612-4a5f-9cbd-49d6faaa8538" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
You can also save the values into [an environment variable group](https://docs.valohai.com/user-and-organization-management/getting-started/environment-variables) to easily share them between several projects in Valohai.
{% endhint %}

***

### Mount an Azure Files File Share in an Execution <a href="#mount-filestore-in-execution" id="mount-filestore-in-execution"></a>

#### **Mount Configuration**

**valohai.yaml:**

```yaml
- step:
    name: process-with-fileshare
    image: python:3.9
    command:
      - python process_data.py
    mounts:
      - destination: /mnt/fileshare-data
        source: ${FILE_SHARE}
        type: smb
        options:
          username: ${USERNAME}
          password: ${PASSWORD}
        readonly: true
```

**Parameters:**

* `destination` — Mount point inside container (e.g., `/mnt/fileshare-data`)
* `source` — File share connection string (format: `//storage-account-name.file.core.windows.net/file-share-name`), passed here from the `FILE_SHARE` environment variable
* `type` — Depends on your file share type, either `smb` or `cifs`
* `username` — Username for the file share, passed here from the `USERNAME` environment variable
* `password` — Password for the file share, passed here from the `PASSWORD` environment variable
* `readonly` — `true` (recommended) or `false`

You can hardcode the connection string, username, and password, but this is not recommended for security reasons, especially for the password.

***

### Complete Workflow Example

#### Mount → Process → Save Pattern

**Scenario:** Process large video dataset stored on file share, extract features, save to Valohai outputs.

**valohai.yaml:**

```yaml
- step:
    name: extract-video-features
    image: python:3.9
    command:
      - pip install opencv-python numpy pandas
      - python extract_features.py
    mounts:
      - destination: /mnt/raw-videos
        source: ${FILE_SHARE}
        type: smb
        options:
          username: ${USERNAME}
          password: ${PASSWORD}
        readonly: true
    environment-variables:
      - name: SAMPLE_RATE
        default: "1"  # Process every Nth frame
```

**extract\_features.py:**

```python
import os
import cv2
import numpy as np
import pandas as pd
import json

# Configuration
FILESHARE_PATH = '/mnt/raw-videos/'
OUTPUT_PATH = '/valohai/outputs/'
SAMPLE_RATE = int(os.getenv('SAMPLE_RATE', '1'))

# 1. Scan file share for videos (NOT versioned)
print(f"Scanning file share mount: {FILESHARE_PATH}")
video_files = [f for f in os.listdir(FILESHARE_PATH) if f.endswith(('.mp4', '.avi'))]
print(f"Found {len(video_files)} videos on file share")

# 2. Extract features from videos
features_list = []

for i, filename in enumerate(video_files):
    video_path = os.path.join(FILESHARE_PATH, filename)

    # Read video from file share
    cap = cv2.VideoCapture(video_path)

    frame_count = 0
    video_features = []

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Sample frames based on SAMPLE_RATE
        if frame_count % SAMPLE_RATE == 0:
            # Extract features (e.g., histogram)
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
            hist = hist.flatten()
            video_features.append(hist)

        frame_count += 1

    cap.release()

    # Aggregate features for this video
    if video_features:
        avg_features = np.mean(video_features, axis=0)

        features_list.append({
            'video_filename': filename,
            'frame_count': frame_count,
            'sampled_frames': len(video_features),
            'features': avg_features.tolist()
        })

    if (i + 1) % 10 == 0:
        print(f"Processed {i + 1}/{len(video_files)} videos...")

# 3. Save features to Valohai outputs (VERSIONED ✅)
print(f"\nSaving {len(features_list)} video features...")

# Save as CSV for analysis
df = pd.DataFrame([
    {
        'video_filename': item['video_filename'],
        'frame_count': item['frame_count'],
        'sampled_frames': item['sampled_frames']
    }
    for item in features_list
])
df.to_csv(os.path.join(OUTPUT_PATH, 'video_metadata.csv'), index=False)

# Save features as numpy array
features_array = np.array([item['features'] for item in features_list])
np.save(os.path.join(OUTPUT_PATH, 'video_features.npy'), features_array)

# 4. Create dataset version
dataset_metadata = {
    "video_metadata.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }],
        "valohai.tags": ["video-analysis", "features", "extracted"]
    },
    "video_features.npy": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }]
    }
}

metadata_path = os.path.join(OUTPUT_PATH, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_meta in dataset_metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

print(f"\nFeature extraction complete:")
print(f"  - Processed {len(features_list)} videos")
print(f"  - Extracted features shape: {features_array.shape}")
print(f"  - Created dataset: dataset://video-features/batch-001")
```

**Result:**

* ✅ Raw videos accessed from file share (no download time for 100GB+ videos)
* ✅ Extracted features saved to `/valohai/outputs/` (versioned)
* ✅ Dataset created for reproducible model training
* ✅ Can train on `dataset://video-features/batch-001` anytime
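A follow-up training step can then consume the versioned dataset through Valohai's inputs instead of the mount. A sketch, assuming the dataset created above; the step and input names are illustrative:

```yaml
- step:
    name: train-on-features
    image: python:3.9
    command:
      - python train.py
    inputs:
      - name: features
        default: dataset://video-features/batch-001
```

Inputs are downloaded to `/valohai/inputs/<input-name>/`, so `train.py` reads the exact snapshot regardless of later changes on the file share.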

***

### Best Practices

#### Use Readonly for Input Data

```yaml
# ✅ Good: Readonly prevents accidental modifications
mounts:
  - destination: /mnt/training-data
    readonly: true
```

```yaml
# ⚠️ Avoid: Writeable unless necessary
mounts:
  - destination: /mnt/training-data
    readonly: false
```
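When a mount is supposed to be read-only, a runtime probe can catch a misconfigured mount early instead of letting a job silently modify shared data. A minimal sketch; the mount path is illustrative:

```python
import os
import uuid

def is_writable(path: str) -> bool:
    """Try to create and delete a throwaway file under `path`."""
    probe = os.path.join(path, f".write-probe-{uuid.uuid4().hex}")
    try:
        with open(probe, "w") as f:
            f.write("probe")
        os.remove(probe)
        return True
    except OSError:
        return False

# Illustrative path: warn if a supposedly read-only mount accepts writes
if is_writable("/mnt/training-data"):
    print("WARNING: /mnt/training-data is writable; expected readonly: true")
```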

***

#### Always Version Final Results

```python
# ❌ Bad: Only use file share, nothing versioned
results = process_data('/mnt/fileshare-data/')
results.save('/mnt/fileshare-output/results.pkl')  # NOT versioned

# ✅ Good: File share for input, outputs for results
results = process_data('/mnt/fileshare-data/')
results.save('/valohai/outputs/results.pkl')  # VERSIONED
```

***

#### Organize Your File Share Structure

```
/share1/
├── raw-data/
│   ├── images/
│   ├── videos/
│   └── text/
├── cache/
│   └── preprocessing/
├── team-workspace/
│   ├── experiments/
│   └── notebooks/
└── temp/
```

Clear organization makes mounting and access control easier.
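With a layout like this, a step can often mount just the subtree it needs rather than the whole share, since SMB/CIFS sources generally accept a subdirectory path. A sketch under that assumption, with illustrative paths:

```yaml
mounts:
  - destination: /mnt/raw-videos
    source: ${FILE_SHARE}/raw-data/videos
    type: smb
    options:
      username: ${USERNAME}
      password: ${PASSWORD}
    readonly: true
```

If your environment rejects subdirectory sources, mount the share root and point your code at the subdirectory instead.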

***

#### Handle Mount Errors

```python
import os
import sys

FILESHARE_PATH = '/mnt/fileshare-data/'

# Verify mount is accessible
if not os.path.exists(FILESHARE_PATH):
    print(f"ERROR: File share mount {FILESHARE_PATH} not accessible")
    print("Possible causes:")
    print("  - Wrong connection string in mount configuration")
    print("  - Wrong password or username in mount configuration")
    print("  - Network connectivity issue")
    sys.exit(1)

# Verify expected data exists
expected_dir = os.path.join(FILESHARE_PATH, 'datasets')
if not os.path.exists(expected_dir):
    print(f"WARNING: Expected directory not found: {expected_dir}")
    print(f"Available: {os.listdir(FILESHARE_PATH)}")

print(f"File share mount verified: {FILESHARE_PATH}")
```

***

### Maintaining Reproducibility

> ⚠️ **Critical:** Azure Files file share data can change between executions. Always save processed results to `/valohai/outputs/` for versioning.

**The problem:**

```python
# Today: Process data from file share
data = load_from_fileshare('/mnt/fileshare-data/')
model = train(data)

# Next week: Someone updates file share data
# Retraining gives different results
# Can't reproduce original model
```

**The solution:**

```python
# Load from file share (current state)
data = load_from_fileshare('/mnt/fileshare-data/')

# Save snapshot to versioned outputs
data.to_csv('/valohai/outputs/training_snapshot.csv')

# Create dataset version
metadata = {
    "training_snapshot.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://training-data/2024-01-15"
        }]
    }
}

# Train on versioned snapshot in next execution
# Can reproduce exactly anytime
```

**See:** [Access Network Storage](https://docs.valohai.com/data/data-nfs) for complete patterns.

***

### Related Pages

* [Access Network Storage](https://docs.valohai.com/data/data-nfs) — Overview and when to use NFS
* [AWS Elastic File System](https://docs.valohai.com/data/data-nfs/aws-efs) — AWS equivalent
* [Google Cloud Filestore](https://docs.valohai.com/data/data-nfs/google-filestore) — GCP equivalent
* [On-Premises NFS](https://docs.valohai.com/data/data-nfs/onprem-nfs) — Mount on-prem storage
* [Load Data in Jobs](https://docs.valohai.com/data/data-versioning/load-files-in-jobs) — Alternative: Valohai's versioned inputs

***

### Next Steps

* Set up an Azure Files file share in your Azure resource group (or use an existing one)
* Get file share information for connecting to it
* Create test execution mounting the file share
* Build [pipeline](https://docs.valohai.com/pipelines): mount → process → [save to outputs](https://docs.valohai.com/data/data-versioning/save-files-from-jobs)
* Monitor file share performance in Azure console
