# Azure Files

Mount an Azure Files file share to access shared network storage directly from Valohai executions.

***

### Overview

Azure Files provides managed SMB storage that you can mount in Valohai executions. Use Azure Files to:

* Access large datasets without downloading
* Share preprocessed data across multiple executions
* Cache intermediate results on fast shared storage
* Process data in place and save versioned outputs

> ⚠️ **Important:** Files in file share mounts are NOT versioned by Valohai. Always save final results to `/valohai/outputs/` for reproducibility.

***

### Prerequisites <a href="#prerequisites" id="prerequisites"></a>

Before mounting Azure Files storage in Valohai:

1. **Existing Azure Files share** — Use an existing file share or create a new one in Azure
2. **Network access** — Make sure the Azure Files share is reachable from your worker instances (SMB traffic uses TCP port 445)
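
A quick TCP reachability check of the share's SMB endpoint can confirm network access from a worker before you configure the mount. This is a minimal sketch; the hostname in the comment is a placeholder for your own storage account:

```python
import socket

def check_smb_reachable(host: str, port: int = 445, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the SMB endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, timeout, and connection refused
        return False

# Example (placeholder hostname):
# check_smb_reachable("storage-account-name.file.core.windows.net")
```

If this returns `False`, check your virtual network, firewall, and private endpoint settings in Azure before debugging the mount itself.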

***

### Setup: Configure Azure Files Access

#### Step 1: Get the Azure Files storage information

1. Navigate to your **Azure storage account** that contains the file share
2. On the **Overview** page for the file share, click **Connect** and open the **Linux** tab.
3. Take note of the following values:
   1. File share connection string, e.g. `//storage-account-name.file.core.windows.net/file-share-name`
   2. `username`, should match the storage account name
   3. `password`
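
For illustration, the connection string can be split into its host and share-name parts with a small helper (hypothetical, not a Valohai API). Note how the first label of the hostname is the storage account name, which also serves as the username:

```python
def parse_share_path(connection_string: str) -> dict:
    """Split a '//host/share' connection string into its parts."""
    host, _, share = connection_string.lstrip("/").partition("/")
    return {
        "host": host,                   # e.g. storage-account-name.file.core.windows.net
        "share": share,                 # e.g. file-share-name
        "account": host.split(".")[0],  # storage account name == username
    }

parts = parse_share_path("//storage-account-name.file.core.windows.net/file-share-name")
# parts["account"] == "storage-account-name"
```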

#### Step 2: Store the file share information as environment variables

1. In Valohai, go to your project's **Settings → Environment Variables**
2. Add the connection string, `username`, and `password` as environment variables. Make sure to mark at least the `password` as a secret!

<figure><img src="https://4109720758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Ff3mjTRQNkASbnMbJqzJ2%2Fuploads%2FezTZyFBHevtgHNxTjSHw%2Fimage.png?alt=media&#x26;token=1dbd86b9-0612-4a5f-9cbd-49d6faaa8538" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
You can also save the values into [an environment variable group](https://docs.valohai.com/user-and-organization-management/getting-started/environment-variables) to easily share them between several projects in Valohai.
{% endhint %}
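
Environment variables are also visible inside the execution, so a script can sanity-check that they were set before relying on the mount. This is a minimal sketch; `FILE_SHARE`, `USERNAME`, and `PASSWORD` are the variable names used in the examples on this page:

```python
import os

def missing_env_vars(names):
    """Return the subset of `names` that is not set in the environment."""
    return [n for n in names if not os.getenv(n)]

# These names must match those configured under Settings → Environment Variables.
missing = missing_env_vars(["FILE_SHARE", "USERNAME", "PASSWORD"])
if missing:
    print(f"WARNING: not set in this environment: {', '.join(missing)}")
```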

***

### Mount the Azure Files Share in an Execution <a href="#mount-filestore-in-execution" id="mount-filestore-in-execution"></a>

#### **Mount Configuration**

**valohai.yaml:**

```yaml
- step:
    name: process-with-fileshare
    image: python:3.9
    command:
      - python process_data.py
    mounts:
      - destination: /mnt/fileshare-data
        source: ${FILE_SHARE}
        type: smb
        options:
          username: ${USERNAME}
          password: ${PASSWORD}
        readonly: true
```

**Parameters:**

* `destination` — Mount point inside container (e.g., `/mnt/fileshare-data`)
* `source` — File share connection string (format: `//storage-account-name.file.core.windows.net/file-share-name`), passed here from the `FILE_SHARE` environment variable
* `type` — Depends on your file share type, either `smb` or `cifs`
* `username` — Username for the file share, passed here from the `USERNAME` environment variable
* `password` — Password for the file share, passed here from the `PASSWORD` environment variable
* `readonly` — `true` (recommended) or `false`

You can hardcode the connection string, username, and password, but this is not recommended for security reasons, especially for the password.

***

### Complete Workflow Example

#### Mount → Process → Save Pattern

**Scenario:** Process large video dataset stored on file share, extract features, save to Valohai outputs.

**valohai.yaml:**

```yaml
- step:
    name: extract-video-features
    image: python:3.9
    command:
      - pip install opencv-python numpy pandas
      - python extract_features.py
    mounts:
      - destination: /mnt/raw-videos
        source: ${FILE_SHARE}
        type: smb
        options:
          username: ${USERNAME}
          password: ${PASSWORD}
        readonly: true
    environment-variables:
      - name: SAMPLE_RATE
        default: "1"  # Process every Nth frame
```

**extract\_features.py:**

```python
import os
import cv2
import numpy as np
import pandas as pd
import json

# Configuration
FILESHARE_PATH = '/mnt/raw-videos/'
OUTPUT_PATH = '/valohai/outputs/'
SAMPLE_RATE = int(os.getenv('SAMPLE_RATE', '1'))

# 1. Scan file share for videos (NOT versioned)
print(f"Scanning file share mount: {FILESHARE_PATH}")
video_files = [f for f in os.listdir(FILESHARE_PATH) if f.endswith(('.mp4', '.avi'))]
print(f"Found {len(video_files)} videos on file share")

# 2. Extract features from videos
features_list = []

for i, filename in enumerate(video_files):
    video_path = os.path.join(FILESHARE_PATH, filename)

    # Read video from file share
    cap = cv2.VideoCapture(video_path)

    frame_count = 0
    video_features = []

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Sample frames based on SAMPLE_RATE
        if frame_count % SAMPLE_RATE == 0:
            # Extract features (e.g., histogram)
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
            hist = hist.flatten()
            video_features.append(hist)

        frame_count += 1

    cap.release()

    # Aggregate features for this video
    if video_features:
        avg_features = np.mean(video_features, axis=0)

        features_list.append({
            'video_filename': filename,
            'frame_count': frame_count,
            'sampled_frames': len(video_features),
            'features': avg_features.tolist()
        })

    if (i + 1) % 10 == 0:
        print(f"Processed {i + 1}/{len(video_files)} videos...")

# 3. Save features to Valohai outputs (VERSIONED ✅)
print(f"\nSaving {len(features_list)} video features...")

# Save as CSV for analysis
df = pd.DataFrame([
    {
        'video_filename': item['video_filename'],
        'frame_count': item['frame_count'],
        'sampled_frames': item['sampled_frames']
    }
    for item in features_list
])
df.to_csv(os.path.join(OUTPUT_PATH, 'video_metadata.csv'), index=False)

# Save features as numpy array
features_array = np.array([item['features'] for item in features_list])
np.save(os.path.join(OUTPUT_PATH, 'video_features.npy'), features_array)

# 4. Create dataset version
dataset_metadata = {
    "video_metadata.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }],
        "valohai.tags": ["video-analysis", "features", "extracted"]
    },
    "video_features.npy": {
        "valohai.dataset-versions": [{
            "uri": "dataset://video-features/batch-001"
        }]
    }
}

metadata_path = os.path.join(OUTPUT_PATH, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_meta in dataset_metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

print(f"\nFeature extraction complete:")
print(f"  - Processed {len(features_list)} videos")
print(f"  - Extracted features shape: {features_array.shape}")
print(f"  - Created dataset: dataset://video-features/batch-001")
```

**Result:**

* ✅ Raw videos accessed from file share (no download time for 100GB+ videos)
* ✅ Extracted features saved to `/valohai/outputs/` (versioned)
* ✅ Dataset created for reproducible model training
* ✅ Can train on `dataset://video-features/batch-001` anytime

***

### Best Practices

#### Use Readonly for Input Data

```yaml
# ✅ Good: Readonly prevents accidental modifications
mounts:
  - destination: /mnt/training-data
    readonly: true
```

```yaml
# ⚠️ Avoid: Writable mounts unless necessary
mounts:
  - destination: /mnt/training-data
    readonly: false
```

***

#### Always Version Final Results

```python
# ❌ Bad: Only use file share, nothing versioned
results = process_data('/mnt/fileshare-data/')
results.save('/mnt/fileshare-output/results.pkl')  # NOT versioned

# ✅ Good: File share for input, outputs for results
results = process_data('/mnt/fileshare-data/')
results.save('/valohai/outputs/results.pkl')  # VERSIONED
```

***

#### Organize Your File Share Structure

```
/share1/
├── raw-data/
│   ├── images/
│   ├── videos/
│   └── text/
├── cache/
│   └── preprocessing/
├── team-workspace/
│   ├── experiments/
│   └── notebooks/
└── temp/
```

Clear organization makes mounting and access control easier.
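
If you want to verify the layout from a script, a small helper (hypothetical) lists the first-level directories of a mounted share:

```python
import os

def top_level_dirs(mount_root: str) -> list:
    """Return the sorted first-level directory names under a mounted share."""
    return sorted(
        entry for entry in os.listdir(mount_root)
        if os.path.isdir(os.path.join(mount_root, entry))
    )

# e.g. top_level_dirs("/mnt/fileshare-data") for the structure above would
# include "raw-data", "cache", "team-workspace", and "temp"
```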

***

#### Handle Mount Errors

```python
import os
import sys

FILESHARE_PATH = '/mnt/fileshare-data/'

# Verify mount is accessible
if not os.path.exists(FILESHARE_PATH):
    print(f"ERROR: File share mount {FILESHARE_PATH} not accessible")
    print("Possible causes:")
    print("  - Wrong connection string in mount configuration")
    print("  - Wrong password or username in mount configuration")
    print("  - Network connectivity issue")
    sys.exit(1)

# Verify expected data exists
expected_dir = os.path.join(FILESHARE_PATH, 'datasets')
if not os.path.exists(expected_dir):
    print(f"WARNING: Expected directory not found: {expected_dir}")
    print(f"Available: {os.listdir(FILESHARE_PATH)}")

print(f"File share mount verified: {FILESHARE_PATH}")
```

***

### Maintaining Reproducibility

> ⚠️ **Critical:** Azure Files file share data can change between executions. Always save processed results to `/valohai/outputs/` for versioning.

**The problem:**

```python
# Today: Process data from file share
data = load_from_fileshare('/mnt/fileshare-data/')
model = train(data)

# Next week: Someone updates file share data
# Retraining gives different results
# Can't reproduce original model
```

**The solution:**

```python
import json

# Load from file share (current state)
data = load_from_fileshare('/mnt/fileshare-data/')

# Save snapshot to versioned outputs
data.to_csv('/valohai/outputs/training_snapshot.csv')

# Create dataset version
metadata = {
    "training_snapshot.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://training-data/2024-01-15"
        }]
    }
}
with open('/valohai/outputs/valohai.metadata.jsonl', 'w') as f:
    for filename, file_meta in metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

# Train on the versioned snapshot in the next execution:
# the exact same data can be reproduced anytime
```

**See:** [Access Network Storage](https://docs.valohai.com/data/data-nfs) for complete patterns.

***

### Related Pages

* [Access Network Storage](https://docs.valohai.com/data/data-nfs) — Overview and when to use NFS
* [AWS Elastic File System](https://docs.valohai.com/data/data-nfs/aws-efs) — AWS equivalent
* [Google Cloud Filestore](https://docs.valohai.com/data/data-nfs/google-filestore) — GCP equivalent
* [On-Premises NFS](https://docs.valohai.com/data/data-nfs/onprem-nfs) — Mount on-prem storage
* [Load Data in Jobs](https://docs.valohai.com/data/data-versioning/load-files-in-jobs) — Alternative: Valohai's versioned inputs

***

### Next Steps

* Set up Azure Files file share in your Azure resource group (or use existing)
* Get file share information for connecting to it
* Create test execution mounting the file share
* Build [pipeline](https://docs.valohai.com/pipelines): mount → process → [save to outputs](https://docs.valohai.com/data/data-versioning/save-files-from-jobs)
* Monitor file share performance in Azure console


