QuartumSE Data Storage Conventions

This document describes the data directory layout and when to use each directory. Unless you override data_dir, ShadowEstimator writes manifests and Parquet files under ./data.

Directory Overview

QuartumSE/
├── data/                  # Production experiment data
├── validation_data/       # Phase 1 validation & smoke tests
├── demo_data/             # Notebook demos & tutorials
├── notebook_data/         # Interactive notebook experimentation
└── experiments/validation/archived_runs/  # Archived validation results

When to Use Each Directory

`data/` - Production Experiments

Use for:
- Final, production-quality experiments
- Data intended for publication or reports
- Long-term archival
- Experiments in workstreams C/O/B/M (Chemistry, Optimization, Benchmarking, Metrology)

Example:

estimator = ShadowEstimator(backend="ibm:ibm_torino", data_dir="data")

Retention: Keep indefinitely, archive carefully

`validation_data/` - Phase 1 Validation & Testing

Use for:
- Hardware validation experiments (S-T01, S-T02, etc.)
- Smoke tests on IBM hardware
- SSR verification experiments
- Phase 1 exit criteria validation
- Shadow experiment scripts (experiments/shadows/*/run_*.py)

Scripts/notebooks that use this:
- experiments/shadows/preliminary_test/run_smoke_test.py
- experiments/validation/hardware_validation.py
- experiments/shadows/extended_ghz/run_ghz_extended.py
- experiments/shadows/parallel_bell_pairs/run_bell_pairs.py
- Hardware sections of notebooks/comprehensive_test_suite.ipynb

Example:

estimator = ShadowEstimator(backend="ibm:ibm_torino", data_dir="validation_data")

Retention:
- Keep during the Phase 1 validation period
- Archive successful runs to experiments/validation/archived_runs/
- Move final validation results to data/ for publication

`demo_data/` - Demos & Tutorials

Use for:
- Notebook demonstrations
- Quickstart examples
- Tutorial walkthroughs
- Non-critical testing

Notebooks that use this:
- notebooks/quickstart_shot_persistence.ipynb
- notebooks/comprehensive_test_suite.ipynb
- notebooks/noise_aware_shadows_demo.ipynb

Example:

from qiskit_aer import AerSimulator

estimator = ShadowEstimator(backend=AerSimulator(), data_dir="demo_data")

Retention: Ephemeral - safe to delete at any time

`notebook_data/` - Interactive Notebooks

Use for:
- Jupyter notebook experimentation
- Development and debugging
- Exploratory data analysis
- One-off interactive tests

Example:

estimator = ShadowEstimator(backend="ibm:ibm_torino", data_dir="notebook_data")

Retention: Ephemeral - safe to delete at any time

Directory Structure

All data directories follow the same subdirectory structure:

{data_dir}/
├── manifests/          # Provenance manifests (JSON)
│   └── {experiment_id}.json
├── shots/              # Raw measurement data (Parquet)
│   └── {experiment_id}.parquet
├── reports/            # Generated reports (HTML/PDF)
│   └── {experiment_id}_report.html
└── calibrations/       # MEM confusion matrices (optional)
    └── {experiment_id}_confusion.json

See data/README.md in the repository for detailed schema documentation.
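The layout above can also be reproduced programmatically, for example in a setup script or a test fixture. The following is a minimal sketch using only the Python standard library. The helper names `ensure_layout` and `experiment_paths` are hypothetical and not part of the QuartumSE API; the subdirectory names and file-name patterns are copied from the tree shown above.

```python
from __future__ import annotations

from pathlib import Path

# Subdirectory names taken from the layout described in this document.
SUBDIRS = ("manifests", "shots", "reports", "calibrations")


def ensure_layout(data_dir: str | Path) -> Path:
    """Create data_dir and its standard subdirectories if they are missing."""
    root = Path(data_dir)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def experiment_paths(data_dir: str | Path, experiment_id: str) -> dict[str, Path]:
    """Return the expected file locations for one experiment (hypothetical helper)."""
    root = Path(data_dir)
    return {
        "manifest": root / "manifests" / f"{experiment_id}.json",
        "shots": root / "shots" / f"{experiment_id}.parquet",
        "report": root / "reports" / f"{experiment_id}_report.html",
        "calibration": root / "calibrations" / f"{experiment_id}_confusion.json",
    }
```

Calling `ensure_layout("validation_data")` is safe to repeat because existing directories are left untouched.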

Git Tracking

Everything inside the data directories is git-ignored except README.md and .gitkeep files:
- ✅ README.md files (documentation)
- ✅ .gitkeep files (preserve empty directories)
- ❌ JSON manifests (ignored)
- ❌ Parquet shot data (ignored)
- ❌ HTML/PDF reports (ignored)

Reason: Experimental data files are large and change frequently. Only code and documentation are version-controlled.

Quick Commands

Create all directories:

mkdir -p data/{manifests,shots,reports,calibrations}
mkdir -p validation_data/{manifests,shots,reports,calibrations}
mkdir -p demo_data
mkdir -p notebook_data

Clean validation data:

rm -rf validation_data/{manifests,shots,reports,calibrations}/*

Clean demo/notebook data:

rm -rf demo_data/* notebook_data/*

List all experiments by directory:

ls -lh data/manifests/           # Production
ls -lh validation_data/manifests/ # Validation
ls -lh demo_data/manifests/      # Demos

Check total data usage:

du -sh data/ validation_data/ demo_data/ notebook_data/
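Where `du` is unavailable (for example on Windows, or inside a notebook), a similar report can be produced from Python. This is a rough sketch using only the standard library; `dir_size_bytes` and `usage_report` are hypothetical helper names. It sums apparent file sizes, so the numbers may differ slightly from `du`, which reports disk blocks.

```python
from pathlib import Path

# Directory names from this document; adjust if your checkout differs.
DATA_DIRS = ("data", "validation_data", "demo_data", "notebook_data")


def dir_size_bytes(path) -> int:
    """Sum the sizes of all regular files under path (0 if it does not exist)."""
    root = Path(path)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


def usage_report(base=".") -> dict:
    """Map each data directory name to its total size in bytes."""
    return {name: dir_size_bytes(Path(base) / name) for name in DATA_DIRS}


if __name__ == "__main__":
    for name, size in usage_report().items():
        print(f"{name + '/':<18} {size / 1024:8.1f} KB")
```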

Storage Estimates

Phase 1 (6 validation experiments):

validation_data/: ~1.6 MB
data/: ~0 MB (not used yet)

Phase 2+ (Production workloads):

data/: ~50-100 MB per workstream (C/O/B/M)
validation_data/: Archived after Phase 1

Per Experiment:

Manifest: ~10 KB
Shot data (500 shots, 3 qubits): ~50-100 KB
Report: ~50 KB
Total per experiment: ~100-200 KB
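The per-experiment figures can be turned into a quick planning estimate. The sketch below simply sums the component ranges listed above and scales them linearly with the number of experiments. Linear scaling is an assumption, and the shot-data figure applies only to the quoted configuration (500 shots, 3 qubits). `estimate_storage_kb` is a hypothetical helper, not part of QuartumSE.

```python
from __future__ import annotations

# Per-experiment size ranges in KB (low, high), copied from the estimates above.
# The shot-data range is for 500 shots on 3 qubits.
COMPONENT_KB = {
    "manifest": (10, 10),
    "shots": (50, 100),
    "report": (50, 50),
}


def estimate_storage_kb(n_experiments: int) -> tuple[int, int]:
    """Return a (low, high) storage estimate in KB for n_experiments."""
    low = sum(lo for lo, _ in COMPONENT_KB.values())
    high = sum(hi for _, hi in COMPONENT_KB.values())
    return n_experiments * low, n_experiments * high
```

Summing the components gives 110 to 160 KB per experiment, which the estimate above rounds to ~100-200 KB. Runs with more shots or qubits will need more space than this baseline.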

Best Practices

Always specify data_dir explicitly in scripts:

# Good
estimator = ShadowEstimator(..., data_dir="validation_data")

# Bad (default may change)
estimator = ShadowEstimator(...)

Use the appropriate directory for each purpose:
- Production → data/
- Validation/testing → validation_data/
- Demos → demo_data/
- Notebooks → notebook_data/

Archive validation results after Phase 1:

cp validation_data/manifests/final_validation.json \
   experiments/validation/archived_runs/results_final.txt

Clean temporary directories regularly:

# Weekly cleanup
rm -rf demo_data/* notebook_data/*

Check storage usage before long experiment runs:

df -h .  # Check available disk space
du -sh validation_data/  # Check current usage

FAQ

Q: Which directory should I use for the preliminary smoke test?
A: validation_data/ - it's part of Phase 1 validation.

Q: Can I move experiments between directories?
A: Yes. Manifests and shot data are self-contained, so you can move or copy the files:

mv validation_data/manifests/{id}.json data/manifests/
mv validation_data/shots/{id}.parquet data/shots/
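The shell commands above move the manifest and shot data. An experiment may also have a report and a calibration file under the names shown in the Directory Structure section. The following sketch moves whichever of the four files exist and refuses to overwrite anything at the destination. `experiment_files` and `move_experiment` are hypothetical helpers, not part of QuartumSE, and the sketch assumes (as stated above) that the files are self-contained.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def experiment_files(experiment_id: str) -> list[Path]:
    """Relative paths an experiment may occupy inside a data directory."""
    return [
        Path("manifests") / f"{experiment_id}.json",
        Path("shots") / f"{experiment_id}.parquet",
        Path("reports") / f"{experiment_id}_report.html",
        Path("calibrations") / f"{experiment_id}_confusion.json",
    ]


def move_experiment(experiment_id: str, src_dir, dst_dir) -> list[Path]:
    """Move every existing file of one experiment from src_dir to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    present = [rel for rel in experiment_files(experiment_id) if (src / rel).exists()]

    # Check for clashes before touching anything, so a failure moves nothing.
    clashes = [dst / rel for rel in present if (dst / rel).exists()]
    if clashes:
        raise FileExistsError(f"refusing to overwrite: {clashes}")

    moved = []
    for rel in present:
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src / rel), str(target))
        moved.append(target)
    return moved
```

For example, `move_experiment("{id}", "validation_data", "data")` with a real experiment ID substituted for `{id}` mirrors the two `mv` commands and additionally carries over any report or calibration file.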

Q: What if I accidentally delete a data directory?
A: Missing subdirectories are created automatically when needed. Re-run your experiment, or recreate them manually:

mkdir -p validation_data/{manifests,shots,reports,calibrations}

Q: How do I back up my validation data?
A: Copy the entire directory:

cp -r validation_data/ validation_data_backup_$(date +%Y%m%d)
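The same date-stamped backup can be scripted in Python, which may be convenient from a notebook or on systems without a POSIX shell. This is a minimal sketch; `backup_dir` is a hypothetical helper. Unlike `cp -r`, `shutil.copytree` raises an error if the backup directory already exists, so a second backup on the same day will not silently merge into the first.

```python
import shutil
from datetime import date
from pathlib import Path


def backup_dir(src_dir="validation_data", today=None) -> Path:
    """Copy src_dir to a sibling named <name>_backup_YYYYMMDD and return its path."""
    src = Path(src_dir)
    stamp = (today or date.today()).strftime("%Y%m%d")
    dst = src.with_name(f"{src.name}_backup_{stamp}")
    # copytree raises FileExistsError if dst already exists.
    shutil.copytree(src, dst)
    return dst
```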

Q: Why are Parquet files ignored by git?
A: They're binary files that can be large (MBs) and change frequently. Git is optimized for text files (code, docs).

Last Updated: 2025-10-22
QuartumSE Version: 0.1.0
