Everything you need to know about disk space, RAM, CPU, and runtime before starting.
| Data | Size | When Created | Can Delete After? |
|---|---|---|---|
| Raw FASTQ (gzipped) | 60-90 GB | You bring this | Keep (original data) |
| Sorted BAM + index | 80-120 GB | Step 2 (alignment) | After all BAM-dependent steps complete |
| VCF + index | 80-200 MB | Step 3 (variant calling) | Keep (needed by many steps) |
| Manta SV VCF | 1-5 MB | Step 4 | Keep |
| AnnotSV TSV | 25-35 MB | Step 5 | Keep |
| ClinVar hits | <1 MB | Step 6 | Keep |
| PharmCAT report | 1-5 MB | Step 7 | Keep |
| ExpansionHunter output | <1 MB | Step 9 | Keep |
| TelomereHunter output | 50-200 MB | Step 10 | Keep |
| VEP annotated VCF | 2-5 GB | Step 13 | Keep (comprehensive annotation) |
| CPSR report + data | 50-200 MB | Step 17 | Keep |
| CNVnator ROOT file + calls | 5-15 GB | Step 18 | ROOT file can be deleted |
| Delly BCF + VCF | 5-20 MB | Step 19 | Keep VCF, delete BCF |
| MToolBox output | 50-200 MB | Step 20 | Keep |
| Subtotal per sample | 150-250 GB | | |
| Resource | Download Size | Extracted Size | Notes |
|---|---|---|---|
| GRCh38 FASTA + index | ~1 GB | ~3.5 GB | Core reference genome |
| ClinVar VCF + index | ~200 MB | ~200 MB | Updated monthly |
| VEP cache (Ensembl 112) | ~22 GB | ~30 GB | Largest single download |
| PCGR/CPSR data bundle | ~21 GB | ~30 GB | ClinVar + gnomAD + panels |
| Docker images (all 27 steps) | ~10-15 GB | ~10-15 GB | Cached by Docker |
| Subtotal (shared) | ~55 GB | ~80 GB | |
| Scenario | Minimum Free Space |
|---|---|
| 1 sample, core steps only (steps 2, 3, 6, 7) | 200 GB |
| 1 sample, full pipeline | 500 GB |
| 2 samples, full pipeline | 1 TB |
| 2 samples + keeping intermediates | 1.5 TB |
Tip: After the pipeline completes, the single largest file is the BAM (80-120 GB per sample). If you’re done with all BAM-dependent steps (4, 9, 10, 15, 16, 18, 19, 20), you can convert to CRAM to save 40-60% space, or delete the BAM entirely if you keep the FASTQ (you can always re-align).
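Before starting, it can help to confirm that the target volume really has the free space listed in the scenarios table. The following is a minimal sketch, not part of the pipeline's own scripts; the function names `free_gb` and `require_free_gb` and the `/data` path in the usage note are illustrative assumptions.

```shell
# Pre-flight disk check (illustrative helper, not shipped with the pipeline).

# Print free space, in whole GB, on the filesystem that holds the given path.
free_gb() {
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if at least $2 GB are free at path $1.
require_free_gb() {
    local path=$1 needed=$2 have
    have=$(free_gb "$path")
    if [ "$have" -lt "$needed" ]; then
        echo "Only ${have} GB free at ${path}; need ${needed} GB" >&2
        return 1
    fi
    echo "OK: ${have} GB free at ${path} (need ${needed} GB)"
}
```

For one sample through the full pipeline, a call such as `require_free_gb /data 500 || exit 1` at the top of a wrapper script would stop early instead of failing mid-alignment.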
Each pipeline step runs in a Docker container with a --memory limit. Here’s what each step actually needs:
| Step | Memory Limit | Peak Usage | Notes |
|---|---|---|---|
| 2 (minimap2 alignment) | 16 GB | 6-10 GB | minimap2 is RAM-efficient |
| 3 (DeepVariant) | 32 GB | 8-20 GB | Scales with --cpus |
| 4 (Manta) | 8 GB | 4-6 GB | Moderate |
| 6 (ClinVar screen) | 4 GB | 1-2 GB | Light |
| 7 (PharmCAT) | 4 GB | 2-3 GB | Light |
| 9 (ExpansionHunter) | 8 GB | 4-6 GB | Moderate |
| 10 (TelomereHunter) | 8 GB | 4-6 GB | Moderate |
| 13 (VEP) | 16 GB | 4-8 GB | Cache loaded into memory |
| 17 (CPSR) | 8 GB | 4-6 GB | Moderate |
| 18 (CNVnator) | 8 GB | 4-6 GB | ROOT file can be large |
| 19 (Delly) | 8 GB | 4-6 GB | Moderate |
Minimum system RAM: 16 GB (run one step at a time with reduced --memory flags)
Recommended: 32 GB (run multiple steps in parallel)
Ideal: 64 GB (run everything in parallel)
Reducing memory limits: If you have less RAM, edit the --memory flag in each script. Most steps will still run with a lower limit, but they may be slower or fail on edge cases. DeepVariant is the most memory-hungry.
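One way to pick a reduced limit is to cap each step's default at what the host can spare. The sketch below assumes a Linux host (it reads `/proc/meminfo`); the 2 GB headroom default and the helper names `system_ram_gb` and `fit_memory_gb` are assumptions for illustration, not values from the pipeline scripts.

```shell
# Total system RAM in whole GB (Linux only; rounds down, so a "16 GB"
# machine typically reports 15).
system_ram_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Cap a step's default limit so it fits the host.
# Args: default limit (GB), total system RAM (GB), headroom to leave (GB, default 2).
fit_memory_gb() {
    local want=$1 total=$2 reserve=${3:-2}
    local avail=$(( total - reserve ))
    if [ "$avail" -lt 1 ]; then avail=1; fi
    if [ "$want" -gt "$avail" ]; then echo "$avail"; else echo "$want"; fi
}

# Example (image and command are placeholders):
#   mem=$(fit_memory_gb 32 "$(system_ram_gb)")
#   docker run --rm --memory "${mem}g" --cpus 8 <image> <command>
```

On a 16 GB host the DeepVariant default of 32 GB would be lowered to 14 GB here, which sits inside the 8-20 GB peak range from the table, so large or unusual inputs may still run out of memory.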
All scripts use --cpus to limit Docker container CPU usage. More cores = faster, but with diminishing returns above 16 cores for most tools.
| Step | Default --cpus | Scales Linearly? | Notes |
|---|---|---|---|
| 2 (minimap2) | 8 | Yes, up to ~16 | I/O bound above 16 cores |
| 3 (DeepVariant) | 8 | Yes, up to ~32 | Most CPU-intensive step |
| 4 (Manta) | 8 | Yes | Already very fast |
| 13 (VEP) | 8 | Yes (--fork) | Can use all available cores |
| 18 (CNVnator) | 4 | Limited | Mostly single-threaded |
| 19 (Delly) | 4 | Limited | Per-chromosome parallelism |
Minimum: 4 cores (very slow but works)
Recommended: 16 cores (good balance of speed and availability)
No benefit beyond: ~32 cores for any single step
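The scaling notes above can be turned into a small helper that suggests a --cpus value per step. This is a sketch under assumptions: the caps are read off the table (16 for minimap2, 4 for CNVnator and Delly, 32 otherwise), and the function names are hypothetical.

```shell
# Number of online CPU cores; falls back to nproc where getconf lacks the key.
host_cores() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || nproc
}

# Suggest a --cpus value for a step number, given the cores available.
pick_cpus() {
    local step=$1 avail=$2 cap
    case "$step" in
        2)     cap=16 ;;  # minimap2: I/O bound above ~16 cores
        18|19) cap=4  ;;  # CNVnator, Delly: limited scaling
        *)     cap=32 ;;  # no benefit beyond ~32 cores for any single step
    esac
    if [ "$avail" -lt "$cap" ]; then echo "$avail"; else echo "$cap"; fi
}

# Example: cpus=$(pick_cpus 3 "$(host_cores)")
```

If several steps run at once, pass each call a share of the cores instead of the full host count.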
On a 16-core / 32 GB desktop (e.g., AMD Ryzen 9 5950X):
| Step Group | Steps | Runtime | Can Parallelize? |
|---|---|---|---|
| Alignment | 2 | 1-2 hours | No (one BAM per sample) |
| Variant Calling | 3 | 2-4 hours | No (needs BAM from step 2) |
| Quick Analyses | 4, 5, 6, 7, 9, 11, 12, 16 | ~1 hour total | Yes (all independent after step 3) |
| Heavy Annotation | 13, 17 | 2-5 hours total | Yes (both use VCF) |
| Optional SV Callers | 18, 19 | 2-4 hours each | Yes (both use BAM) |
| Optional Mito/Telomere | 10, 20 | 1-2 hours total | Yes (both use BAM) |
| Total (sequential) | All 20 | 12-20 hours | |
| Total (parallelized) | All 20 | 6-10 hours | |
After step 3 (variant calling) completes, many steps can run simultaneously:
    Step 3 done ──┬──> Steps 4, 6, 7, 9, 11, 12, 16 (quick, ~1 hr total)
                  ├──> Step 13 (VEP, ~2-4 hr)
                  ├──> Step 17 (CPSR, ~30-60 min)
                  ├──> Step 18 (CNVnator, ~2-4 hr)      ← These 3 use BAM, need RAM
                  ├──> Step 19 (Delly, ~2-4 hr)         ← Run 1-2 at a time
                  ├──> Step 10 (TelomereHunter, ~1 hr)
                  └──> Step 20 (MToolBox, ~15-30 min)
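The fan-out above can be driven from a plain shell script using background jobs and `wait`. This is a minimal sketch: `run_step` is a stand-in whose body you would replace with the real per-step script call (the path in the comment is hypothetical), and `run_parallel` is an illustrative name.

```shell
# Stand-in for launching one pipeline step; replace the body with the
# real script invocation for that step.
run_step() {
    echo "step $1 started"
    # e.g. ./scripts/step_$1.sh "$SAMPLE"   (hypothetical path)
    echo "step $1 finished"
}

# Launch every listed step in the background, then wait for all of them.
# Returns non-zero if any step failed.
run_parallel() {
    local pids="" failed=0 step pid
    for step in "$@"; do
        run_step "$step" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}

# Example schedule once step 3 has finished:
#   run_parallel 13 17      # VCF-based annotation
#   run_parallel 18 19      # BAM-heavy callers, two at a time
#   run_parallel 10 20
```

Grouping the BAM-based steps into batches of two follows the "run 1-2 at a time" note in the diagram and keeps combined memory use within a 32 GB machine.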
| Resource | Size | Notes |
|---|---|---|
| GRCh38 reference | ~1 GB | Fast download |
| ClinVar | ~200 MB | Fast download |
| VEP cache | ~22 GB | Slow servers, wget -c recommended for resume |
| PCGR data bundle | ~21 GB | Can be slow |
| Docker images | ~10-15 GB | Pulled automatically by docker run |
| Total | ~55-60 GB | |
Offline operation: After the initial setup, the entire pipeline runs offline. No internet required during analysis. All databases and caches are local.
samtools view -C -T reference.fasta input.bam > output.cram
Saves 40-60% (30-50 GB per sample).

Intermediate files that can be deleted:

- `.root` files (5-15 GB each)
- `.bcf` files (after converting to VCF)
- `*_singleton.fastq.gz` from MToolBox chrM extraction

bgzip sample_vep.vcf # Compresses from ~3.5 GB to ~400 MB
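Before deleting a BAM in favour of its CRAM, it is worth checking that the conversion succeeded. The sketch below wraps the conversion command shown above with a few sanity checks. It assumes `samtools` is available on the host (the pipeline itself runs its tools in Docker) and that the reference is the same FASTA used for alignment; the function name `bam_to_cram` is illustrative.

```shell
# Convert a sorted BAM to CRAM and sanity-check the result.
# Usage: bam_to_cram input.bam reference.fasta output.cram
bam_to_cram() {
    local bam=$1 ref=$2 cram=$3 n_bam n_cram
    if [ ! -f "$bam" ] || [ ! -f "$ref" ]; then
        echo "Missing input BAM or reference FASTA" >&2
        return 1
    fi
    samtools view -C -T "$ref" -o "$cram" "$bam" || return 1
    samtools index "$cram" || return 1
    # quickcheck verifies the header and the end-of-file marker.
    samtools quickcheck "$cram" || return 1
    # Compare record counts as a further check before deleting anything.
    n_bam=$(samtools view -c "$bam")
    n_cram=$(samtools view -c -T "$ref" "$cram")
    if [ "$n_bam" != "$n_cram" ]; then
        echo "Record count mismatch: BAM=$n_bam CRAM=$n_cram" >&2
        return 1
    fi
    echo "CRAM verified ($n_cram records): $cram"
}
```

The function deliberately leaves the original BAM in place; removing it is a separate, manual step once the check has passed.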
| Medium | Suitable For | Notes |
|---|---|---|
| NVMe SSD | Active analysis | Fastest. 10-50x faster than HDD for random reads. |
| SATA SSD | Active analysis | Good performance. Adequate for all pipeline steps. |
| HDD (7200 RPM) | Storage / archive | Adequate for sequential I/O (alignment, VEP). Random access steps (DeepVariant) will be slower. |
| Network storage (NFS/SMB) | Archive only | Too slow for active analysis. Use for long-term storage after pipeline completes. |
| USB external drive | Emergency only | Severely bottlenecks I/O-intensive steps. |
If you don’t have suitable hardware, cloud instances work well:
| Provider | Instance | vCPUs | RAM | Cost/hr | ~Cost per Sample |
|---|---|---|---|---|---|
| AWS | c5.4xlarge | 16 | 32 GB | ~$0.68 | ~$5-8 |
| GCP | n2-standard-16 | 16 | 64 GB | ~$0.78 | ~$6-10 |
| Azure | Standard_D16s_v5 | 16 | 64 GB | ~$0.77 | ~$6-10 |
| Hetzner | CCX33 | 8 | 32 GB | ~$0.18 | ~$2-3 |
Add ~$0.10/GB/month for persistent disk storage. A 500 GB disk costs ~$50/month.
Tip: Use spot/preemptible instances for 60-80% savings. The pipeline is restartable – if your instance gets preempted, just re-run the interrupted step.
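The per-sample figures in the table are roughly the hourly rate multiplied by the parallelized runtime. A small sketch of that arithmetic follows; the function names are illustrative, and the default storage rate is the ~$0.10/GB/month quoted above.

```shell
# Estimated compute cost in USD.
# Args: hourly rate, runtime in hours, optional spot discount in percent (0-100).
compute_cost() {
    LC_ALL=C awk -v rate="$1" -v hours="$2" -v disc="${3:-0}" \
        'BEGIN { printf "%.2f\n", rate * hours * (1 - disc / 100) }'
}

# Estimated monthly storage cost in USD.
# Args: disk size in GB, optional rate per GB per month (default 0.10).
storage_cost() {
    LC_ALL=C awk -v gb="$1" -v rate="${2:-0.10}" \
        'BEGIN { printf "%.2f\n", gb * rate }'
}

# Examples:
#   compute_cost 0.68 10      # c5.4xlarge for 10 hours -> 6.80
#   storage_cost 500          # 500 GB disk for a month -> 50.00
```

Passing a third argument to `compute_cost` (for example `70` for a 70% spot discount) gives a rough spot-instance estimate; actual spot prices vary by region and over time.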