The pipeline has a Nextflow DSL2 execution path for post-calling interpretation and clinical analysis. It accepts VCF + BAM from any upstream caller (e.g. nf-core/sarek, DRAGEN, the bash alignment scripts) and runs pharmacogenomics, variant annotation, clinical screening, structural variant analysis, and reporting across 6 workflows with 27 modules.
Both execution paths are maintained. The bash scripts (`run-all.sh`) remain the simpler option for single-machine use. Nextflow adds automatic parallelism, content-hash resume, and HPC/Singularity support. Both paths produce biologically equivalent results, though output file names and report scope may differ.
```bash
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
```
```bash
# 1. Create a samplesheet CSV
cat > samplesheet.csv << 'EOF'
sample,vcf,vcf_index,bam,bam_index
sergio,/path/to/sergio.vcf.gz,/path/to/sergio.vcf.gz.tbi,/path/to/sergio_sorted.bam,/path/to/sergio_sorted.bam.bai
EOF

# 2. Run (default tools need no external databases)
nextflow run main.nf \
    --input samplesheet.csv \
    --reference /path/to/Homo_sapiens_assembly38.fasta \
    --outdir ./results \
    -profile docker

# 3. To enable database-requiring tools, add them to --tools with their flags:
#    --tools '...,vep,slivar,clinical_filter' + --vep_cache /path/to/vep_cache
#    --tools '...,cpsr'                       + --pcgr_data + --vep_cache_cpsr
#    --tools '...,clinvar'                    + --clinvar + --clinvar_index
#    --tools '...,expansion_hunter'           + --expansion_catalog
```
Nextflow caches completed steps using content hashes. If a step fails, fix the issue and resume:
```bash
nextflow run main.nf -resume [same params as before]
```
Only the failed and downstream steps re-run.
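The caching idea can be sketched in a few lines of Python — a toy illustration of content-hash keying, not Nextflow's actual implementation:

```python
import hashlib
import json

def task_hash(script, inputs):
    """Key a task by its script text plus a digest of each input file's content."""
    payload = {
        "script": script,
        "inputs": sorted((name, hashlib.sha256(content).hexdigest())
                         for name, content in inputs.items()),
    }
    return hashlib.sha256(json.dumps(payload).encode()).hexdigest()

cache = {}

def run_or_resume(script, inputs, run):
    """Re-use the cached result when script and inputs are unchanged."""
    key = task_hash(script, inputs)
    if key in cache:
        return cache[key], "cached"
    result = run()
    cache[key] = result
    return result, "ran"

_, status1 = run_or_resume("vep -i in.vcf", {"in.vcf": b"AAA"}, lambda: "ok")
_, status2 = run_or_resume("vep -i in.vcf", {"in.vcf": b"AAA"}, lambda: "ok")
print(status1, status2)  # ran cached
```

Because the key covers both script and input content, editing either one invalidates exactly that task and everything downstream of it — which is why fixing a failed step and re-running with `-resume` skips all unaffected work.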
| Column | Required | Description |
|---|---|---|
| `sample` | Yes | Sample identifier (used as output directory name) |
| `vcf` | Yes | Path to bgzipped VCF (`.vcf.gz`) |
| `vcf_index` | Yes | Path to tabix index (`.vcf.gz.tbi`) |
| `bam` | No* | Path to aligned BAM (needed for BAM-based steps like pypgx) |
| `bam_index` | No* | Path to BAM index (`.bam.bai`) |
* BAM is technically optional (VCF-only runs are valid for annotation and PGx), but most default tools (mosdepth, telomere_hunter, cyrius, mito_variants) and opt-in tools (expansion_hunter, hla_typing, pypgx) require BAM input. Provide BAM for full analysis.
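A minimal check of these rules might look like the following Python sketch (the requirement that `bam` and `bam_index` appear together is an assumption for illustration, not necessarily what the pipeline enforces):

```python
import csv
import io

REQUIRED = ["sample", "vcf", "vcf_index"]  # bam / bam_index are optional

def parse_samplesheet(text):
    """Parse a samplesheet CSV and enforce the column rules from the table above."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        for col in REQUIRED:
            if not row.get(col):
                raise ValueError(f"missing required column {col!r} in row {row}")
        # Assumed rule: a BAM without its index (or vice versa) is an error.
        if bool(row.get("bam")) != bool(row.get("bam_index")):
            raise ValueError(f"bam and bam_index must be provided together: {row}")
    return rows

sheet = """sample,vcf,vcf_index,bam,bam_index
sergio,s.vcf.gz,s.vcf.gz.tbi,s.bam,s.bam.bai
"""
print(len(parse_samplesheet(sheet)))  # 1
```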
If you ran nf-core/sarek for alignment and variant calling, point the samplesheet at sarek’s output files:
```csv
sample,vcf,vcf_index,bam,bam_index
sergio,results/variant_calling/deepvariant/sergio/sergio.deepvariant.vcf.gz,results/variant_calling/deepvariant/sergio/sergio.deepvariant.vcf.gz.tbi,results/preprocessing/recalibrated/sergio/sergio.recal.bam,results/preprocessing/recalibrated/sergio/sergio.recal.bam.bai
```
| Profile | Description |
|---|---|
| `docker` | Run with Docker containers (default for local) |
| `singularity` | Run with Singularity/Apptainer (HPC clusters) |
| `test` | Minimal test with reduced resources |
| `test_full` | Full-size test with real WGS data |
Combine profiles: `-profile docker,test`
Default resource limits (tuned for 16-core consumer desktop):
| Parameter | Default | Description |
|---|---|---|
| `--max_cpus` | 16 | Maximum CPUs per process |
| `--max_memory` | `64.GB` | Maximum memory per process |
| `--max_time` | `48.h` | Maximum wall time per process |
Override for smaller machines:
```bash
nextflow run main.nf --max_cpus 8 --max_memory 32.GB [other params]
```
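The capping behaviour follows the nf-core `check_max` pattern (implemented in Groovy in the actual config); a Python sketch of the idea:

```python
def check_max(requested, limit):
    """Clamp a per-process resource request to the pipeline-wide cap."""
    return min(requested, limit)

def process_cpus(base_cpus, attempt, max_cpus):
    """Typical retry strategy: scale the request with each attempt, then clamp."""
    return check_max(base_cpus * attempt, max_cpus)

# With --max_cpus 8, a process asking for 16 CPUs is capped at 8.
print(check_max(16, 8))        # 8
# A 4-CPU process retried twice asks for 8, still within the cap.
print(process_cpus(4, 2, 8))   # 8
```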
```
results/
├── sergio/
│   ├── pharmcat/           # PharmCAT PGx reports (HTML + JSON)
│   ├── clinvar/            # ClinVar pathogenic variant screen
│   ├── pypgx/              # pypgx star allele calling (optional)
│   ├── cpic/               # CPIC drug-gene recommendations (optional)
│   ├── vep/                # VEP + vcfanno enriched VCF (CADD, SpliceAI, REVEL, AlphaMissense)
│   ├── slivar/             # Prioritized variants + compound hets
│   ├── clinical/           # Clinically relevant variant subset
│   ├── cpsr/               # Cancer predisposition report
│   ├── roh/                # Runs of homozygosity
│   ├── prs/                # Polygenic risk scores
│   ├── ancestry/           # Ancestry PCA (optional)
│   ├── mito/               # Mitochondrial haplogroup
│   ├── hla/                # HLA typing
│   ├── expansion_hunter/   # Repeat expansion calls
│   ├── telomere/           # Telomere length estimation
│   ├── coverage/           # Coverage statistics (mosdepth)
│   ├── mito_variants/      # Mitochondrial variant calling
│   ├── cyrius/             # CYP2D6 star allele (Cyrius)
│   ├── manta/              # SV calling (optional)
│   ├── delly/              # SV calling (optional)
│   ├── cnvnator/           # CNV calling (optional)
│   └── *_report.html       # Consolidated HTML report (published to sample root)
└── pipeline_info/
    ├── timeline_*.html
    ├── report_*.html
    ├── trace_*.txt
    └── dag_*.svg
```
| Feature | Bash (`run-all.sh`) | Nextflow (`main.nf`) |
|---|---|---|
| Setup complexity | Just Docker | Docker + Java + Nextflow |
| Resume on failure | File-existence checks | Content-hash caching (more robust) |
| Parallelism | Manual (`wait`, throttle) | Automatic DAG-based |
| HPC / Singularity | Not supported | Built-in |
| Learning curve | Shell scripting | Nextflow DSL2 + Groovy |
| Target audience | Non-bioinformaticians | Bioinformaticians, HPC users |
Recommendation: If you’re comfortable with bash and running on a single machine, use the bash scripts. If you want automatic parallelism, robust resume, or HPC support, use Nextflow.
This Nextflow pipeline is a post-calling interpretation pipeline, not a FASTQ-to-results pipeline. It accepts VCF + BAM from any upstream caller (e.g. nf-core/sarek, DRAGEN, the bash alignment scripts) and runs pharmacogenomics, annotation, clinical screening, structural variant calling, and reporting. Alignment and primary variant calling are handled upstream.
Both execution paths (bash run-all.sh and Nextflow main.nf) aim for biologically equivalent results — the same clinical conclusions, gene calls, and risk assessments. However, they are not output-identical: file names, directory structure, report formatting, and intermediate files may differ. When in doubt, the bash scripts are the reference implementation.
Several tools require large reference databases that are not automatically downloaded by the pipeline. You must obtain and provide paths for these yourself:
| Parameter | Required by | Size |
|---|---|---|
| `--vep_cache` | VEP annotation | ~15 GB |
| `--pcgr_data` | CPSR cancer predisposition | ~20 GB |
| `--pypgx_bundle` | PyPGx star allele calling | ~2 GB |
| `--cadd_snv`, `--spliceai_snv`, etc. | vcfanno score annotation | ~100 GB total |
| `--gnomad_constraint` | Slivar gene constraint | ~5 MB |
| `--pgs_scoring` | Polygenic risk scores | varies |
Tools that require external databases (VEP, slivar, clinvar, CPSR, ExpansionHunter, HLA typing, pypgx) will fail at startup if enabled in --tools without their required parameters. Annotation scores (CADD, SpliceAI, REVEL, AlphaMissense, gnomAD constraint) are optional enrichments — vcfanno and slivar degrade gracefully without them.
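The fail-fast check can be sketched using the tool-to-flag mapping listed in the quick-start comments (illustrative Python; the real validation lives in the Nextflow code):

```python
# Database-dependent tools and the parameters they require at startup.
DB_REQUIREMENTS = {
    "vep":              ["vep_cache"],
    "cpsr":             ["pcgr_data", "vep_cache_cpsr"],
    "clinvar":          ["clinvar", "clinvar_index"],
    "expansion_hunter": ["expansion_catalog"],
}

def validate_tools(tools, params):
    """Fail at startup if an enabled tool is missing a required database path."""
    missing = []
    for tool in tools:
        for param in DB_REQUIREMENTS.get(tool, []):
            if not params.get(param):
                missing.append(f"--tools '{tool}' requires --{param}")
    if missing:
        raise SystemExit("\n".join(missing))

validate_tools(["mosdepth", "cyrius"], {})          # default tools: no databases needed
validate_tools(["vep"], {"vep_cache": "/data/vep"}) # database tool with its flag: passes
```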
The --ancestry_ref parameter expects a single VCF file (not a directory). Single-sample PCA without a multi-population reference panel produces mathematically limited results — the module will run but report pca_status: skipped_single_sample. For meaningful ancestry estimation, provide a reference panel VCF containing multiple population samples.
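The degeneracy is easy to see in a few lines: with one sample, centering each variant site removes all signal, so every principal component collapses to zero (illustrative Python, not the module's code):

```python
# One sample's variant dosages at three sites.
dosages = [0.1, 0.9, 0.5]

# With a single sample, the per-site mean IS that sample's value,
# so centering leaves nothing behind.
mean_per_site = dosages
centered = [d - m for d, m in zip(dosages, mean_per_site)]
variance = sum(c * c for c in centered)

print(centered, variance)  # [0.0, 0.0, 0.0] 0.0
```

With a multi-sample reference panel, per-site means differ from the query sample's values, the centered matrix carries real variance, and the PCA has meaningful axes to project onto.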
The survivor_merge module uses a simplified bcftools-based heuristic (1kb position binning) rather than the full SURVIVOR or Jasmine algorithm. CNVnator calls (depth-based, no PASS/FAIL marking) are treated equally with paired-end callers in the “2+ callers” consensus. For production SV analysis, consider running SURVIVOR or Jasmine externally.
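The binning heuristic can be sketched as follows (illustrative Python; call structure and field names are assumptions):

```python
from collections import defaultdict

def merge_sv_calls(calls, bin_size=1000):
    """Consensus by 1 kb position binning: keep SVs reported by >= 2 callers.

    calls: iterable of (caller, chrom, pos, svtype) tuples.
    """
    bins = defaultdict(set)
    for caller, chrom, pos, svtype in calls:
        bins[(chrom, pos // bin_size, svtype)].add(caller)
    return {key for key, callers in bins.items() if len(callers) >= 2}

calls = [
    ("manta",    "chr1", 10_050, "DEL"),  # same 1 kb bin as the delly call
    ("delly",    "chr1", 10_400, "DEL"),  # -> 2 callers agree, kept
    ("cnvnator", "chr2", 55_000, "DUP"),  # single caller -> dropped
]
print(merge_sv_calls(calls))  # {('chr1', 10, 'DEL')}
```

The sketch also exposes the heuristic's weakness: two calls 2 bp apart but straddling a bin boundary (e.g. positions 10,999 and 11,001) land in different bins and never reach consensus, which full SURVIVOR/Jasmine breakpoint-distance matching would handle correctly.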
This pipeline is designed for personal, single-user use on trusted data. Sample names are sanitized (alphanumeric, ., _, - only), and HTML report fields from VCF INFO are escaped to prevent XSS. However, it is not hardened for multi-tenant or untrusted-input scenarios. Do not expose the pipeline or its outputs as a web service without additional security review.
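The two defenses can be sketched like this (illustrative Python, not the pipeline's actual code):

```python
import html
import re

def sanitize_sample_name(name):
    """Allow only alphanumerics plus . _ - in sample names (used in paths)."""
    if not re.fullmatch(r"[A-Za-z0-9._-]+", name):
        raise ValueError(f"invalid sample name: {name!r}")
    return name

def escape_info_field(value):
    """Escape a VCF INFO string before embedding it in the HTML report."""
    return html.escape(value, quote=True)

print(sanitize_sample_name("sergio"))
print(escape_info_field("<script>alert(1)</script>"))
```

Note what this does and does not cover: it blocks path traversal via sample names and script injection via INFO fields, but it is not a substitute for the broader hardening (authentication, sandboxing, resource limits) a multi-tenant deployment would need.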
The Cyrius module (CYP2D6 star allele calling) installs cyrius==1.1.1 via pip at runtime because no pre-built container image exists. This requires network access on first run and means Nextflow’s container-only reproducibility guarantee does not fully apply to this module. The version is pinned to avoid floating dependencies. The matching bash script (scripts/21-cyrius.sh) has the same limitation.
The CI test suite validates the stub-testable subset of modules using -stub dry runs (tools that do not require external databases). It does not cover database-dependent tools (vep, cpsr, clinvar, expansion_hunter) or run real bioinformatics tools on real data. Before trusting results from a new installation, run the pipeline on a known sample and compare key outputs (PharmCAT star alleles, ClinVar hit counts, PCA eigenvectors) against expected values.
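That comparison can be as simple as the following sketch (illustrative Python; the expected values shown are placeholders, not real reference results):

```python
# Hypothetical known-good values for a validation sample.
EXPECTED = {
    "pharmcat_cyp2c19": "*1/*17",  # placeholder star-allele call
    "clinvar_hits": 3,             # placeholder pathogenic-variant count
}

def compare_outputs(observed, expected):
    """Return {key: (observed, expected)} for every mismatching key."""
    return {k: (observed.get(k), v)
            for k, v in expected.items()
            if observed.get(k) != v}

observed = {"pharmcat_cyp2c19": "*1/*17", "clinvar_hits": 5}
print(compare_outputs(observed, EXPECTED))  # {'clinvar_hits': (5, 3)}
```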
This pipeline uses nf-core template patterns and tooling for code quality, but is not an official nf-core pipeline (it uses a GPL-3.0 license; nf-core requires MIT).
Individual modules (PharmCAT, pypgx, slivar) will be contributed to nf-core/modules under MIT license for use by the broader community.
This pipeline was created using tools and best practices from the nf-core community (Ewels et al., 2020, Nat Biotechnol). nf-core components used here are released under the MIT license.