Don’t commit 12+ hours and 500 GB to a full pipeline run before verifying everything works. This guide shows how to test with a small public dataset in under 30 minutes.
Chromosome 22 is the smallest autosome (~51 MB). Running the pipeline on chr22 data tests all tools with ~2% of the data, completing in minutes instead of hours.
We’ll use the Genome in a Bottle NA12878 sample (NIST reference standard):
export GENOME_DIR=/path/to/test/data
export SAMPLE=test_na12878
mkdir -p ${GENOME_DIR}/${SAMPLE}/vcf ${GENOME_DIR}/reference
# Download the GRCh38 reference (required, ~3.1 GB — skip if you already have it)
# See docs/00-reference-setup.md for full instructions
# Download a pre-called chr22 VCF from Genome in a Bottle
wget -O ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz \
"https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
wget -O ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz.tbi \
"https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi"
Note: This is the full-genome GIAB VCF (~250 MB). For a chr22-only test, extract just chr22:
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
staphb/bcftools:1.21 \
bcftools view -r chr22 \
/genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz \
-Oz -o /genome/${SAMPLE}/vcf/${SAMPLE}_chr22.vcf.gz
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
staphb/bcftools:1.21 \
bcftools index -t /genome/${SAMPLE}/vcf/${SAMPLE}_chr22.vcf.gz
If you extracted chr22 above, back up the original and symlink the chr22 extract:
# Preserve the original full VCF (idempotent — skips if already backed up)
cd ${GENOME_DIR}/${SAMPLE}/vcf
if [ ! -f ${SAMPLE}_full.vcf.gz ]; then
mv ${SAMPLE}.vcf.gz ${SAMPLE}_full.vcf.gz
mv ${SAMPLE}.vcf.gz.tbi ${SAMPLE}_full.vcf.gz.tbi
fi
# Point the pipeline at the chr22 extract
ln -sfn ${SAMPLE}_chr22.vcf.gz ${SAMPLE}.vcf.gz
ln -sfn ${SAMPLE}_chr22.vcf.gz.tbi ${SAMPLE}.vcf.gz.tbi
# To restore the full VCF later:
# rm ${SAMPLE}.vcf.gz ${SAMPLE}.vcf.gz.tbi
# mv ${SAMPLE}_full.vcf.gz ${SAMPLE}.vcf.gz
# mv ${SAMPLE}_full.vcf.gz.tbi ${SAMPLE}.vcf.gz.tbi
Then run the VCF-only steps (they expect ${SAMPLE}.vcf.gz):
# ClinVar screen (~1 min)
./scripts/06-clinvar-screen.sh ${SAMPLE}
# PharmCAT (~2 min)
./scripts/07-pharmacogenomics.sh ${SAMPLE}
# ROH analysis (~1 min)
./scripts/11-roh-analysis.sh ${SAMPLE}
# Should see ClinVar hits
ls -la ${GENOME_DIR}/${SAMPLE}/clinvar/
# Should see PharmCAT HTML report (written alongside VCF)
ls -la ${GENOME_DIR}/${SAMPLE}/vcf/*.report.html
# Should see ROH output
ls -la ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}_roh.txt
If all three steps produce output, your Docker setup, reference data, and pipeline scripts are working correctly.
If you want to test the BAM-dependent steps (Manta, CNVnator, Delly, TelomereHunter, indexcov), you need a BAM file. A chr22-only BAM is small enough (~3-4 GB) for a quick test:
# Download chr22 reads for NA12878 from 1000 Genomes
# (This is a ~3 GB download)
mkdir -p ${GENOME_DIR}/${SAMPLE}/aligned
wget -O ${GENOME_DIR}/${SAMPLE}/aligned/${SAMPLE}_sorted.bam \
"https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1000G_2504_highcov_chr22.bam"
# This URL may change. If it doesn't work, see:
# https://www.internationalgenome.org/data-portal/sample/NA12878
# Download any chr22 BAM for GRCh38
Alternative: Extract chr22 from a full BAM (if you already have one):
docker run --rm --user root \
--cpus 4 --memory 4g \
-v "${GENOME_DIR}:/genome" \
staphb/samtools:1.20 \
bash -c "samtools view -b /genome/${SAMPLE}/aligned/${SAMPLE}_sorted.bam chr22 \
> /genome/${SAMPLE}/aligned/${SAMPLE}_chr22.bam && \
samtools index /genome/${SAMPLE}/aligned/${SAMPLE}_chr22.bam"
Then test BAM-dependent steps:
./scripts/04-manta.sh ${SAMPLE} # Manta SVs (~2 min on chr22)
./scripts/16-indexcov.sh ${SAMPLE} # Coverage QC (~1 sec)
| Step | Expected Runtime | Expected Output |
|---|---|---|
| ClinVar screen | < 1 min | 0-5 pathogenic hits for NA12878 |
| PharmCAT | 1-3 min | HTML report with gene calls |
| ROH analysis | < 1 min | ROH segments file |
| Step | Expected Runtime | Expected Output |
|---|---|---|
| Manta | 1-3 min | Small VCF with chr22 SVs |
| indexcov | < 5 sec | Coverage plots for chr22 |
“No such file” errors: Check that GENOME_DIR is set and the downloaded files are in the expected paths.
Docker image pull fails: Some images are on quay.io, which occasionally has downtime. Wait and retry, or check the exact image tag in the step’s documentation.
0 ClinVar hits on test data: The GIAB benchmark VCF may not overlap with the ClinVar pathogenic subset if you are using a chr22-only extract. This is expected — the test validates that the pipeline mechanics work, not that there are clinical findings.
PharmCAT produces empty report: The GIAB VCF uses a different sample name internally. PharmCAT should still run but may show “No data” for some genes. This is expected for test data.
You’re ready to run on your real data:
GENOME_DIR to your actual data directorySAMPLE to your actual sample name./scripts/validate-setup.sh $SAMPLE for a comprehensive pre-flight check