No single variant caller is universally best. Each tool uses a different algorithm – deep learning (DeepVariant), local haplotype assembly (GATK HaplotypeCaller), Bayesian haplotype-based calling (FreeBayes), or heuristic scoring (Strelka2) – and each makes different tradeoffs between precision and recall. A variant found by one caller may be missed by another, and vice versa.
Benchmarking lets you:
For background on caller performance differences, see:
The pipeline ships with four variant callers. Each writes output to a separate directory so they can run in parallel without conflicts.
| Script | Caller | Output Directory | Description |
|---|---|---|---|
scripts/03-deepvariant.sh |
DeepVariant 1.6.0 | vcf/ |
Default caller. Deep learning model, highest precision and F1. |
scripts/03a-gatk-haplotypecaller.sh |
GATK HaplotypeCaller 4.6.1 | vcf_gatk/ |
Gold standard in clinical labs. Good precision/recall balance, GVCF support. |
scripts/03b-freebayes.sh |
FreeBayes 1.3.6 | vcf_freebayes/ |
Bayesian caller. Highest sensitivity, most false positives, single-threaded. |
scripts/03c-strelka2-germline.sh |
Strelka2 2.9.10 | vcf_strelka2/ |
Fast heuristic caller (SNVs + indels). Best with BWA-MEM2 alignments. |
All four scripts accept the same arguments:
export GENOME_DIR=/path/to/your/data
./scripts/03-deepvariant.sh your_sample
./scripts/03a-gatk-haplotypecaller.sh your_sample
./scripts/03b-freebayes.sh your_sample
./scripts/03c-strelka2-germline.sh your_sample
GATK and FreeBayes also accept an INTERVALS environment variable to restrict calling to a specific region (see the chr22 example below). Strelka2 and TIDDIT always process the full genome.
An alternative aligner is also available:
| Script | Aligner | Output Directory | Description |
|---|---|---|---|
scripts/02-alignment.sh |
minimap2 2.28 | aligned/ |
Default. Faster, good for germline WGS. |
scripts/02a-alignment-bwamem2.sh |
BWA-MEM2 2.2.1 | aligned_bwamem2/ |
Produces XS tags needed by Strelka2. Slightly more accurate for somatic calling. |
There are two ways to evaluate caller performance:
Compare two or more caller VCFs against each other to measure agreement. This tells you how much the callers agree, but not which one is “right.”
Use bcftools isec to compute the intersection and unique calls:
SAMPLE=your_sample
GENOME_DIR=/path/to/your/data
BCFTOOLS_IMAGE="staphb/bcftools:1.21"
# Compare DeepVariant vs GATK (PASS variants only)
mkdir -p "${GENOME_DIR}/${SAMPLE}/benchmark"
docker run --rm -v "${GENOME_DIR}:/genome" "$BCFTOOLS_IMAGE" \
bcftools isec -p /genome/${SAMPLE}/benchmark/dv_vs_gatk \
-f PASS \
/genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz \
/genome/${SAMPLE}/vcf_gatk/${SAMPLE}.vcf.gz
This produces four files in the output directory:
| File | Contents |
|---|---|
0000.vcf |
Variants unique to caller A (DeepVariant) |
0001.vcf |
Variants unique to caller B (GATK) |
0002.vcf |
Shared variants (from caller A’s perspective) |
0003.vcf |
Shared variants (from caller B’s perspective) |
sites.txt |
Per-site presence/absence table |
Compute Jaccard similarity (shared / union):
SHARED=$(grep -c -v '^#' "${GENOME_DIR}/${SAMPLE}/benchmark/dv_vs_gatk/0002.vcf")
UNIQUE_A=$(grep -c -v '^#' "${GENOME_DIR}/${SAMPLE}/benchmark/dv_vs_gatk/0000.vcf")
UNIQUE_B=$(grep -c -v '^#' "${GENOME_DIR}/${SAMPLE}/benchmark/dv_vs_gatk/0001.vcf")
UNION=$((SHARED + UNIQUE_A + UNIQUE_B))
echo "Jaccard: $(echo "scale=4; $SHARED / $UNION" | bc)"
Compare caller output against a known-correct truth set. This gives you absolute precision, recall, and F1 scores.
Important: Truth set benchmarking is only valid when the query VCF was generated from the same biological sample as the truth set. If the truth set is HG002, you must sequence HG002 DNA, align it, call variants, and then compare those calls against the HG002 truth VCF. Running hap.py with your personal sample against the HG002 truth set produces meaningless precision/recall numbers because the “true” variants are different for every individual.
The Genome in a Bottle (GIAB) consortium publishes validated truth sets for several reference samples. The most widely used is HG002 (Ashkenazi Jewish male), which has the most comprehensive high-confidence calls.
GENOME_DIR=/path/to/your/data
mkdir -p "${GENOME_DIR}/giab"
# HG002 truth VCF (GRCh38, v4.2.1)
wget -c -P "${GENOME_DIR}/giab" \
"https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
wget -c -P "${GENOME_DIR}/giab" \
"https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi"
# HG002 high-confidence regions BED
wget -c -P "${GENOME_DIR}/giab" \
"https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"
The BED file defines the regions where the truth set is confident. Variants outside these regions are excluded from benchmarking because the truth status is unknown.
The pipeline’s quick-test.md uses NA12878 (HG001), which is also a valid truth set:
wget -c -P "${GENOME_DIR}/giab" \
"https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
HG002 is preferred for benchmarking because its truth set covers more difficult genomic regions.
${GENOME_DIR}/giab/
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz # Truth VCF
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi # Truth VCF index
HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed # High-confidence BED
hap.py (Illumina) is the standard benchmarking tool for SNP and indel callers. It decomposes complex variants, performs genotype matching, and reports precision/recall/F1 stratified by variant type.
# IMPORTANT: SAMPLE must be the GIAB sample that matches the truth set.
# If using HG002 truth, you must have sequenced and called variants on HG002.
SAMPLE=HG002
GENOME_DIR=/path/to/your/data
TRUTH_VCF="/genome/giab/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
CONF_BED="/genome/giab/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"
REF="/genome/reference/Homo_sapiens_assembly38.fasta"
# Benchmark DeepVariant against truth set
docker run --rm \
--cpus 4 --memory 16g \
-v "${GENOME_DIR}:/genome" \
jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
"$TRUTH_VCF" \
"/genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz" \
-r "$REF" \
-f "$CONF_BED" \
-o "/genome/${SAMPLE}/benchmark/deepvariant_vs_truth" \
--threads 4
# Repeat for GATK
docker run --rm \
--cpus 4 --memory 16g \
-v "${GENOME_DIR}:/genome" \
jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
"$TRUTH_VCF" \
"/genome/${SAMPLE}/vcf_gatk/${SAMPLE}.vcf.gz" \
-r "$REF" \
-f "$CONF_BED" \
-o "/genome/${SAMPLE}/benchmark/gatk_vs_truth" \
--threads 4
# Repeat for FreeBayes
docker run --rm \
--cpus 4 --memory 16g \
-v "${GENOME_DIR}:/genome" \
jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
"$TRUTH_VCF" \
"/genome/${SAMPLE}/vcf_freebayes/${SAMPLE}.vcf.gz" \
-r "$REF" \
-f "$CONF_BED" \
-o "/genome/${SAMPLE}/benchmark/freebayes_vs_truth" \
--threads 4
Each hap.py run produces several output files:
| File | Contents |
|---|---|
*.summary.csv |
Precision, recall, F1 by variant type (SNP, INDEL) |
*.extended.csv |
Detailed metrics by quality tier and subtype |
*.vcf.gz |
Annotated VCF with TP/FP/FN labels per variant |
*.roc.* |
ROC curve data for quality-score-based filtering |
| Metric | What It Measures | Formula |
|---|---|---|
| Precision | How many called variants are real (low = too many false positives) | TP / (TP + FP) |
| Recall | How many real variants were found (low = missing real variants) | TP / (TP + FN) |
| F1 | Harmonic mean of precision and recall (balanced accuracy) | 2 * (Precision * Recall) / (Precision + Recall) |
| Jaccard | Overlap between two call sets (pairwise mode) | Shared / (Shared + Unique_A + Unique_B) |
Where:
These are representative numbers from published benchmarks and real pipeline runs. Your results will vary depending on coverage, sample quality, and alignment.
| Caller | SNP Precision | SNP Recall | SNP F1 | Indel Precision | Indel Recall | Indel F1 |
|---|---|---|---|---|---|---|
| DeepVariant 1.6 | 0.999+ | 0.998+ | 0.999 | 0.995+ | 0.993+ | 0.994 |
| GATK HC 4.6 | 0.998 | 0.997 | 0.998 | 0.985 | 0.980 | 0.983 |
| FreeBayes 1.3 | 0.990 | 0.998 | 0.994 | 0.950 | 0.970 | 0.960 |
Pairwise concordance (Jaccard similarity, PASS SNPs only):
| Pair | Typical Jaccard |
|---|---|
| DeepVariant vs GATK | 0.95 – 0.98 |
| DeepVariant vs FreeBayes | 0.88 – 0.93 |
| GATK vs FreeBayes | 0.87 – 0.92 |
Key observations:
Running all four callers on the full genome takes 1-12 hours each (FreeBayes being the slowest at 8-12h single-threaded, Strelka2 the fastest at ~1h). For a quick script validation, you can restrict GATK and FreeBayes to chromosome 22 only (~1-2% of the genome, takes 3-8 minutes per caller), but full-genome runs are strongly recommended for meaningful benchmarking.
export GENOME_DIR=/path/to/your/data
export SAMPLE=your_sample
# DeepVariant does not support INTERVALS — use the full VCF and filter afterward
# (or run on the full genome; it is already fast enough on modern hardware)
# GATK on chr22 only
INTERVALS=chr22 ./scripts/03a-gatk-haplotypecaller.sh "$SAMPLE"
# FreeBayes on chr22 only
INTERVALS=chr22 ./scripts/03b-freebayes.sh "$SAMPLE"
For pairwise comparison (Step 2), extract chr22 from the DeepVariant full-genome VCF so both inputs cover the same region:
docker run --rm -v "${GENOME_DIR}:/genome" staphb/bcftools:1.21 \
bcftools view -r chr22 \
"/genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz" \
-Oz -o "/genome/${SAMPLE}/vcf/${SAMPLE}_chr22.vcf.gz"
docker run --rm -v "${GENOME_DIR}:/genome" staphb/bcftools:1.21 \
bcftools index -t "/genome/${SAMPLE}/vcf/${SAMPLE}_chr22.vcf.gz"
Note: The chr22 extract is only needed for Step 2 (pairwise
bcftools isec). Step 3 (hap.py truth) uses the full-genome VCF with-l chr22, which hap.py handles internally.
BCFTOOLS_IMAGE="staphb/bcftools:1.21"
# DeepVariant vs GATK on chr22
docker run --rm -v "${GENOME_DIR}:/genome" "$BCFTOOLS_IMAGE" \
bcftools isec -p /genome/${SAMPLE}/benchmark/chr22_dv_vs_gatk \
-f PASS \
/genome/${SAMPLE}/vcf/${SAMPLE}_chr22.vcf.gz \
/genome/${SAMPLE}/vcf_gatk/${SAMPLE}.vcf.gz
# DeepVariant vs FreeBayes on chr22
docker run --rm -v "${GENOME_DIR}:/genome" "$BCFTOOLS_IMAGE" \
bcftools isec -p /genome/${SAMPLE}/benchmark/chr22_dv_vs_fb \
-f PASS \
/genome/${SAMPLE}/vcf/${SAMPLE}_chr22.vcf.gz \
/genome/${SAMPLE}/vcf_freebayes/${SAMPLE}.vcf.gz
# Benchmark each caller's chr22 output against GIAB truth set
for CALLER_DIR in vcf vcf_gatk vcf_freebayes; do
LABEL=$(basename "$CALLER_DIR")
docker run --rm \
--cpus 4 --memory 16g \
-v "${GENOME_DIR}:/genome" \
jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
/genome/giab/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
"/genome/${SAMPLE}/${CALLER_DIR}/${SAMPLE}.vcf.gz" \
-r /genome/reference/Homo_sapiens_assembly38.fasta \
-f /genome/giab/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-o "/genome/${SAMPLE}/benchmark/${LABEL}_vs_truth_chr22" \
--threads 4 \
-l chr22
done
# Strelka2 has a different output path structure
docker run --rm \
--cpus 4 --memory 16g \
-v "${GENOME_DIR}:/genome" \
jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
/genome/giab/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
"/genome/${SAMPLE}/vcf_strelka2/results/variants/variants.vcf.gz" \
-r /genome/reference/Homo_sapiens_assembly38.fasta \
-f /genome/giab/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-o "/genome/${SAMPLE}/benchmark/vcf_strelka2_vs_truth_chr22" \
--threads 4 \
-l chr22
echo "=== DeepVariant ==="
cat "${GENOME_DIR}/${SAMPLE}/benchmark/vcf_vs_truth_chr22.summary.csv"
echo "=== GATK ==="
cat "${GENOME_DIR}/${SAMPLE}/benchmark/vcf_gatk_vs_truth_chr22.summary.csv"
echo "=== FreeBayes ==="
cat "${GENOME_DIR}/${SAMPLE}/benchmark/vcf_freebayes_vs_truth_chr22.summary.csv"
echo "=== Strelka2 ==="
cat "${GENOME_DIR}/${SAMPLE}/benchmark/vcf_strelka2_vs_truth_chr22.summary.csv"
Note: The file names above (
vcf_vs_truth_chr22.*) are from the manual hap.py examples. The automatedbenchmark-variants.sh --truthscript useshappy_<Caller>.*naming (e.g.,happy_DeepVariant.summary.csv).
After a full benchmarking run, expect these files in ${GENOME_DIR}/${SAMPLE}/benchmark/:
benchmark/
# Pairwise mode (benchmark-variants.sh without --truth):
isec_DeepVariant_vs_GATK/ # Pairwise: DeepVariant vs GATK
0000.vcf # Unique to DeepVariant
0001.vcf # Unique to GATK
0002.vcf # Shared (DV perspective)
0003.vcf # Shared (GATK perspective)
sites.txt # Per-site presence table
isec_DeepVariant_vs_FreeBayes/ # Pairwise: DeepVariant vs FreeBayes
...
isec_GATK_vs_FreeBayes/ # Pairwise: GATK vs FreeBayes
...
summary.txt # Human-readable results table
comparison.tsv # Machine-readable TSV
# Truth mode (benchmark-variants.sh --truth <vcf> --regions <bed>):
happy_DeepVariant.summary.csv # hap.py: DeepVariant precision/recall
happy_DeepVariant.extended.csv
happy_DeepVariant.vcf.gz # TP/FP/FN annotated VCF
happy_GATK.summary.csv # hap.py: GATK precision/recall
happy_GATK.extended.csv
happy_GATK.vcf.gz
happy_FreeBayes.summary.csv # hap.py: FreeBayes precision/recall
happy_FreeBayes.extended.csv
happy_FreeBayes.vcf.gz
summary.txt # Human-readable results table
comparison.tsv # Machine-readable TSV
benchmark-variants.sh script filters to PASS variants and normalizes VCFs with bcftools norm -m-both before running bcftools isec. This decomposes MNPs and left-aligns indels for fairer comparison. Without this, representation differences between callers (especially FreeBayes) depress Jaccard values.bcftools isec for truth set evaluation.These are actual results from a 30X WGS sample (Intel i5-14500, 64GB RAM).
| Caller | Total | SNPs | Indels | MNPs/Other |
|---|---|---|---|---|
| DeepVariant 1.6.0 | 5,558,877 | — | — | — |
| GATK HC 4.6.1 | 4,662,941 | 3,829,614 | 836,747 | 0 |
| FreeBayes 1.3.6 | 20,059,293 | 18,825,881 | 867,375 | 625,201 |
| Strelka2 2.9.10 | 5,565,634 | — | — | — |
| TIDDIT 3.9.5 | 5,346 SVs | 2,299 DEL | 639 DUP | 204 INV |
FreeBayes calls 4x more variants than DeepVariant. The ~16M FreeBayes-unique calls are overwhelmingly false positives. GATK is the most conservative. Strelka2 and DeepVariant are very close in total count.
Note: These numbers were computed on raw (unfiltered, unnormalized) VCFs. The benchmark-variants.sh script now applies PASS filtering and bcftools norm -m-both normalization before comparison, which will produce higher Jaccard values (especially for DV vs GATK, expected >0.95 for PASS SNPs).
| Caller A | Caller B | Shared | A Unique | B Unique | Jaccard |
|---|---|---|---|---|---|
| DeepVariant | GATK | 4,492,079 | 1,072,475 | 170,862 | 0.783 |
| DeepVariant | FreeBayes | 3,772,065 | 1,792,489 | 16,287,228 | 0.173 |
| GATK | FreeBayes | 3,495,334 | 1,167,607 | 16,563,959 | 0.165 |
Key observations:
| Caller | Runtime | Threads | Peak Memory |
|---|---|---|---|
| DeepVariant 1.6.0 | ~3-5 hours | 8 | 32 GB |
| GATK HC 4.6.1 | 8.6 hours (518 min) | 8 | 4.3 GB |
| FreeBayes 1.3.6 | 9.3 hours | 1 (single-threaded) | 12.8 GB |
| Strelka2 2.9.10 | 72 min | 8 | 1 GB |
| TIDDIT 3.9.5 | 7 min | 4 | <1 GB |
FreeBayes is the bottleneck — single-threaded with no parallelism flag. It also requires 32 GB memory for full-genome runs (peaked at 12.8 GB but grows unpredictably through complex regions). Plan accordingly when running all callers.