Adds pathogenicity scores (CADD, SpliceAI, REVEL, AlphaMissense) to VEP-annotated VCFs using vcfanno’s TOML-driven annotation engine.
VEP (step 13) provides functional impact predictions, but thorough variant interpretation benefits from additional pathogenicity scores. CADD captures deleteriousness across coding and non-coding variants; SpliceAI finds cryptic splice-disrupting variants missed by rule-based methods; REVEL and AlphaMissense score missense variants using ensemble and structural approaches respectively.
vcfanno annotates in a single streaming pass per database, making it much faster than re-running VEP with plugins.
- VEP-annotated VCF from step 13 (${SAMPLE}_vep.vcf.gz)
- Annotation databases in ${GENOME_DIR}/annotations/ (all optional)

See docs/00-reference-setup.md for download instructions.
| Database | File | Size | License |
|---|---|---|---|
| CADD v1.7 SNVs | `whole_genome_SNVs.tsv.gz` + `.tbi` | ~81.5 GB | Non-commercial |
| CADD v1.7 indels | `gnomad.genomes.r4.0.indel.tsv.gz` + `.tbi` | ~1.2 GB | Non-commercial |
| SpliceAI SNVs | `spliceai_scores.raw.snv.hg38.vcf.gz` + `.tbi` | ~16 GB | Apache 2.0 |
| SpliceAI indels | `spliceai_scores.raw.indel.hg38.vcf.gz` + `.tbi` | ~4 GB | Apache 2.0 |
| REVEL v1.3 | `revel_grch38.tsv.gz` + `.tbi` | ~526 MB | Free for research |
| AlphaMissense | `AlphaMissense_hg38.tsv.gz` + `.tbi` | ~613 MB | CC BY-NC-SA 4.0 |
All databases are optional. The script detects which files are present and annotates accordingly. Missing databases are silently skipped.
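The detection step described above could look roughly like the sketch below. It is an illustration only, not the contents of scripts/30-vcfanno.sh: the helper names `has_db` and `emit_vcfanno_conf`, the column indices, and the `ops` choices are all assumptions. The column numbers assume the upstream AlphaMissense file layout, and only two of the chr-prefixed databases are shown. CADD is left out because it needs bare chromosome names, and REVEL because its column layout depends on how the file was prepared.

```bash
# Sketch only: emit vcfanno [[annotation]] stanzas for the databases that exist.
# has_db and emit_vcfanno_conf are hypothetical helpers, not pipeline functions.

# has_db DIR FILE: succeeds only when the data file and its tabix index both exist.
has_db() { [ -f "$1/$2" ] && [ -f "$1/$2.tbi" ]; }

emit_vcfanno_conf() {
    dir="$1"

    # SpliceAI ships as a VCF, so the INFO field is copied by name.
    if has_db "$dir" spliceai_scores.raw.snv.hg38.vcf.gz; then
        printf '[[annotation]]\nfile = "%s"\nfields = ["SpliceAI"]\nnames = ["SpliceAI"]\nops = ["self"]\n\n' \
            "$dir/spliceai_scores.raw.snv.hg38.vcf.gz"
    fi

    # AlphaMissense is tab-delimited; columns 9 and 10 are assumed to hold
    # am_pathogenicity and am_class.
    if has_db "$dir" AlphaMissense_hg38.tsv.gz; then
        printf '[[annotation]]\nfile = "%s"\ncolumns = [9, 10]\nnames = ["AM_pathogenicity", "AM_class"]\nops = ["max", "first"]\n\n' \
            "$dir/AlphaMissense_hg38.tsv.gz"
    fi
}

# Possible usage (needs vcfanno, bgzip, and the real database files):
#   emit_vcfanno_conf "${GENOME_DIR}/annotations" > conf.toml
#   vcfanno -p 4 conf.toml "${SAMPLE}_vep.vcf.gz" | bgzip > annotated.vcf.gz
```

A database whose data file or index is missing produces no stanza, so it is skipped without an error. The `ops` entries decide how multiple overlapping rows are combined and are a guess here.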
CADD files use bare chromosome names (1, 2, 3) while the pipeline VCFs and other databases use chr-prefixed names (chr1, chr2, chr3). The script handles this with a two-pass approach:
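- One pass annotates with the chr-prefixed databases (SpliceAI, REVEL, AlphaMissense).
- The other pass strips the chr prefix from the VCF, annotates with CADD, and re-adds the chr prefix.

If only chr-prefixed databases are present (no CADD), a single pass is used.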
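The chromosome renaming around the CADD pass can be sketched as follows. This is a hedged illustration rather than the script's actual code: `make_chr_map` and `cadd_pass` are hypothetical helper names, the map covers only chr1-22, X, and Y (chrM and alternate contigs are left out), and it assumes the renaming is done with `bcftools annotate --rename-chrs`, which takes a file of old-name/new-name pairs.

```bash
# Sketch only: build rename maps and wrap the CADD pass in strip/re-add steps.

# make_chr_map strip -> "chr1<TAB>1" ... ; make_chr_map add -> "1<TAB>chr1" ...
make_chr_map() (
    mode="$1"
    for c in $(seq 1 22) X Y; do
        if [ "$mode" = strip ]; then
            printf 'chr%s\t%s\n' "$c" "$c"
        else
            printf '%s\tchr%s\n' "$c" "$c"
        fi
    done
)

# cadd_pass IN_VCF CADD_CONF OUT_VCF (requires bcftools, vcfanno, tabix on PATH)
cadd_pass() (
    in_vcf="$1"; conf="$2"; out_vcf="$3"
    tmp="$(mktemp -d)"
    make_chr_map strip > "$tmp/strip.txt"
    make_chr_map add > "$tmp/add.txt"

    # 1. Strip the chr prefix so contig names match the CADD files.
    bcftools annotate --rename-chrs "$tmp/strip.txt" -Oz -o "$tmp/nochr.vcf.gz" "$in_vcf"

    # 2. Annotate with CADD, then 3. restore the chr prefix on the way out.
    vcfanno "$conf" "$tmp/nochr.vcf.gz" \
        | bcftools annotate --rename-chrs "$tmp/add.txt" -Oz -o "$out_vcf" -

    tabix -p vcf "$out_vcf"
    rm -rf "$tmp"
)
```

Only `make_chr_map` runs without external tools. `cadd_pass` is defined but not invoked here, since it needs the real binaries and data.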
Container image: quay.io/biocontainers/vcfanno:0.3.7--he881be0_0

The script also uses the bcftools image for bgzip, tabix, and chromosome renaming operations.
export GENOME_DIR=/path/to/your/data
./scripts/30-vcfanno.sh your_sample
| File | Description |
|---|---|
| `vep/${SAMPLE}_annotated.vcf.gz` | VCF with CADD_PHRED, SpliceAI, REVEL, AM_pathogenicity, AM_class in INFO |
| `vep/${SAMPLE}_annotated.vcf.gz.tbi` | Tabix index |
Runtime is ~5-15 minutes, depending on which databases are present. The two-pass CADD approach adds ~2-3 minutes for the chromosome renaming steps.
| Field | Source | Description |
|---|---|---|
| `CADD_PHRED` | CADD v1.7 | PHRED-scaled deleteriousness score (higher = more deleterious) |
| `CADD_PHRED_indel` | CADD v1.7 | CADD score for indels specifically |
| `SpliceAI` | SpliceAI | Splice impact prediction (delta scores for AG/AL/DG/DL) |
| `REVEL` | REVEL v1.3 | Ensemble missense pathogenicity score (0-1) |
| `AM_pathogenicity` | AlphaMissense | Structure-informed missense pathogenicity (0-1) |
| `AM_class` | AlphaMissense | Classification: benign, ambiguous, or likely_pathogenic |
See docs/interpreting-results.md for score thresholds and clinical interpretation guidance.
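The SpliceAI value packs several numbers into one pipe-delimited string, which makes it awkward to filter numerically. The sketch below assumes the layout used by the precomputed SpliceAI files (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL) and returns the largest of the four delta scores. `spliceai_max_ds` is a hypothetical helper name, not something the pipeline provides.

```bash
# Sketch only: largest delta score (DS_AG, DS_AL, DS_DG, DS_DL) from one
# SpliceAI annotation string. If several comma-separated annotations are
# present, only the first one is inspected.
spliceai_max_ds() {
    printf '%s\n' "$1" | awk -F'|' '{
        max = $3
        for (i = 4; i <= 6; i++) {
            if ($i + 0 > max + 0) max = $i
        }
        print max
    }'
}

# Possible usage with bcftools (requires an annotated VCF):
#   bcftools query -f '%INFO/SpliceAI\n' "${SAMPLE}_annotated.vcf.gz" \
#     | while read -r ann; do spliceai_max_ds "$ann"; done
```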
# Variants with CADD PHRED >= 20 (top 1% most deleterious)
bcftools view -H -i 'INFO/CADD_PHRED>=20' ${SAMPLE}_annotated.vcf.gz | head
# AlphaMissense likely pathogenic variants
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AM_class\n' \
-i 'INFO/AM_class="likely_pathogenic"' ${SAMPLE}_annotated.vcf.gz
# High REVEL score missense variants (>= 0.644, ClinGen PP3 supporting evidence; moderate starts at 0.773)
bcftools view -i 'INFO/REVEL>=0.644' ${SAMPLE}_annotated.vcf.gz