Annotates every variant in the VCF with gene name, consequence type, predicted impact, pathogenicity scores (SIFT, PolyPhen), and population allele frequencies. This is the most comprehensive single annotation step in the pipeline.
Raw VCF variants are just genomic coordinates and genotypes. VEP transforms them into biologically interpretable annotations — which gene is affected, what the functional consequence is, how rare the variant is in the population, and whether it is predicted damaging.
Docker image: ensemblorg/ensembl-vep:release_112.0
SAMPLE=your_sample
GENOME_DIR=/path/to/your/data
docker run --rm \
-v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
-v ${GENOME_DIR}/vep_cache:/vep_cache \
ensemblorg/ensembl-vep:release_112.0 \
vep \
--input_file /data/${SAMPLE}.vcf.gz \
--output_file /data/${SAMPLE}_vep.vcf \
--vcf \
--offline \
--cache \
--dir_cache /vep_cache \
--assembly GRCh38 \
--fork 4 \
--sift b \
--polyphen b \
--af \
--af_gnomade \
--canonical \
--symbol \
--force_overwrite
# Output: VCF with CSQ INFO field containing all annotations
Output: CSQ INFO field (pipe-delimited sub-fields).
Alternative: use --tab instead of --vcf for tab-delimited output (easier to parse manually).
Key sub-fields: SYMBOL, Consequence, IMPACT, SIFT, PolyPhen, gnomADe_AF, CANONICAL.
After annotation, filter to actionable variants:
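# Extract variants carrying a HIGH or MODERATE impact annotation
# (impact only; no allele-frequency cutoff is applied by this expression)
docker run --rm \
-v ${GENOME_DIR}/${SAMPLE}/vcf:/data \
staphb/bcftools:1.21 \
bcftools view -i 'INFO/CSQ[*] ~ "HIGH" || INFO/CSQ[*] ~ "MODERATE"' \
-o /data/${SAMPLE}_vep_filtered.vcf \
/data/${SAMPLE}_vep.vcf
# -o writes inside the container, so the file lands in the mounted /data directory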
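The bcftools expression above matches on impact only. A rarity cutoff needs the gnomADe_AF sub-field, whose position inside CSQ is declared in the VCF header. The following is a minimal sketch of one way to apply both conditions with awk, not part of the pipeline itself. The function name filter_vep_csq is hypothetical, and treating a missing gnomADe_AF as "rare" is an assumption you may want to change.

```sh
# Hypothetical helper: read an uncompressed VEP-annotated VCF on stdin and keep
# records where at least one CSQ entry has IMPACT HIGH or MODERATE and a
# gnomADe_AF that is missing or below the cutoff (default 0.01).
filter_vep_csq() {
  awk -F'\t' -v max_af="${1:-0.01}" '
    /^##INFO=<ID=CSQ,/ {
      # The sub-field order is declared after "Format: " in the header line.
      fmt = $0
      sub(/.*Format: /, "", fmt)
      sub(/".*/, "", fmt)
      n = split(fmt, names, "[|]")
      for (i = 1; i <= n; i++) idx[names[i]] = i
    }
    /^#/ { print; next }          # pass all header lines through unchanged
    {
      keep = 0
      m = split($8, info, ";")    # column 8 is INFO
      for (j = 1; j <= m; j++) {
        if (info[j] !~ /^CSQ=/) continue
        k = split(substr(info[j], 5), entries, ",")   # one entry per transcript
        for (e = 1; e <= k; e++) {
          split(entries[e], f, "[|]")
          impact = f[idx["IMPACT"]]
          af = f[idx["gnomADe_AF"]]
          # Assumption: an empty AF means the variant is absent from gnomAD exomes.
          if ((impact == "HIGH" || impact == "MODERATE") &&
              (af == "" || af + 0 < max_af + 0)) keep = 1
        }
      }
      if (keep) print
    }
  '
}

# Example (run on the host, against the mounted directory):
# filter_vep_csq 0.01 < "${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}_vep.vcf" \
#   > "${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}_vep_rare.vcf"
```

Because the field positions are read from the header rather than hard-coded, the sketch keeps working if VEP options that add or reorder CSQ sub-fields are changed.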
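--fork 4 enables parallelism; increase it if more cores are available.
--canonical adds a CANONICAL flag marking the canonical transcript of each gene. It does not remove other transcripts; filter on CANONICAL=YES, or add --pick, to get one annotation per variant.
--sift b and --polyphen b output both prediction and score (e.g., deleterious(0.01)).
--af_gnomade adds gnomAD exomes frequencies; use --af_gnomadg for gnomAD genomes (--af_gnomad without a suffix is the older name for the exomes option).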
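With --sift b and --polyphen b, each value arrives as a single string such as deleterious(0.01). When the score is needed as a number, the string has to be split. The sketch below does that with POSIX parameter expansion; the function name split_prediction is hypothetical and the tab-separated output is just one convenient choice.

```sh
# Hypothetical helper: turn "prediction(score)" into "prediction<TAB>score".
# A value without a parenthesised score is printed with an empty score column.
split_prediction() {
  value="$1"
  case "$value" in
    *"("*")")
      pred="${value%%"("*}"     # text before the first "("
      score="${value#*"("}"     # text after the first "("
      score="${score%")"}"      # drop the trailing ")"
      printf '%s\t%s\n' "$pred" "$score"
      ;;
    *)
      printf '%s\t\n' "$value"
      ;;
  esac
}

split_prediction 'deleterious(0.01)'
```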