genomics-pipeline

Step 13: Variant Effect Predictor (VEP) Annotation

What This Does

Annotates every variant in the VCF with gene name, consequence type, predicted impact, pathogenicity scores (SIFT, PolyPhen), and population allele frequencies. This is the most comprehensive single annotation step in the pipeline.

Why

Raw VCF variants are just genomic coordinates and genotypes. VEP turns them into biologically interpretable annotations: which gene is affected, what the functional consequence is, how rare the variant is in the population, and whether it is predicted to be damaging.

Tool

Docker Image

ensemblorg/ensembl-vep:release_112.0

Prerequisites
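
The command in the next section reads ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz and a VEP cache mounted from ${GENOME_DIR}/vep_cache. A minimal pre-flight check might look like the sketch below. The helper name check_vep_inputs is hypothetical, and the homo_sapiens/112_GRCh38 subdirectory is an assumption based on the usual Ensembl cache layout for release 112; adjust it if your cache is organised differently.

```shell
# Hypothetical pre-flight helper; not part of VEP or Docker.
check_vep_inputs() {
  # $1: GENOME_DIR, $2: SAMPLE
  vcf="$1/$2/vcf/$2.vcf.gz"
  cache="$1/vep_cache/homo_sapiens/112_GRCh38"   # assumed cache layout
  status=0
  [ -f "$vcf" ] || { echo "missing input VCF: $vcf" >&2; status=1; }
  [ -d "$cache" ] || { echo "missing VEP cache: $cache" >&2; status=1; }
  return "$status"
}

# Usage (commented out so the block can be sourced safely):
# check_vep_inputs "$GENOME_DIR" "$SAMPLE" || exit 1
```

The function reports every missing input before returning, so a single run shows everything that needs fixing.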

Command

SAMPLE=your_sample
GENOME_DIR=/path/to/your/data

docker run --rm \
  -v "${GENOME_DIR}/${SAMPLE}/vcf:/data" \
  -v "${GENOME_DIR}/vep_cache:/vep_cache" \
  ensemblorg/ensembl-vep:release_112.0 \
  vep \
    --input_file /data/${SAMPLE}.vcf.gz \
    --output_file /data/${SAMPLE}_vep.vcf \
    --vcf \
    --offline \
    --cache \
    --dir_cache /vep_cache \
    --assembly GRCh38 \
    --fork 4 \
    --sift b \
    --polyphen b \
    --af \
    --af_gnomade \
    --canonical \
    --symbol \
    --force_overwrite

# Output: VCF with CSQ INFO field containing all annotations

Output Format
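
VEP records the order of the pipe-separated CSQ subfields in the VCF header, in the Description of the CSQ INFO line after "Format: ". The exact list depends on the flags used, so it is worth reading it from the file rather than assuming it. The sketch below prints the subfield names one per line; csq_fields is a hypothetical helper name, and it expects an uncompressed VCF such as ${SAMPLE}_vep.vcf.

```shell
# Hypothetical helper: print the CSQ subfield names in header order.
csq_fields() {
  # $1: path to an uncompressed VEP-annotated VCF
  grep -m1 '^##INFO=<ID=CSQ' "$1" \
    | sed -e 's/.*Format: //' -e 's/">$//' \
    | tr '|' '\n'
}

# Usage (commented out so the block can be sourced safely):
# csq_fields "${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}_vep.vcf"
```

The position of a name in this list is the position of its value within each comma-separated CSQ entry of a record.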

Filtering for Clinical Relevance

After annotation, filter to actionable variants:

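# Extract variants with a HIGH or MODERATE impact annotation.
# This matches on impact only; an allele-frequency cutoff (e.g. AF <1%) needs a separate step.
# Write with -o so the file is created inside the container's /data mount;
# a shell redirect (>) would be resolved on the host, where /data does not exist.
docker run --rm \
  -v "${GENOME_DIR}/${SAMPLE}/vcf:/data" \
  staphb/bcftools:1.21 \
  bcftools view -i 'INFO/CSQ[*] ~ "HIGH" || INFO/CSQ[*] ~ "MODERATE"' \
    -o /data/${SAMPLE}_vep_filtered.vcf \
    /data/${SAMPLE}_vep.vcf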
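
The bcftools expression above matches on impact text only. One way to combine impact with a frequency cutoff is to parse the CSQ entries directly, sketched here with awk. This is an illustration, not the pipeline's own method: filter_csq is a hypothetical helper, it assumes the CSQ header lists subfields named IMPACT and gnomADe_AF (the name VEP is expected to use for --af_gnomade), and it treats a missing frequency as rare. Both assumptions should be checked against your header and your filtering policy.

```shell
# Hypothetical helper: keep records where any CSQ entry is HIGH/MODERATE
# impact AND its gnomAD exome AF is missing or below the cutoff.
filter_csq() {
  # $1: uncompressed VEP-annotated VCF, $2: maximum allele frequency (e.g. 0.01)
  awk -F'\t' -v max_af="$2" '
    /^##INFO=<ID=CSQ/ {
      # Map subfield names to positions using the header "Format: A|B|C" string.
      fmt = $0
      sub(/.*Format: /, "", fmt); sub(/">$/, "", fmt)
      n = split(fmt, names, "[|]")
      for (i = 1; i <= n; i++) idx[names[i]] = i
    }
    /^#/ { print; next }   # pass all header lines through unchanged
    {
      keep = 0
      m = split($8, info, ";")          # column 8 is INFO
      for (j = 1; j <= m && !keep; j++) {
        if (info[j] !~ /^CSQ=/) continue
        t = split(substr(info[j], 5), entries, ",")
        for (k = 1; k <= t; k++) {
          split(entries[k], f, "[|]")
          impact = f[idx["IMPACT"]]
          af = f[idx["gnomADe_AF"]]
          rare = (af == "" || af + 0 < max_af + 0)   # assumption: missing AF counts as rare
          if ((impact == "HIGH" || impact == "MODERATE") && rare) { keep = 1; break }
        }
      }
      if (keep) print
    }
  ' "$1"
}

# Usage (commented out so the block can be sourced safely):
# filter_csq "${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}_vep.vcf" 0.01 > filtered.vcf
```

Unlike a substring match on the whole CSQ string, this checks impact and frequency on the same transcript entry, so a record is kept only when a single annotation satisfies both conditions.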

Important Notes