genomics-pipeline

Step 23: Clinical Variant Filter

What This Does

Extracts the small subset of clinically interesting variants from your VEP-annotated VCF. Instead of manually searching through 4-5 million variants, this step produces a focused list of ~200-500 variants that are rare AND functionally impactful.

Why

The biggest challenge after running VEP annotation is: “I have millions of variants, what do I look at?” This step solves that by applying conservative filters to surface the variants most likely to be medically relevant.

Tool

bcftools + bcftools +split-vep plugin (parses VEP CSQ fields structurally — no grep)

Docker Image

staphb/bcftools:1.21

Input

VEP-annotated VCF from step 13: ${GENOME_DIR}/${SAMPLE}/vep/${SAMPLE}_vep.vcf (or .vcf.gz)

Command

./scripts/23-clinical-filter.sh your_name

What Gets Filtered

The script produces up to three variant sets (depending on VEP annotations available) that are merged:

HIGH Impact Variants

Stop-gained (premature stop codon — breaks the protein)
Frameshift insertions/deletions (shifts reading frame — breaks the protein)
Splice donor/acceptor (disrupts splicing — breaks the protein)
Start-lost (no translation initiation)

Expected count: 100-200 per genome. Uses bcftools +split-vep -s worst to select the most severe consequence per variant.

Rare MODERATE Impact Variants

Missense variants (amino acid change) with gnomAD allele frequency < 1%
In-frame insertions/deletions with gnomAD AF < 1%

Expected count: 200-400 per genome after frequency filtering. If VEP output lacks gnomAD frequencies (--af_gnomade), all MODERATE variants are included.

ClinVar Pathogenic/Likely Pathogenic (Conditional)

Variants with CLIN_SIG containing “pathogenic” (covers both pathogenic and likely_pathogenic)
Only runs if VEP was run with --everything or --check_existing (which populates the CLIN_SIG field)
Does not use -s worst — a variant is included if ANY transcript annotation has a pathogenic ClinVar entry

Expected count: 10-50 per genome (depends on ClinVar version).

Output

File	Contents	Size
`${SAMPLE}_clinical.vcf.gz`	Combined clinically interesting VCF	< 5 MB
`${SAMPLE}_clinical_summary.tsv`	Human-readable tab-delimited table	< 1 MB
`${SAMPLE}_high_impact.vcf.gz`	HIGH impact variants only	< 2 MB
`${SAMPLE}_rare_moderate.vcf.gz`	Rare MODERATE variants only	< 3 MB
`${SAMPLE}_clinvar_pathogenic.vcf.gz`	ClinVar P/LP only (if CLIN_SIG available)	< 1 MB

Runtime

~5-10 minutes (I/O-bound, reading the large VEP VCF)

How to Use the Output

Quick look at the summary

# View the most important variants
column -t ${GENOME_DIR}/${SAMPLE}/clinical/${SAMPLE}_clinical_summary.tsv | head -20

Cross-reference with ClinVar

# Find which clinical variants are also in ClinVar
docker run --rm -v "${GENOME_DIR}:/genome" staphb/bcftools:1.21 \
  bcftools isec -n=2 -w1 \
    /genome/${SAMPLE}/clinical/${SAMPLE}_clinical.vcf.gz \
    /genome/clinvar/clinvar.vcf.gz \
    -Oz -o /genome/${SAMPLE}/clinical/${SAMPLE}_clinical_clinvar.vcf.gz

Load in a genome browser

The _clinical.vcf.gz file is small enough to load in IGV Web or gene.iobio for visual inspection.

Limitations

This is a computational filter, not a clinical interpretation
Some pathogenic variants are LOW impact (e.g., synonymous variants affecting splicing, regulatory variants) and will be missed by this filter
gnomAD frequency filtering depends on VEP having annotated the gnomAD fields correctly
Always cross-reference findings with ClinVar and consult a professional for clinical decisions

Notes

No additional Docker images required — uses the same bcftools image as other steps
Uses bcftools +split-vep to parse VEP’s pipe-delimited CSQ annotation structurally (not grep)
The VEP VCF is compressed and indexed automatically if needed
PASS filter is applied to exclude low-quality variant calls
Available CSQ subfields are auto-detected — ClinVar and gnomAD filters are skipped gracefully if not present

This site is open source. Improve this page.