Extracts the small subset of clinically interesting variants from your VEP-annotated VCF. Instead of manually searching through 4-5 million variants, this step produces a focused list of ~200-500 variants that are rare AND functionally impactful.
The biggest challenge after running VEP annotation is: “I have millions of variants, what do I look at?” This step solves that by applying conservative filters to surface the variants most likely to be medically relevant.
bcftools with the +split-vep plugin (parses VEP CSQ fields structurally, without grep)
staphb/bcftools:1.21
${GENOME_DIR}/${SAMPLE}/vep/${SAMPLE}_vep.vcf (or .vcf.gz)

./scripts/23-clinical-filter.sh your_name
The script produces up to three variant sets (depending on VEP annotations available) that are merged:
HIGH impact variants. Expected count: 100-200 per genome. Uses bcftools +split-vep -s worst to select the most severe consequence per variant.
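A minimal sketch of what a HIGH impact selection with `-s worst` could look like. The function name `select_high_impact` and its arguments are hypothetical, the real script may be written differently, and the sketch assumes bcftools with the split-vep plugin is on PATH (for example inside the staphb/bcftools:1.21 container).

```shell
# Hypothetical helper, not taken from 23-clinical-filter.sh.
# Keeps variants whose most severe consequence has IMPACT=HIGH.
# Assumes bcftools (with the split-vep plugin) is on PATH.
select_high_impact() {
    local in_vcf="$1" out_vcf="$2"
    # -c IMPACT : extract the IMPACT subfield from the CSQ annotation
    # -s worst  : consider only the most severe consequence per variant
    # -i ...    : keep records where that consequence is HIGH impact
    bcftools +split-vep "$in_vcf" \
        -c IMPACT \
        -s worst \
        -i 'IMPACT="HIGH"' \
        -Oz -o "$out_vcf"
}
```

It would be called as `select_high_impact input_vep.vcf.gz high_impact.vcf.gz`, with both file names chosen by the caller.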
Rare MODERATE variants. Expected count: 200-400 per genome after frequency filtering. If the VEP output lacks gnomAD frequencies (added by VEP's --af_gnomade option), all MODERATE variants are included.
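The fallback described above can be sketched as a check on the CSQ header followed by one of two filters. Everything here is an assumption for illustration: the helper names, the `gnomADe_AF` field name (the field `--af_gnomade` is expected to add), the 0.01 cutoff, and the use of `-s worst` for this set. The mixed `-c` list with a per-field type is written from memory of the plugin documentation and is worth checking against `bcftools +split-vep -h`.

```shell
# Hypothetical helpers, not taken from 23-clinical-filter.sh.
# Assumes bcftools (with the split-vep plugin) is on PATH.

# Succeeds if the CSQ header declares the given subfield.
# `bcftools +split-vep -l` is assumed to print "index<TAB>name" per subfield.
has_vep_field() {
    local vcf="$1" field="$2"
    bcftools +split-vep -l "$vcf" |
        awk -v f="$field" '$2 == f { found = 1 } END { exit !found }'
}

# MODERATE variants, frequency-filtered only when gnomAD AF is annotated.
# The default cutoff of 0.01 is illustrative, not the script's documented value.
select_rare_moderate() {
    local in_vcf="$1" out_vcf="$2" max_af="${3:-0.01}"
    if has_vep_field "$in_vcf" gnomADe_AF; then
        # Variants with no gnomAD entry have a missing AF and may not pass
        # this comparison; how to treat them is left open in this sketch.
        bcftools +split-vep "$in_vcf" -c IMPACT,gnomADe_AF:Float -s worst \
            -i "IMPACT=\"MODERATE\" && gnomADe_AF<${max_af}" \
            -Oz -o "$out_vcf"
    else
        # No frequency annotation available: keep all MODERATE variants.
        bcftools +split-vep "$in_vcf" -c IMPACT -s worst \
            -i 'IMPACT="MODERATE"' \
            -Oz -o "$out_vcf"
    fi
}
```

Checking the header first keeps the fallback explicit, so an unexpectedly large MODERATE set can be traced back to missing frequency annotations rather than to the filter itself.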
ClinVar pathogenic variants. CLIN_SIG containing “pathogenic” (covers both pathogenic and likely_pathogenic).

Requires VEP to have been run with --everything or --check_existing (which populates the CLIN_SIG field).

Does not use -s worst: a variant is included if ANY transcript annotation has a pathogenic ClinVar entry.

Expected count: 10-50 per genome (depends on ClinVar version).
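A minimal sketch of a CLIN_SIG substring filter that looks at every transcript annotation. The function name `select_clinvar_pathogenic` is hypothetical and the script's actual expression may differ; the sketch assumes bcftools with the split-vep plugin is on PATH.

```shell
# Hypothetical helper, not taken from 23-clinical-filter.sh.
# Assumes bcftools (with the split-vep plugin) is on PATH.
select_clinvar_pathogenic() {
    local in_vcf="$1" out_vcf="$2"
    # No -s option here, so all transcript annotations are considered.
    # "~" is an unanchored regex match, so this also matches values such as
    # conflicting_interpretations_of_pathogenicity; tighten the expression
    # if those should be left out.
    bcftools +split-vep "$in_vcf" \
        -c CLIN_SIG \
        -i 'CLIN_SIG~"pathogenic"' \
        -Oz -o "$out_vcf"
}
```

Because the match is a plain substring test, reviewing the CLIN_SIG column of the resulting variants is a reasonable sanity check before treating every hit as pathogenic or likely pathogenic.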
| File | Contents | Size |
|---|---|---|
| ${SAMPLE}_clinical.vcf.gz | Combined clinically interesting VCF | < 5 MB |
| ${SAMPLE}_clinical_summary.tsv | Human-readable tab-delimited table | < 1 MB |
| ${SAMPLE}_high_impact.vcf.gz | HIGH impact variants only | < 2 MB |
| ${SAMPLE}_rare_moderate.vcf.gz | Rare MODERATE variants only | < 3 MB |
| ${SAMPLE}_clinvar_pathogenic.vcf.gz | ClinVar P/LP only (if CLIN_SIG available) | < 1 MB |
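One way to sanity-check a run is to count the records in each output file and compare against the expected ranges quoted for each variant set. The helpers `count_records` and `check_count` below are hypothetical, and the sketch assumes bcftools is on PATH.

```shell
# Hypothetical helpers for a quick post-run sanity check.
# Assumes bcftools is on PATH.

# Print the number of variant records in a VCF (header lines excluded).
count_records() {
    bcftools view -H "$1" | awk 'END { print NR }'
}

# Report whether a count falls inside an expected inclusive range.
check_count() {
    local label="$1" n="$2" lo="$3" hi="$4"
    if [ "$n" -ge "$lo" ] && [ "$n" -le "$hi" ]; then
        echo "${label}: ${n} (within ${lo}-${hi})"
    else
        echo "${label}: ${n} (outside ${lo}-${hi}, worth a closer look)"
    fi
}

# Example usage with the ranges given for each set:
#   dir="${GENOME_DIR}/${SAMPLE}/clinical"
#   check_count high_impact   "$(count_records "$dir/${SAMPLE}_high_impact.vcf.gz")"   100 200
#   check_count rare_moderate "$(count_records "$dir/${SAMPLE}_rare_moderate.vcf.gz")" 200 400
```

A count outside the range is not necessarily an error; it may simply reflect a different VEP or ClinVar version, or missing frequency annotations.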
~5-10 minutes (I/O-bound, reading the large VEP VCF)
# View the most important variants (first 20 lines, columns aligned on tabs)
head -20 "${GENOME_DIR}/${SAMPLE}/clinical/${SAMPLE}_clinical_summary.tsv" | column -t -s $'\t'
# Find which clinical variants are also in ClinVar
docker run --rm -v "${GENOME_DIR}:/genome" staphb/bcftools:1.21 \
bcftools isec -n=2 -w1 \
/genome/${SAMPLE}/clinical/${SAMPLE}_clinical.vcf.gz \
/genome/clinvar/clinvar.vcf.gz \
-Oz -o /genome/${SAMPLE}/clinical/${SAMPLE}_clinical_clinvar.vcf.gz
The _clinical.vcf.gz file is small enough to load in IGV Web or gene.iobio for visual inspection.
Uses bcftools +split-vep to parse VEP’s pipe-delimited CSQ annotation structurally rather than with grep.