genomics-pipeline

Step 14: Imputation Preparation

What This Does

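Prepares a WGS VCF for submission to the Michigan Imputation Server (MIS) or the TOPMed Imputation Server. The step filters the VCF to PASS variants, indexes the result, and splits it into one bgzip-compressed, tabix-indexed VCF per autosome (chr1-22), which is the per-chromosome layout the servers expect.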

Why

Imputation servers statistically infer missing genotypes using large reference panels. For WGS data, imputation is primarily useful for phasing (determining which alleles are on the same chromosome) rather than filling in missing variants. Phased data is required for haplotype-level analyses and accurate PRS calculation.
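
To make the phasing distinction concrete: in a VCF, the GT field separates alleles with `/` when the genotype is unphased and with `|` when it is phased. The helper below is a minimal illustrative sketch, not part of the pipeline; the function name `gt_phase_status` is made up for this example, and it only looks at the separator of a single GT string.

```shell
#!/usr/bin/env bash
# Classify a VCF genotype (GT) string by its allele separator.
# Per the VCF specification, "|" marks phased alleles and "/" unphased ones.
# gt_phase_status is a hypothetical helper name used only for illustration.
gt_phase_status() {
  local gt="$1"
  case "$gt" in
    *"|"*) echo "phased" ;;              # e.g. 0|1: allele order is meaningful
    *"/"*) echo "unphased" ;;            # e.g. 0/1: allele order is arbitrary
    *)     echo "haploid-or-missing" ;;  # e.g. "1" or "."
  esac
}
```

For example, `gt_phase_status '0/1'` reports an unphased heterozygous call, which is what a typical WGS caller emits before the data goes through a phasing step.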

Tool

Docker Image

staphb/bcftools:1.21

Command

SAMPLE=your_sample
GENOME_DIR=/path/to/your/data

# Step 1: Filter to PASS variants only
docker run --rm \
  -v "${GENOME_DIR}/${SAMPLE}/vcf:/data" \
  staphb/bcftools:1.21 \
  bcftools view -f PASS \
    "/data/${SAMPLE}.vcf.gz" \
    -Oz -o "/data/${SAMPLE}_pass.vcf.gz"

# Step 2: Index the filtered VCF (tabix index, required for the -r region queries below)
docker run --rm \
  -v "${GENOME_DIR}/${SAMPLE}/vcf:/data" \
  staphb/bcftools:1.21 \
  bcftools index -t "/data/${SAMPLE}_pass.vcf.gz"

# Step 3: Split by chromosome (chr1-22, autosomes only)
# Create the output directory on the host first; bcftools does not create it.
mkdir -p "${GENOME_DIR}/${SAMPLE}/vcf/imputation"

# Assumes contigs are named chr1..chr22 (GRCh38-style naming)
for CHR in $(seq 1 22); do
  docker run --rm \
    -v "${GENOME_DIR}/${SAMPLE}/vcf:/data" \
    staphb/bcftools:1.21 \
    bcftools view -r "chr${CHR}" \
      "/data/${SAMPLE}_pass.vcf.gz" \
      -Oz -o "/data/imputation/chr${CHR}.vcf.gz"

  docker run --rm \
    -v "${GENOME_DIR}/${SAMPLE}/vcf:/data" \
    staphb/bcftools:1.21 \
    bcftools index -t "/data/imputation/chr${CHR}.vcf.gz"
done

# Output: 22 per-chromosome VCF files (plus .tbi indexes) in
# ${GENOME_DIR}/${SAMPLE}/vcf/imputation/ on the host (/data/imputation/ in the container)
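
Before uploading, it can help to confirm that the loop produced every expected file. The following is a sketch of such a check, assuming the output naming used above (`chrN.vcf.gz` with a `.tbi` index alongside); the function name `missing_imputation_outputs` is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Report per-chromosome outputs (VCF or tabix index) missing from a directory.
# Prints one missing path per line and returns non-zero if anything is missing.
# missing_imputation_outputs is a hypothetical helper name.
missing_imputation_outputs() {
  local dir="$1"
  local missing=0
  local chr f
  for chr in $(seq 1 22); do
    for f in "${dir}/chr${chr}.vcf.gz" "${dir}/chr${chr}.vcf.gz.tbi"; do
      # -s is true only if the file exists and is non-empty
      if [ ! -s "$f" ]; then
        echo "$f"
        missing=1
      fi
    done
  done
  return "$missing"
}

# Example call, using the host side of the bind mount:
# missing_imputation_outputs "${GENOME_DIR}/${SAMPLE}/vcf/imputation" || echo "incomplete"
```

This only checks that the files exist and are non-empty. It does not validate their contents, so an empty-but-valid VCF (for example, if the contig names did not match `chr1`..`chr22`) would still pass.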

Server Options

| Server | Panel | Samples | Build | URL |
|---|---|---|---|---|
| Michigan (MIS) | HRC r1.1 | 32,470 | GRCh37/38 | imputationserver.sph.umich.edu |
| TOPMed | TOPMed r2 | 132,070 | GRCh38 native | imputation.biodatacatalyst.nhlbi.nih.gov |
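
The build column also affects contig naming. As far as the Michigan Imputation Server's data-preparation guidance goes, GRCh38 input is expected with a `chr` prefix (e.g. `chr20`) and GRCh37 input without one (e.g. `20`); check the current server documentation before relying on this. The sketch below encodes that convention in a small helper whose name, `expected_contig_name`, is made up for illustration.

```shell
#!/usr/bin/env bash
# Return the contig name an imputation server is assumed to expect
# for a given genome build and chromosome number.
# Assumption: GRCh38 uses a "chr" prefix, GRCh37 uses bare numbers.
# expected_contig_name is a hypothetical helper name.
expected_contig_name() {
  local build="$1"
  local chr="$2"
  case "$build" in
    GRCh38|hg38) echo "chr${chr}" ;;
    GRCh37|hg19) echo "${chr}" ;;
    *) echo "unknown build: ${build}" >&2; return 1 ;;
  esac
}
```

Note that changing contig names does not convert coordinates between builds. A GRCh38 call set submitted against a GRCh37 panel still needs a liftover, whether done locally or by the server.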

Important Notes