Personal-Genome-Pipeline

Step 26: Ancestry SNP Intersection [EXPERIMENTAL]

What This Does

Intersects your sample’s common SNPs with the 1000 Genomes Project reference panel and runs LD pruning. The script attempts single-sample PCA with plink2, but single-sample PCA is mathematically degenerate — PCA defines axes from variance across a cohort (Price et al. 2006), so one sample cannot produce interpretable principal components. The output is best understood as a prepared SNP set for users who want to extend it with joint multi-sample PCA against a reference panel.

Why

The intermediate outputs (shared SNPs, LD-pruned variant set) are useful for two reasons:

  1. PRS interpretation: Polygenic risk scores (step 25) are ancestry-dependent. The ancestry SNP set helps identify which population reference to use.
  2. Variant filtering: Population-specific variant frequencies help distinguish benign variants from truly rare findings.

This step does NOT produce a usable ancestry estimate. For ancestry analysis from WGS, you need joint PCA or admixture analysis against a multi-population reference panel (not implemented here). For a quick ancestry check, consumer services (23andMe, AncestryDNA) or tools like Gnomix with a reference cohort are more appropriate.

Tool

Docker Images

pgscatalog/plink2:2.00a5.10
staphb/bcftools:1.21

Input

Command

./scripts/26-ancestry.sh your_name

What the Script Does Internally

  1. Downloads 1000 Genomes reference SNPs (one-time, ~100 MB): fetches the GRCh38 biallelic SNV sites file and filters to common autosomal SNPs (MAF 5-95%)
  2. Downloads population labels: maps each 1000G sample to its super-population (AFR, AMR, EAS, EUR, SAS)
  3. Intersects your VCF with the reference: finds SNPs present in both your sample and the 1000G panel using bcftools isec
  4. LD prunes: removes correlated SNPs (window 50, step 5, r-squared threshold 0.2) to avoid redundant signal. PCA requires independent markers.
  5. Runs PCA: computes the first 10 principal components from the LD-pruned SNP set

Output

File Contents
${SAMPLE}_pca.eigenvec Principal component values (10 PCs per sample)
${SAMPLE}_pca.eigenval Eigenvalues showing variance explained by each PC
${SAMPLE}_shared.vcf.gz SNPs shared between your sample and 1000G
${SAMPLE}_ld.prune.in SNPs retained after LD pruning
${SAMPLE}_ld.prune.out SNPs removed by LD pruning

All output is written to ${GENOME_DIR}/${SAMPLE}/ancestry/. Reference data is cached in ${GENOME_DIR}/ancestry_ref/.

Runtime

~15-30 minutes (dominated by the initial 1000G download on first run; subsequent runs are faster).

Interpreting Results

The eigenvec file contains your sample’s coordinates on 10 principal components. The eigenval file shows how much variance each PC explains.

Single-sample limitation

This script runs PCA on your sample alone, not jointly with the 1000G reference panel. This is a fundamental limitation: in population-structure PCA (Price et al. 2006), the PC axes are defined by the variance across many individuals. With a single sample, the axes instead capture internal genotype variance (e.g., heterozygosity patterns), which does not map onto population-level structure.

The PC values from this step are not comparable to published 1000G PCA plots, where PC1 separates African from non-African ancestry and PC2 separates European from East Asian. Those axis interpretations require joint PCA across a multi-population cohort.

To properly place yourself on a population map, you would need to:

  1. Download the full 1000G genotype data (~30-50 GB)
  2. Merge your sample with the 1000G samples
  3. Run joint PCA on the combined dataset
  4. Plot your sample against the 1000G population clusters

This pipeline does not perform joint PCA. The single-sample output is included as a starting point for users who want to extend it with their own reference panel.

Limitations

Notes