genomics-pipeline

Step 25: Polygenic Risk Scores (PRS)

What This Does

Calculates polygenic risk scores (PRS) for 10 common conditions with plink2, using validated scoring files from the PGS Catalog. Each PRS aggregates the tiny effects of hundreds to millions of genetic variants into a single number representing your relative genetic predisposition for a trait or disease.

Why

Most common diseases (heart disease, diabetes, cancer) are not caused by a single gene. They result from the combined effect of many variants, each contributing a small amount of risk. A PRS sums these contributions using weights derived from large genome-wide association studies (GWAS). While no single variant is predictive on its own, the aggregate score can be informative.
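As a minimal sketch of the weighted sum described above (not the pipeline's own code): each variant contributes its GWAS weight multiplied by the number of effect alleles carried (0, 1, or 2), and variants missing from the genotype data are skipped. The variant IDs, weights, and the function name `polygenic_score` are made up for illustration.

```python
def polygenic_score(dosages, weights):
    """Sum weight * effect-allele dosage over variants present in both inputs."""
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical scoring weights keyed by variant ID (illustrative values only).
weights = {"1:1000": 0.25, "2:2000": -0.5, "3:3000": 0.125, "4:4000": 0.5}

# Hypothetical effect-allele counts for one person; "4:4000" was not genotyped.
dosages = {"1:1000": 2, "2:2000": 0, "3:3000": 1}

score = polygenic_score(dosages, weights)
print(score)  # 0.25*2 + (-0.5)*0 + 0.125*1 = 0.625
```

Three of the four scoring variants were matched here, which is the same idea as the variants used versus variants total counts reported in the summary file.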

Tool

Docker Image

pgscatalog/plink2:2.00a5.10

Input

Command

./scripts/25-prs.sh your_name

Conditions Scored

Condition                   PGS ID     Source
Coronary artery disease     PGS000018  Khera et al. 2018
Type 2 diabetes             PGS000014  Mahajan et al. 2018
Breast cancer               PGS000004  Mavaddat et al. 2019
Prostate cancer             PGS000662  Conti et al. 2021
Atrial fibrillation         PGS000016  Khera et al. 2018
Alzheimer’s disease         PGS000334  De Rojas et al. 2021
Body mass index             PGS000027  Khera et al. 2019
Schizophrenia               PGS000738  PGC 2022
Inflammatory bowel disease  PGS000020  Khera et al. 2018
Colorectal cancer           PGS000055  Huyghe et al. 2019

What the Script Does Internally

  1. Downloads GRCh38-harmonized scoring files from the PGS Catalog FTP (one-time, cached in ${GENOME_DIR}/prs_scores/)
  2. Converts your VCF to plink2 binary format (pgen/pvar/psam), restricting to autosomes (chr1-22) and assigning variant IDs in chr:pos format (matching PGS Catalog convention)
  3. For each scoring file, reformats the PGS Catalog columns (chromosome, position, effect allele, weight) into plink2’s --score input format, deduplicating entries with the same variant ID and allele
  4. Runs plink2 --score for each condition, producing a .sscore file with the aggregate score and the number of variants matched
  5. Collects all results into a summary TSV
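Step 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of 25-prs.sh: it assumes the harmonized scoring file is tab-delimited with `#` header comments and columns named `hm_chr`, `hm_pos`, `effect_allele`, and `effect_weight`, and that variant IDs look like `1:1000` (no `chr` prefix). The function name `reformat_scoring_file` and the sample rows are hypothetical.

```python
import csv
import io


def reformat_scoring_file(lines):
    """Return (variant_id, effect_allele, weight) rows for autosomes, deduplicated."""
    # Skip the '#'-prefixed metadata header before the column header row.
    data = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.DictReader(data, delimiter="\t")
    seen = set()
    rows = []
    for rec in reader:
        chrom, pos = rec["hm_chr"], rec["hm_pos"]
        # Keep autosomes 1-22 only; drop rows without a harmonized position.
        if not chrom.isdigit() or not 1 <= int(chrom) <= 22 or not pos:
            continue
        key = (f"{chrom}:{pos}", rec["effect_allele"])
        if key in seen:  # same variant ID and allele already emitted
            continue
        seen.add(key)
        rows.append((key[0], key[1], float(rec["effect_weight"])))
    return rows


# Made-up scoring file content: one duplicate row and one chrX row.
example = io.StringIO(
    "#pgs_id=PGS_EXAMPLE\n"
    "hm_chr\thm_pos\teffect_allele\teffect_weight\n"
    "1\t1000\tA\t0.25\n"
    "1\t1000\tA\t0.25\n"
    "X\t500\tG\t0.1\n"
    "2\t2000\tT\t-0.5\n"
)
formatted = reformat_scoring_file(example)
for variant_id, allele, weight in formatted:
    print(variant_id, allele, weight, sep="\t")
```

The three output columns (variant ID, effect allele, weight) correspond to what plink2's --score expects to be pointed at; keeping the first occurrence of a duplicate is one possible policy, and the actual script may resolve duplicates differently.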

Output

File                        Contents
${SAMPLE}_prs_summary.tsv   Tab-delimited summary: condition, PGS ID, score, variants used, variants total
${PGS_ID}.sscore            Raw plink2 score output per condition
${PGS_ID}_formatted.tsv     Reformatted scoring file used for each calculation
${SAMPLE}.pgen/.pvar/.psam  plink2 binary genotype files (intermediate)

All output is written to ${GENOME_DIR}/${SAMPLE}/prs/.
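A quick way to inspect the summary file is to compute, for each condition, the fraction of scoring variants that were matched. The sketch below assumes the header row uses the names `Condition`, `PGS_ID`, `Score`, `Variants_Used`, and `Variants_Total`; check your own file's header and adjust. The function name `match_rates`, the sample rows, and the default threshold argument are illustrative.

```python
import csv
import io


def match_rates(tsv_text, threshold=0.5):
    """Return (pgs_id, matched_fraction, below_threshold) for each summary row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    report = []
    for row in reader:
        used = int(row["Variants_Used"])
        total = int(row["Variants_Total"])
        rate = used / total if total else 0.0
        report.append((row["PGS_ID"], rate, rate < threshold))
    return report


# Made-up summary content in the assumed column layout.
summary = (
    "Condition\tPGS_ID\tScore\tVariants_Used\tVariants_Total\n"
    "Example trait A\tPGS_A\t0.0012\t900\t1000\n"
    "Example trait B\tPGS_B\t-0.0004\t300\t1000\n"
)
report = match_rates(summary)
for pgs_id, rate, low in report:
    print(f"{pgs_id}\t{rate:.0%}\t{'LOW MATCH' if low else 'ok'}")
```

To run it on real output, read `${SAMPLE}_prs_summary.tsv` into a string and pass that to `match_rates`.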

Runtime

~20-40 minutes total (dominated by VCF-to-plink conversion and scoring across all 10 conditions).

Interpreting Results

The summary TSV contains a raw score for each condition. Here is what the columns mean:

What these scores are NOT

How to make them meaningful

Raw PRS become useful only when compared against a population distribution. To convert your score into a percentile, you need a reference panel of thousands of individuals with scores computed using the same scoring file. The PGS Catalog provides some population-level statistics, but full percentile calculation requires a reference cohort (not included in this pipeline).
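If you do obtain reference scores computed with the same scoring file and preprocessing, the conversion itself is simple. The sketch below uses a made-up reference list of ten values purely to show the arithmetic; a real reference panel would contain thousands of individuals, and the function names are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def percentile_in_reference(score, reference_scores):
    """Percentage of reference individuals with a score at or below `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


def z_score(score, reference_scores):
    """Standard deviations above or below the reference mean."""
    return (score - mean(reference_scores)) / pstdev(reference_scores)


# Made-up reference scores; NOT real population data.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
my_score = 0.75

pct = percentile_in_reference(my_score, reference)
z = z_score(my_score, reference)
print(f"percentile: {pct:.0f}, z-score: {z:.2f}")
```

The empirical percentile makes no assumption about the shape of the distribution, whereas the z-score is easiest to interpret when the reference scores are roughly normal.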

Comparing two people is only defensible when both were scored with the same PGS ID, the same scoring file version, the same genome build conventions, and the same preprocessing. Even then, treat the comparison as directional rather than clinically calibrated unless you also have a matched reference distribution.

As a rough guide:

Variant matching

Check the Variants_Used / Variants_Total ratio. If fewer than 50% of scoring variants matched, the score is less reliable. Low matching rates usually indicate:

Limitations

Notes

Maintenance