Comprehensive pharmacogenomic star allele calling for 23 clinically actionable genes, including structural variation (SV) detection from BAM read depth. Complements PharmCAT (step 7) by covering genes that VCF-only callers miss entirely, most notably CYP2D6 gene deletions and duplications.
PharmCAT (step 7) calls ~23 genes from VCF data alone. This works well for simple SNP-based star alleles but fails for genes with structural variation — CYP2D6, CYP2A6, GSTM1, and GSTT1 all have common whole-gene deletions and duplications that VCF callers cannot represent. CYP2D6 alone affects 25% of all prescribed drugs, and PharmCAT frequently returns “Not called” for it.
Cyrius (step 21) was designed specifically for CYP2D6 but can fail on some WGS samples due to CYP2D7 pseudogene homology. pypgx uses a different read-depth algorithm that handles this homology more robustly.
pypgx also calls genes absent from PharmCAT entirely: COMT, MTHFR, ABCB1, GSTM1, GSTT1, and IFNL3.
quay.io/biocontainers/pypgx:0.26.0--pyh7e72e81_0
pypgx requires a companion data bundle containing 1000 Genomes phasing panels (for Beagle haplotype estimation) and CNV classifier models. The bundle is not included in the Docker image and must be downloaded separately (~370 MB):
cd ${GENOME_DIR}/reference
git clone --branch 0.26.0 --depth 1 https://github.com/sbslee/pypgx-bundle.git
The bundle version must match the pypgx version (0.26.0). The script validates the bundle exists at ${GENOME_DIR}/reference/pypgx-bundle/ and exits with download instructions if missing.
${GENOME_DIR}/${SAMPLE}/aligned/${SAMPLE}_sorted.bam (+ .bai index)${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz (+ .tbi index)./scripts/32-pypgx.sh your_name
The script calls 23 curated genes spanning CPIC Level A/B recommendations and key genes PharmCAT misses.
These genes have common whole-gene deletions, duplications, or hybrid alleles that cannot be detected from VCF data alone. pypgx uses BAM read depth to identify copy number changes.
| Gene | Why BAM-based | Clinical Impact |
|---|---|---|
| CYP2D6 | Deletions (5), duplications (1x2, *2x2), CYP2D7 hybrids | 25% of all drugs; codeine, tamoxifen, SSRIs |
| CYP2A6 | Whole-gene deletion (*4), duplications | Nicotine metabolism, tegafur |
| GSTM1 | Homozygous deletion (null genotype) in ~50% of population | Detoxification, carcinogen metabolism |
| GSTT1 | Homozygous deletion (null genotype) in ~20% of population | Detoxification, drug conjugation |
These are called using both BAM and VCF data. Star alleles are determined from variant calls.
| Gene | Key Drugs | CPIC Level | Notes |
|---|---|---|---|
| CYP1A2 | Caffeine, clozapine, theophylline | B | |
| CYP2B6 | Efavirenz, methadone | A | |
| CYP2C9 | Warfarin, phenytoin, NSAIDs | A | |
| CYP2C19 | Clopidogrel, SSRIs, PPIs | A | |
| CYP3A4 | Tacrolimus (with CYP3A5) | B | |
| CYP3A5 | Tacrolimus | A | |
| CYP4F2 | Warfarin (vitamin K cycle) | B | |
| DPYD | Fluorouracil, capecitabine | A | |
| TPMT | Azathioprine, mercaptopurine | A | |
| NUDT15 | Azathioprine, mercaptopurine | A | |
| UGT1A1 | Irinotecan, atazanavir, bilirubin | A | |
| SLCO1B1 | Simvastatin, statins | A | |
| VKORC1 | Warfarin | A | |
| NAT2 | Isoniazid, hydralazine | A | |
| COMT | Catecholamine metabolism | – | Not in PharmCAT |
| MTHFR | Folate metabolism, methotrexate | – | Not in PharmCAT |
| ABCB1 | Drug efflux (broad substrate range) | – | Not in PharmCAT |
| G6PD | Rasburicase, primaquine, dapsone | A | |
| IFNL3 | Peginterferon (historical, DAAs replaced) | A |
CYP2D6 is the most complex pharmacogene. It has a tandemly duplicated pseudogene (CYP2D7) with >90% sequence identity, and common structural variants in the general population:
pypgx detects these by analyzing read depth across the CYP2D6/CYP2D7 locus. A drop in coverage indicates deletion; elevated coverage indicates duplication. VCF data alone cannot represent these copy number changes, which is why PharmCAT returns “Not called” for CYP2D6 in most WGS samples.
The BAM-based calling uses both --variants and --depth-of-coverage per the upstream pypgx WGS workflow, combining SNV/haplotype information with read-depth SV detection for the most complete genotype call.
All output is written to ${GENOME_DIR}/${SAMPLE}/pypgx/.
| File | Contents |
|---|---|
<gene>/results.zip |
Per-gene pypgx archive with genotype data |
${SAMPLE}_pypgx_summary.tsv |
Consolidated: gene, diplotype, phenotype, SV flag, source |
${SAMPLE}_pharmcat_comparison.tsv |
Side-by-side comparison with PharmCAT (if step 7 was run) |
| Column | Description |
|---|---|
| Gene | Gene symbol |
| Diplotype | Star allele call (e.g., 1/4) |
| Phenotype | Metabolizer status (e.g., Intermediate Metabolizer) |
| SV_detected | Whether structural variation was detected |
| Source | BAM (SV genes) or VCF (variant-based genes) |
| Column | Description |
|---|---|
| Gene | Gene symbol |
| PharmCAT_diplotype | Diplotype from step 7 |
| pypgx_diplotype | Diplotype from this step |
| Match | Yes, No, pypgx only, or PharmCAT only |
~20-40 minutes for 23 genes. All genes run sequentially inside a single Docker container to avoid repeated container startup overhead.
--cpus 4 --memory 8g
The BAM-based SV detection (CYP2D6, CYP2A6, GSTM1, GSTT1) is the most memory-intensive phase due to read-depth calculation across the locus.
| Aspect | PharmCAT (step 7) | pypgx (step 32) |
|---|---|---|
| Input | VCF only | BAM + VCF |
| CYP2D6 SVs | Cannot detect | Read-depth detection |
| GSTM1/GSTT1 | Not called | Deletion detection |
| COMT, MTHFR | Not covered | Covered |
| Drug recommendations | Yes (HTML/JSON report) | No (star alleles only) |
| CPIC integration | Built-in | Manual lookup via CPIC guidelines |
| Validation | Widely used (research tool) | Less extensively validated |
The two tools are complementary. PharmCAT provides drug recommendations for the genes it covers; pypgx extends coverage to genes PharmCAT cannot call. When both tools call the same gene, concordance is expected for simple genotypes but discrepancies can occur for complex haplotypes. Neither tool is definitively “correct” in all cases — discrepancies should be investigated by examining the underlying variant calls. Neither tool constitutes a clinical test; results should be confirmed by a certified pharmacogenomics laboratory before making prescribing decisions.
Aldy is widely considered the best CYP2D6 caller, with superior handling of complex structural rearrangements and hybrid alleles. However, Aldy is released under a custom academic-only license that is incompatible with GPL-3.0 redistribution. pypgx (Apache-2.0) is the GPL-compatible alternative used in this pipeline. If you are using this pipeline for personal/academic analysis and Aldy’s license terms are acceptable, it can be run separately.
BAM_GENES and VCF_GENES variables in the script to add more.'Series' object has no attribute 'Haplotype1'). In practice, GSTT1 often succeeds while CYP2D6, CYP2A6, and GSTM1 fail. All VCF-based genes are unaffected. For CYP2D6, Cyrius (step 21) or Aldy are alternatives. This is an upstream pypgx issue — monitor pypgx releases for a fix.0.26.0 in versions.env. Check pypgx releases periodically for updates to star allele definitions or algorithm improvements.cd ${GENOME_DIR}/reference && rm -rf pypgx-bundle && git clone --branch <new_version> --depth 1 https://github.com/sbslee/pypgx-bundle.git