You’ve run the pipeline. Now you have directories full of VCFs, TSVs, and HTML reports. This guide explains what to look at first and what it all means — no bioinformatics degree required.
If this is your first time looking at your own genomic data, the numbers can be alarming. Here is what a completely normal, healthy person’s genome looks like:
| Finding | Normal Range | Why It Seems Scary |
|---|---|---|
| Total variants | 4.5-5.5 million | Sounds like millions of “mutations” — but >99.9% are normal human variation |
| ClinVar pathogenic hits (step 6) | 0-10 | Step 6 screens against Pathogenic/Likely_pathogenic only. Most hits are recessive carriers — you need TWO copies to be affected |
| HIGH impact variants (VEP) | 100-150 | Most are heterozygous in non-essential genes. For recessive genes, one copy is typically tolerated (but see haploinsufficiency). |
| Structural variants | 5,000-10,000 | Most are in non-coding regions. Your parents had them too. |
| Heteroplasmic mitochondrial variants | 20-40 | Low-level heteroplasmy (<5%) is universal and age-related |
| VUS (Variants of Uncertain Significance) | 20-200+ | “Uncertain” means not enough data yet, not “probably bad” |
The single biggest source of unnecessary anxiety in personal genomics is VUS — Variants of Uncertain Significance. These are variants where:
Rule of thumb: If a variant is classified as VUS, treat it the same as if it were not tested. Do not change screening or management based on a VUS. Check back in 1-2 years when ClinVar may have reclassified it.
One nuance: If you have a strong family history of a condition AND a VUS appears in the relevant high-penetrance gene (e.g., BRCA1/2, TP53, MLH1/MSH2), it may be worth mentioning to a genetic counselor — not to act on the VUS, but because the family history itself may warrant enhanced screening regardless of the variant’s classification.
Not all ClinVar classifications are equally reliable. Each entry has a review status indicated by stars:
| Stars | Review Status | Reliability |
|---|---|---|
| 0 | No assertion criteria | Low — submitter did not explain their reasoning |
| 1 | Single submitter, criteria provided | Moderate — one lab’s interpretation |
| 2 | Two or more submitters, no conflict | Good — multiple labs agree |
| 3 | Expert panel reviewed | High — reviewed by specialists |
| 4 | Practice guideline | Highest — established clinical standard |
Focus on 2+ star entries. Single-submitter (1-star) pathogenic calls are sometimes reclassified. If you find a scary-looking pathogenic variant with 0-1 stars, check the ClinVar entry directly at ncbi.nlm.nih.gov/clinvar — look at the “Review status” and “Last evaluated” date.
The most common “pathogenic” finding in any genome is heterozygous carrier status for recessive conditions. This means:
What it tells you: Known pathogenic variants in your genome, as classified by ClinVar (NCBI’s public database of clinically significant variants).
Where to look: ${SAMPLE}/clinvar/
How to read it:
CLNSIG field confirms the classificationWhat to expect:
When to worry:
What it tells you: How your genes affect drug metabolism. These results are clinically relevant and should be shared with your prescribing physician.
Where to look: ${SAMPLE}/pharmcat/ (Nextflow) or ${SAMPLE}/vcf/ (bash scripts) — PharmCAT writes its reports there. Open the HTML report in a browser.
Key genes to check: | Gene | Affects | Common Impact | |—|—|—| | CYP2C19 | PPIs, clopidogrel, SSRIs, voriconazole | Rapid metabolizers burn through drugs too fast | | CYP2D6 | Codeine, tramadol, tamoxifen, many psych meds | Poor metabolizers get toxic buildup | | CYP2C9 | Warfarin, NSAIDs, phenytoin | Dose adjustment needed | | DPYD | 5-fluorouracil (cancer drug) | Poor metabolizers can die from standard doses | | SLCO1B1 | Statins (simvastatin, atorvastatin) | Increased myopathy risk | | NAT2 | Isoniazid (TB), caffeine | Slow acetylators have more side effects | | UGT1A1 | Irinotecan, atazanavir | 28/28 = Gilbert syndrome (elevated bilirubin) |
What to do: Share the PharmCAT report with your prescribing physician or pharmacist. PharmCAT is a research tool — its authors explicitly note that missing positions, unphased input, and undetected structural variation (especially CYP2D6) can affect genotype and phenotype calls. The report is a valuable starting point for pharmacogenomic-guided prescribing, but clinical confirmation may be warranted before making medication changes, especially for high-risk drugs (DPYD, CYP2D6-dependent opioids).
What it tells you: Cancer predisposition screening using CPSR’s curated cancer gene panels (panel 0 covers 500+ genes). This is broader than the 81-gene ACMG SF v3.2 list and focused specifically on cancer predisposition.
Where to look: ${SAMPLE}/cpsr/ — open the HTML report in a browser.
How to read it:
Unlike SNPs (single letter changes), structural variants are large rearrangements:
If you run Manta or Delly, you will see many BND calls — often hundreds or thousands. BND indicates a “breakend” where one end of a read pair maps to a different chromosome or a distant location. This sounds alarming (“translocation!”) but:
A typical human genome has:
If you ran multiple SV callers:
The AnnotSV TSV (step 5) adds clinical annotations to each SV:
ACMG_class: 1 (benign) to 5 (pathogenic)Overlapped_CDS_percent: How much of a gene is affectedACMG_class 4 or 5 that overlap known disease genesSome regions of DNA have short sequences repeated many times (e.g., CAG CAG CAG…). When the number of repeats exceeds a threshold, it can cause disease.
The output VCF lists each tested locus with the number of repeats found. Key loci:
| Locus | Gene | Normal | Intermediate / Premutation | Pathogenic | Disease |
|---|---|---|---|---|---|
| HTT | HTT | <=26 | 27-35 (mutable normal); 36-39 (reduced penetrance) | >=40 (full penetrance) | Huntington’s disease |
| FMR1 | FMR1 | <45 | 45-54 (intermediate); 55-200 (premutation) | >200 | Fragile X syndrome |
| ATXN1 | ATXN1 | <33 | — | >39 | Spinocerebellar ataxia 1 |
| C9orf72 | C9orf72 | <24 | 24-30 (gray zone, lab cutoffs vary) | Typically hundreds-thousands | ALS / Frontotemporal dementia |
| DMPK | DMPK | <35 | 35-49 (premutation) | >=50 | Myotonic dystrophy type 1 |
HTT 36-39 repeats (reduced penetrance): Individuals in this range may or may not develop Huntington’s disease. The risk increases with repeat length but is not certain. GeneReviews classifies >=40 as full penetrance and 36-39 as reduced penetrance. Alleles of 27-35 (“mutable normal”) do not cause disease but may expand in offspring.
C9orf72 gray zone: The exact pathogenic threshold for C9orf72 is not established. Laboratory cutoffs are discordant (JNNP 2021 review). Clearly pathogenic expansions are typically hundreds to thousands of repeats. Short-read WGS has limited ability to size very large expansions accurately.
FMR1 intermediate zone (45-54 repeats): Not affected, but repeats may expand in offspring. Carriers should receive genetic counseling. Premutation (55-200) carries risk of FXTAS (males >50) and FXPOI.
“ALL CLEAR” means no locus exceeded its clearly pathogenic threshold. Intermediate-range results should be discussed with a genetic counselor.
Telomeres are protective caps at the ends of chromosomes that shorten with cell division. TelomereHunter measures tel_content — the normalized telomere read count — as a proxy for relative telomere content.
tel_content = more telomeric reads (generally correlates with longer telomeres)tel_content = fewer telomeric reads (generally correlates with shorter telomeres)tel_content for 30X WGS: 300-800 (varies by sequencing platform and coverage)Runs of Homozygosity (ROH) are long stretches where both copies of your DNA are identical. Everyone has some ROH, but extensive ROH can indicate:
Some apparent ROH near centromeres are artifacts of low-coverage sequencing in repetitive regions. These can be ignored.
Your mitochondrial haplogroup traces your maternal ancestry lineage. It’s determined by the specific set of variants in your mitochondrial DNA (inherited only from your mother).
| Haplogroup | Origin | Notes |
|---|---|---|
| H | Western Europe | Most common in Europe (~40%) |
| U | Northern/Eastern Europe | Second most common (~15%) |
| T | Near East / Mediterranean | ~10% of Europeans |
| K | Near East | ~6% of Europeans, Ashkenazi ~30% |
| J | Near East | ~8% of Europeans |
| V | Iberian Peninsula / Scandinavia | ~4% |
| I | Near East / Europe | ~3% |
Mitochondrial haplogroups have weak associations with some diseases (Parkinson’s, diabetes, longevity), but these are population-level statistics, not individual predictions. The main clinical value is in step 20 (GATK Mutect2 mitochondrial mode), which detects disease-causing mitochondrial variants and heteroplasmy.
VEP annotates every variant with:
The single most useful annotation VEP adds is the gnomAD allele frequency — how common a variant is in the general population. This pipeline uses VEP’s --af_gnomade flag, which annotates with gnomAD exome frequencies only (not the combined exome+genome dataset). This means non-coding variants outside exome capture regions will lack gnomAD frequency annotations even if they appear in the gnomAD genome dataset. For coding variants (the most clinically relevant), exome frequencies are well-powered.
Key principle: A variant that is common in healthy people is almost certainly benign, regardless of what any prediction tool says.
| gnomAD AF | Interpretation | Action |
|---|---|---|
| > 5% (0.05) | Common polymorphism | Benign. Ignore. |
| 1-5% | Low-frequency variant | Almost certainly benign |
| 0.1-1% | Uncommon | Probably benign, but check ClinVar |
| 0.01-0.1% | Rare | Worth investigating if in a disease gene |
| < 0.01% | Very rare | Potentially significant. Check ClinVar + literature |
| Absent | Novel or ultra-rare | Could be significant OR a sequencing artifact. Verify with a second method |
If a variant is “pathogenic” in ClinVar but has gnomAD AF > 1%: The ClinVar entry may be outdated or wrong. Truly pathogenic variants for severe diseases are almost always rare (< 0.1%) because natural selection removes them from the population.
For finding potentially significant variants in the annotated VCF:
# High-impact variants (loss of function: stop-gain, frameshift, splice donor/acceptor)
grep "HIGH" ${SAMPLE}_vep.vcf | grep -v "^#"
# Rare missense variants predicted damaging by both SIFT and PolyPhen
grep "missense_variant" ${SAMPLE}_vep.vcf | grep "deleterious" | grep "probably_damaging"
# Variants absent from gnomAD (novel/ultra-rare)
grep "missense_variant" ${SAMPLE}_vep.vcf | grep -v "gnomAD_AF"
VEP classifies variant impact as: | Impact | Types | Interpretation | |—|—|—| | HIGH | Stop gained, frameshift, splice donor/acceptor | Likely breaks the protein | | MODERATE | Missense, in-frame insertion/deletion | Changes the protein, may or may not matter | | LOW | Synonymous, splice region | Probably no functional effect | | MODIFIER | Intronic, intergenic, UTR | Usually non-functional |
Everyone has ~100-150 HIGH impact variants. Most are in one copy (heterozygous) of non-essential genes. Don’t panic at the number.
If you ran step 30, your VCF now includes quantitative pathogenicity scores beyond VEP’s qualitative SIFT/PolyPhen predictions. These scores are widely used as computational evidence in variant interpretation (see ClinGen’s PP3/BP4 calibration framework).
Scores all variant types (coding, non-coding, splice, regulatory). Uses a PHRED-like scale where higher = more deleterious.
| CADD PHRED | Interpretation | Context |
|---|---|---|
| < 10 | Likely benign | Bottom 90% of genome variation |
| 10-20 | Uncertain | Top 10%, but most are still benign |
| 20-25 | Potentially deleterious | Top 1% — investigate if in a disease gene |
| 25-30 | Likely deleterious | Top 0.3% — strong candidate for pathogenicity |
| > 30 | Highly deleterious | Top 0.1% — likely damaging if in a constrained gene |
When to use CADD: Best for non-coding and splice-region variants where SIFT/PolyPhen don’t apply. For missense variants, REVEL and AlphaMissense are more specific.
Scores missense variants only. Combines 13 individual tools into a single 0-1 score. Recommended by ClinGen for ACMG PP3/BP4 evidence.
| REVEL Score | ClinGen Evidence Level | Interpretation |
|---|---|---|
| < 0.290 | BP4_Supporting | Supporting evidence of benign |
| 0.290-0.644 | No evidence | Uncertain significance |
| 0.644-0.773 | PP3_Moderate | Moderate evidence of pathogenicity |
| 0.773-0.932 | PP3_Strong | Strong evidence of pathogenicity |
| > 0.932 | PP3_Very Strong | Very strong evidence of pathogenicity |
When to use REVEL: First-line score for evaluating missense variants. If REVEL >= 0.644, investigate the variant seriously.
DeepMind’s protein-structure-informed missense classifier. Uses AlphaFold2 protein structure predictions to assess amino acid substitution impact.
| am_pathogenicity | am_class | Interpretation |
|---|---|---|
| < 0.34 | likely_benign | Predicted benign by protein structure analysis |
| 0.34-0.564 | ambiguous | Uncertain — use other evidence |
| > 0.564 | likely_pathogenic | Predicted damaging based on protein structure |
When to use AlphaMissense: Complements REVEL. If both REVEL and AlphaMissense agree a variant is damaging, this strengthens the computational evidence (ClinGen PP3). If they disagree, investigate further.
Deep learning model predicting splice-altering variants. Scores four types of splice disruption: acceptor gain (AG), acceptor loss (AL), donor gain (DG), donor loss (DL).
| Max Delta Score | Interpretation |
|---|---|
| < 0.2 | Unlikely to affect splicing |
| 0.2-0.5 | May affect splicing — investigate |
| 0.5-0.8 | Likely affects splicing |
| > 0.8 | Strong evidence of splice disruption |
When to use SpliceAI: VEP already flags canonical splice site variants (GT/AG dinucleotides). SpliceAI catches cryptic splice variants — intronic or exonic variants that create new splice sites or disrupt existing ones through more subtle mechanisms.
These are per-gene metrics (not per-variant) added to the clinical filter summary TSV. They measure how intolerant a gene is to different types of mutations.
| Metric | Threshold | Meaning |
|---|---|---|
| LOEUF < 0.35 | Constrained for loss-of-function | Gene is intolerant to LoF mutations — a HIGH impact variant here is more likely to cause disease |
| pLI >= 0.9 | Loss-of-function intolerant | Same as LOEUF but older metric. LOEUF is preferred. |
| mis_Z > 3.09 | Constrained for missense | Gene is intolerant to missense mutations — a REVEL-high missense here is more concerning |
Combining scores: A rare variant (gnomAD AF < 0.01%) with CADD > 25, in a constrained gene (LOEUF < 0.35), with ClinVar pathogenic classification, is a high-confidence pathogenic finding. Any one of these alone is insufficient.
Copy-paste these commands to extract the most clinically relevant variants. All assume your VEP-annotated VCF is at ${GENOME_DIR}/${SAMPLE}/vep/${SAMPLE}_vep.vcf.
VEP_VCF="${GENOME_DIR}/${SAMPLE}/vep/${SAMPLE}_vep.vcf"
# 1. Homozygous loss-of-function variants (most likely to cause disease)
grep -v "^#" "$VEP_VCF" | grep "HIGH" | grep "1/1" | head -20
# 2. Rare HIGH-impact variants (gnomAD AF < 0.1%)
# These are the variants most likely to be clinically significant
grep -v "^#" "$VEP_VCF" | grep "HIGH" | grep -v "gnomADe_AF=0\.[0-9]" | head -20
# 3. Compound heterozygous candidates: genes with 2+ heterozygous variants
# (potential autosomal recessive — needs manual curation)
grep -v "^#" "$VEP_VCF" | grep "0/1" | grep -oP 'SYMBOL=[^;|]+' | \
sort | uniq -c | sort -rn | awk '$1 >= 2' | head -20
# 4. Known ACMG actionable genes (81 genes in ACMG SF v3.2)
# Quick check if any HIGH/MODERATE variants land in these genes
# Note: this is a partial list of cancer-related genes for illustration.
# See https://www.nature.com/articles/s41436-023-02171-w for the full 81-gene list.
ACMG_GENES="BRCA1|BRCA2|MLH1|MSH2|MSH6|PMS2|APC|MUTYH|TP53|RB1|MEN1|RET|VHL|SDHB|SDHD|TSC1|TSC2|WT1|NF2|PTEN|STK11|BMPR1A|SMAD4|CDH1|PALB2|CHEK2|ATM|NBN|BARD1|RAD51C|RAD51D|BRIP1"
grep -v "^#" "$VEP_VCF" | grep -E "HIGH|MODERATE" | grep -E "$ACMG_GENES" | head -20
# 5. PharmCAT-relevant variants not caught by step 7
# (PharmCAT misses some alleles — check CYP2D6, DPYD, UGT1A1 manually)
grep -v "^#" "$VEP_VCF" | grep -E "CYP2D6|CYP2C19|CYP2C9|DPYD|UGT1A1|SLCO1B1|TPMT|NUDT15" | head -20
Important: These are starting points, not definitive screens. Any interesting finding should be cross-referenced with ClinVar and ideally confirmed by a second method (Sanger sequencing or a clinical lab).
CNVnator detects copy number variants using read depth analysis — complementary to Manta’s paired-end/split-read approach.
Where to look: ${SAMPLE}/cnvnator/${SAMPLE}_cnvs.txt
Format: Each line has: type, region, size, normalized_RD, e-value1, e-value2, e-value3, e-value4, q0
What to expect:
Filtering:
# Significant CNVs only (e-value < 0.01)
awk '$5 < 0.01' ${SAMPLE}_cnvs.txt
# Large deletions (>100kb, potentially clinically relevant)
awk '$1 == "deletion" && $3 > 100000 && $5 < 0.01' ${SAMPLE}_cnvs.txt
# Large duplications
awk '$1 == "duplication" && $3 > 100000 && $5 < 0.01' ${SAMPLE}_cnvs.txt
Multi-caller overlap: CNVs found by both Manta AND CNVnator have lower false-positive rates. Cross-reference by checking if the same genomic region appears in both output files.
Delly is a third structural variant caller, detecting deletions, duplications, inversions, and translocations.
Where to look: ${SAMPLE}/delly/${SAMPLE}_sv.vcf.gz
Quick summary:
# Count SVs by type
bcftools query -f '%INFO/SVTYPE\n' ${SAMPLE}_sv.vcf.gz | sort | uniq -c
# Filter PASS variants only
bcftools view -f PASS ${SAMPLE}_sv.vcf.gz | grep -cv '^#'
What to expect:
Multi-caller overlap: SVs detected by Manta + Delly + CNVnator (or any 2 of 3) have substantially lower false-positive rates. Single-caller calls, especially large ones, should be viewed with caution.
GATK Mutect2 in mitochondrial mode detects variants with heteroplasmy fractions — the proportion of your mitochondria carrying each variant.
Where to look: ${SAMPLE}/mito/${SAMPLE}_chrM_filtered.vcf.gz
Key field: AF (allele fraction) indicates heteroplasmy level:
| AF Level | Meaning |
|---|---|
| >0.95 | Homoplasmic — fixed in all mitochondria (haplogroup-defining) |
| 0.10-0.95 | Heteroplasmic — clinically significant range |
| 0.03-0.10 | Low-level heteroplasmy — often age-related somatic |
| <0.03 | Near detection limit |
What to expect:
When to investigate further:
Important caveats about heteroplasmy thresholds:
Cross-reference: Compare with step 12 (haplogrep3) — your homoplasmic variants should match your assigned haplogroup.
Mutect2 in tumor-only mode looks for somatic mutations – variants acquired during your lifetime rather than inherited. From blood-derived WGS, the main category of interest is clonal hematopoiesis (CHIP).
Where to look: ${SAMPLE}/somatic/${SAMPLE}_somatic_filtered.vcf.gz
Key field: AF (allele fraction) indicates the clone size:
| AF Range | Likely Source |
|---|---|
| 0.45-0.55 | Heterozygous germline (false positive) |
| ~1.0 | Homozygous germline (false positive) |
| 0.01-0.10 | Potential low-frequency somatic (CHIP candidate) |
| 0.10-0.40 | Ambiguous – could be somatic, mosaic, or noisy germline |
What to expect:
CHIP genes to check: DNMT3A, TET2, ASXL1, TP53, JAK2, SF3B1, SRSF2, PPM1D, CBL. CHIP prevalence increases with age and is associated with elevated cardiovascular risk and risk of hematologic malignancies.
Important caveats:
Genomic databases are updated continuously. Variants classified as VUS today may be reclassified next year. Periodic re-analysis is one of the most valuable things you can do.
| Database | Update Frequency | Pipeline Steps Affected | How to Update |
|---|---|---|---|
| ClinVar | Weekly | Step 6 (ClinVar screen) | Re-download from NCBI FTP (see 00-reference-setup.md) |
| VEP cache | Every 6 months | Step 13 (VEP annotation) | Download new release from Ensembl FTP |
| PCGR/CPSR data | Every 6-12 months | Step 17 (CPSR) | Download new bundle from PCGR GitHub releases |
| PharmCAT | Every few months | Step 7 (pharmacogenomics) | Pull new Docker image (docker pull pgkb/pharmcat:3.2.0) |
Sanitized examples from a real 30X WGS run, so you know what to expect.
bcftools stats output:
SN 0 number of samples: 1
SN 0 number of records: 5560412
SN 0 number of SNPs: 4198753
SN 0 number of indels: 1361659
SN 0 number of multiallelic sites: 45231
# PASS variants only: 4,650,000-4,700,000
# Ti/Tv ratio: 2.05-2.10 (if < 1.8, something is wrong)
Pathogenic/Likely Pathogenic hits: 4
chr2:47637270 T>C (rs80338939) — GJB2 carrier (hearing loss, recessive)
chr1:45331175 G>A (rs36053993) — MUTYH carrier (CRC risk, recessive)
chr10:124774641 C>T (rs28936670) — ACADSB carrier (metabolic, recessive)
All heterozygous (0/1) = carrier status only. This is a completely normal result.
The HTML report will show a table like:
Gene Diplotype Phenotype Affected Drugs
CYP2C19 *1/*17 Rapid Metabolizer PPIs, SSRIs, clopidogrel
CYP2C9 *1/*1 Normal Metabolizer Warfarin, NSAIDs
NAT2 *5/*6 Slow Acetylator Isoniazid, caffeine
DPYD *1/*1 Normal Metabolizer 5-FU (safe at standard dose)
SLCO1B1 *1a/*1a Normal Function Statins (standard dosing)
Typically 18-21 of 23 genes will have confident calls. CYP2D6 may be “Inconclusive” from short-read WGS (known limitation).
{
"LocusResults": {
"HTT": { "Genotype": "17/19" },
"FMR1": { "Genotype": "29" },
"C9orf72": { "Genotype": "2/3" },
"ATXN1": { "Genotype": "29/29" },
"DMPK": { "Genotype": "12/13" }
}
}
All values well below disease thresholds = ALL CLEAR.
The HTML report tier summary:
Tier 1 (Pathogenic/Likely pathogenic): 0 variants
Tier 2 (VUS with evidence): 3 variants
Tier 3 (VUS limited evidence): 21 variants
Tier 4 (Likely benign/Benign): ~53,000 variants
Zero Tier 1 = ALL CLEAR for cancer predisposition. The VUS count varies widely (20-200+) and is not cause for concern.
# Autosomal ROH > 5 MB: 0
# Total autosomal ROH: 47.3 MB (all segments < 3 MB)
# Conclusion: No evidence of parental relatedness
Normal outbred individual. If total ROH > 100 MB or any segment > 10 MB, investigate further.
tel_content: 553.52
No universal “normal” range — compare between samples of the same age, sequenced on the same platform.
Found something interesting? These free tools help you dig deeper:
An important caveat: different annotation tools may classify the same variant differently.
VEP (used in step 13), SnpEff, and ANNOVAR are the three most common variant annotation tools. Studies have shown that they disagree on consequence predictions for ~5-10% of variants, particularly at splice sites and multi-transcript genes.
What this means for you:
The pipeline uses VEP because it is the most widely used and well-maintained tool, with direct gnomAD frequency integration. But no single tool is perfect.