If you have raw data from a consumer genotyping service instead of whole genome sequencing, you can still run a meaningful subset of this pipeline. This guide explains what works, what doesn’t, and how to get the most out of ~600K SNPs.
| WGS (30X) | Genotyping Array | |
|---|---|---|
| Positions tested | ~3 billion (entire genome) | ~600,000-700,000 (preselected SNPs) |
| Coverage | 100% of sequenceable genome | ~0.02% |
| Can detect | SNPs, indels, SVs, CNVs, repeats, mito | Common SNPs only |
| Cannot detect | — | Rare variants, indels, SVs, STR expansions, CNVs |
| Raw data format | FASTQ (reads) or BAM (aligned reads) | Text file of genotypes (no reads) |
| Typical file size | 60-120 GB | 10-30 MB |
| Cost | $200-1,000 | $79-229 |
Bottom line: Arrays test positions that were preselected because they are common in human populations and useful for ancestry or trait prediction. They miss the rare variants that are most likely to be clinically significant. But for pharmacogenomics and polygenic risk, common variants are exactly what you need.
.txt file (tab-separated, ~25 MB zipped).csv file (~15 MB zipped).txt file (tab-separated, ~15 MB zipped)Important: All three services use GRCh37 (hg19) coordinates. This pipeline requires GRCh38. The conversion steps below handle this.
Consumer chip raw data files contain genotype calls (e.g., “AG” at a position) but do not tell you which allele is the reference allele on the human genome. To create a valid VCF, every position needs its REF allele looked up from a reference FASTA. Without this step, homozygous ALT calls get written as REF/REF and heterozygous calls can have ref/alt swapped — silently corrupting all downstream analysis.
The conversion requires two stages:
bcftools convert --tsv2vcf with a GRCh37 reference FASTAWhy not plink? An earlier version of this guide used plink 1.9 (
--23file) + plink2 (--ref-from-fa). That pipeline silently corrupts homozygous ALT genotypes in single-sample data because plink’s binary format cannot represent both alleles for monomorphic sites.bcftools convert --tsv2vcfreads the reference FASTA directly and handles all genotype classes correctly.
One-time downloads (~3.5 GB total, plus the GRCh38 reference from step 00):
GENOME_DIR=${GENOME_DIR:?Set GENOME_DIR to your data directory}
mkdir -p "${GENOME_DIR}/liftover" "${GENOME_DIR}/reference_hg19"
# GRCh37 reference FASTA (needed for ref/alt resolution, ~3 GB)
wget -q -O "${GENOME_DIR}/reference_hg19/human_g1k_v37.fasta.gz" \
"https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz"
gunzip "${GENOME_DIR}/reference_hg19/human_g1k_v37.fasta.gz"
# Index the reference
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
staphb/samtools:1.21 \
samtools faidx /genome/reference_hg19/human_g1k_v37.fasta
# GRCh37-to-GRCh38 liftover chain file (~500 KB)
wget -q -O "${GENOME_DIR}/liftover/hg19ToHg38.over.chain.gz" \
"https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz"
GRCh38 reference required: The liftover step (stage 3) needs
Homo_sapiens_assembly38.fastain${GENOME_DIR}/reference/. If you haven’t set up the pipeline’s reference data yet, follow step 00 — reference setup first.
All three vendor formats need to be converted to a tab-separated file with columns: rsID chromosome position genotype (23andMe/AncestryDNA are already in this format). Then bcftools convert --tsv2vcf creates a proper VCF by looking up each position’s reference allele from the FASTA.
A ready-to-use script is provided at scripts/chip-to-vcf.sh. You can also run the steps manually:
SAMPLE=your_name
GENOME_DIR=/path/to/your/data
mkdir -p "${GENOME_DIR}/${SAMPLE}/vcf"
# --- Pre-step: Convert MyHeritage CSV to TSV ---
# (Skip this for 23andMe/AncestryDNA — their files are already TSV)
#
# MyHeritage CSVs have quoted fields and a different header.
# Strip comments, headers, and quotes, then rearrange to TSV.
grep -v "^#" "${GENOME_DIR}/${SAMPLE}/raw/MyHeritage_raw_dna_data.csv" | \
grep -v "^RSID" | \
sed 's/"//g' | \
awk -F',' '{print $1"\t"$2"\t"$3"\t"$4}' \
> "${GENOME_DIR}/${SAMPLE}/raw/${SAMPLE}_raw.txt"
# --- Stage 1: Import genotypes + fix ref/alt (single step) ---
#
# bcftools convert --tsv2vcf reads the TSV and looks up the REF allele
# from the FASTA at each position. This correctly handles:
# - Homozygous reference (e.g., AA where REF=A → GT 0/0)
# - Heterozygous (e.g., AG where REF=A → GT 0/1)
# - Homozygous ALT (e.g., AA where REF=T → GT 1/1)
# The -c flag maps columns: ID=rsid, CHROM=chromosome, POS=position, AA=genotype
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
staphb/bcftools:1.21 \
bcftools convert --tsv2vcf "/genome/${SAMPLE}/raw/${SAMPLE}_raw.txt" \
-f /genome/reference_hg19/human_g1k_v37.fasta \
-s "${SAMPLE}" \
-c ID,CHROM,POS,AA \
-Oz -o "/genome/${SAMPLE}/raw/${SAMPLE}_hg19.vcf.gz"
# Add "chr" prefix (required by the liftover chain file)
printf '%s\n' $(seq 1 22) X Y MT | \
awk '{print $1" chr"$1}' > "${GENOME_DIR}/reference_hg19/chr_rename.txt"
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
staphb/bcftools:1.21 \
bcftools annotate \
--rename-chrs /genome/reference_hg19/chr_rename.txt \
"/genome/${SAMPLE}/raw/${SAMPLE}_hg19.vcf.gz" \
-Oz -o "/genome/${SAMPLE}/raw/${SAMPLE}_hg19_chr.vcf.gz"
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
staphb/bcftools:1.21 \
bcftools index -t "/genome/${SAMPLE}/raw/${SAMPLE}_hg19_chr.vcf.gz"
# --- Stage 2: Liftover to GRCh38 ---
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
broadinstitute/picard:latest \
java -jar /usr/picard/picard.jar LiftoverVcf \
I="/genome/${SAMPLE}/raw/${SAMPLE}_hg19_chr.vcf.gz" \
O="/genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz" \
CHAIN=/genome/liftover/hg19ToHg38.over.chain.gz \
R=/genome/reference/Homo_sapiens_assembly38.fasta \
REJECT="/genome/${SAMPLE}/raw/${SAMPLE}_liftover_rejected.vcf.gz" \
WARN_ON_MISSING_CONTIG=true
# Index the final VCF
docker run --rm --user root \
-v "${GENOME_DIR}:/genome" \
staphb/bcftools:1.21 \
bcftools index -t -f "/genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz"
echo "Done. VCF at: ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz"
Expected output: A bgzipped, tabix-indexed VCF on GRCh38 coordinates with ~600,000 SNPs. Typical breakdown: ~430K hom-ref, ~107K het, ~66K hom-alt, ~1K missing. Some variants (~0.3%) are rejected during liftover; ~900 have swapped REF/ALT between builds.
X/Y/MT chromosomes: All chromosomes are converted. For mitochondrial haplogroup estimation from chip data, dedicated tools like HaploGrep accept raw 23andMe files directly.
23andMe / AncestryDNA: These are already tab-separated. Skip the MyHeritage CSV conversion pre-step and place your file directly at
${GENOME_DIR}/${SAMPLE}/raw/${SAMPLE}_raw.txt.
Imputation can expand your 600K chip variants to ~40M by predicting untyped genotypes from population reference panels. This significantly improves PRS variant matching.
Note on Michigan Imputation Server: MIS may require multiple samples per job (see step 14 docs). TOPMed is generally more accessible for single-sample chip data. Check each server’s current policies before uploading.
| Step | Name | Why It Works | Notes |
|---|---|---|---|
| 6 | ClinVar screen | Checks your variants against known pathogenic entries | You’ll only find pathogenic variants that happen to be on the chip. Most clinically significant rare variants will be missed. |
| 7 | PharmCAT | Pharmacogenomic star alleles from SNP genotypes | Calls many genes (CYP2B6, CYP4F2, DPYD, NUDT15, TPMT, SLCO1B1, UGT1A1) but misses key genes like CYP2C19 and VKORC1 on some chip versions due to missing positions. Expect 888+ missing PGx positions. Always compare with WGS results if available. |
| 11 | ROH analysis | Runs of homozygosity from SNP genotypes | Works, but requires the -G30 flag (chip VCFs lack PL tags). Large ROH (>1 MB) are detectable. |
| 25 | PRS | Polygenic risk scores from common variants | Works with no-mean-imputation flag (single sample lacks allele frequencies). Matches ~12% of large scoring files (vs ~28% from WGS). Scores are not directly comparable to WGS scores. |
| 27 | CPIC lookup | Drug-gene recommendations | Works if step 7 (PharmCAT) succeeds. |
| Step | Name | Limitation |
|---|---|---|
| 13 | VEP annotation | Runs, but annotating 600K variants is much less useful than annotating 5M. The rare, potentially significant variants are the ones arrays miss. |
| 17 | CPSR | Runs, but cancer predisposition screening on chip data has very low sensitivity. Most pathogenic variants in cancer genes are rare and not on the chip. A negative CPSR result from chip data does NOT rule out cancer predisposition. |
| Step | Name | Why |
|---|---|---|
| 2 | Alignment | No reads to align |
| 3 | DeepVariant | No BAM file |
| 4, 18, 19 | SV callers (Manta, CNVnator, Delly) | Need read-level evidence |
| 5, 15 | SV annotation (AnnotSV, duphold) | No SV calls to annotate |
| 8 | HLA typing (T1K) | Needs reads spanning HLA region |
| 9 | ExpansionHunter | Needs reads spanning repeat regions |
| 10 | TelomereHunter | Needs telomeric reads |
| 12 | Mito haplogroup | The conversion workflow produces autosomal-only VCF. For mt haplogroup from chip data, use HaploGrep directly with your raw file. |
| 16 | Coverage QC (indexcov) | No BAM to assess coverage |
| 20 | Mito variant calling (Mutect2) | Needs BAM |
| 21 | CYP2D6 (Cyrius) | Needs BAM |
| 22 | SV consensus merge | No SV calls |
| 23 | Clinical filter | Requires VEP-annotated VCF with gnomAD. Limited value on chip data. |
| 26 | Ancestry PCA | The current step 26 implementation requires >=2 samples for PCA and produces no output for a single sample. For ancestry from chip data, use the provider’s built-in ancestry tools or upload to a service like DNA Painter. |
| Analysis | Usefulness | Confidence |
|---|---|---|
| Pharmacogenomics | Moderate-High | Calls many genes correctly, but misses some critical ones (e.g., CYP2C19, VKORC1) depending on chip version. Some calls may differ from WGS (e.g., CYP3A5). |
| Carrier screening | Low-moderate | Only finds carriers for variants on the chip |
| Cancer predisposition | Very low | Most pathogenic variants are rare and NOT on the chip |
| Polygenic risk scores | Low-Moderate | Matches ~12% of large scoring files. Raw scores differ substantially from WGS and are not comparable without a matched reference cohort. |
| Ancestry | N/A via pipeline | Use your provider’s ancestry tools or HaploGrep for mt haplogroup |
| Analysis | Usefulness | Confidence |
|---|---|---|
| PRS | High | Comparable to WGS for well-imputed regions |
| ClinVar screening | Moderate | Imputed rare variants have lower confidence (check R2 scores) |
| PharmCAT | High | Same as chip alone (PGx SNPs are directly genotyped) |
Download raw data from 23andMe / MyHeritage / AncestryDNA
|
v
Convert to GRCh38 VCF (bcftools + liftover, see above)
|
+---> Step 7: PharmCAT -----> Step 27: CPIC drug-gene report
|
+---> Step 25: PRS (polygenic risk scores)
|
+---> Step 6: ClinVar screen (limited but useful)
|
+---> Step 11: ROH analysis
|
(optional)
|
v
Impute via TOPMed server
|
+---> Re-run step 25 with imputed VCF (better PRS)
Minimum useful run (takes ~15 minutes): Steps 7 + 27 (pharmacogenomics + drug recommendations). This is the single highest-value analysis you can do with chip data.
A clean ClinVar screen does NOT mean you have no pathogenic variants. The chip only tests ~600K positions out of 3 billion. The vast majority of known pathogenic variants are rare and not on the chip.
PRS from chip data is reasonable but less precise than PRS from WGS. The scoring files may reference variants that aren’t on your chip and can’t be imputed.
PharmCAT coverage depends on chip version. In our smoke test with MyHeritage GSA chip data, PharmCAT correctly called CYP2B6, CYP4F2, DPYD, NUDT15, TPMT, and UGT1A1 — but failed to call CYP2C19 (25 missing positions) and VKORC1 (1 missing position), and miscalled CYP3A5 as *1/*1 instead of *3/*3 (4 missing positions). 23andMe v5 may cover more PGx positions. PharmCAT will report “Not called” for genes without sufficient data — this is preferable to a wrong call.
Imputed genotypes are predictions, not observations. Imputation accuracy varies by ancestry and local linkage disequilibrium. Common variants (MAF > 5%) impute well. Rare variants (MAF < 1%) impute poorly and should not be used for clinical decisions.
If you find something concerning, get WGS. Chip data is a screening tool. Any significant finding should be confirmed with either WGS or targeted Sanger sequencing through a clinical lab.
| Approach | Cost | Variants | Best For |
|---|---|---|---|
| Chip alone | $0 (already have data) | ~600K | PharmCAT, basic PRS |
| Chip + imputation | $0 (free servers) | ~40M (imputed) | Better PRS |
| Budget WGS (Novogene) | $200-400 | ~5M (observed) | Full pipeline |
| Standard WGS (Nebula, Dante) | $300-600 | ~5M (observed) | Full pipeline + data retention |
If you are serious about genomic analysis, WGS is worth the investment. But if you already have chip data and want to start exploring, the pharmacogenomics and PRS steps deliver real value today.