Your genomics data can come from many different providers. This guide explains what format each vendor delivers, how to get it into the pipeline, and what to watch out for.
| Vendor | Data Type | Typical Format | Genome Build | Pipeline Entry | Price (2025-2026) |
|---|---|---|---|---|---|
| Nebula / DNA Complete | 30X WGS | FASTQ + VCF | GRCh38 | Path A or C | $495 (30X) |
| Dante Labs | 30X WGS | FASTQ + BAM + VCF | GRCh38 | Any | $300-600 |
| Sequencing.com | 30X WGS | FASTQ + BAM + VCF | GRCh38 | Any | $399-799 |
| Novogene / BGI | 30X WGS | FASTQ | GRCh38 | Path A | $200-400 |
| Illumina DRAGEN (clinical) | 30X WGS | ORA / BAM + VCF | GRCh38 | Path D / B / C | $300-1000 |
| Full Genomes Corporation | 30X WGS | BAM + VCF | GRCh38 | Path B or C | ~$1000 |
| Oxford Nanopore | Long-read WGS | POD5 + BAM | GRCh38 | Not supported | $1000-3000 |
| PacBio HiFi | Long-read WGS | HiFi BAM | GRCh38 | Not supported | $1000-2000 |
| 23andMe | Genotyping array | TSV (~640K SNPs) | GRCh37 | Partial | $79-229 |
| AncestryDNA | Genotyping array | TSV (~700K SNPs) | GRCh37 | Partial | $99-199 |
| MyHeritage | Genotyping array | CSV (~643K SNPs) | GRCh37 | Partial | $79-199 |
Most consumer WGS vendors use Illumina sequencing platforms (NovaSeq 6000, NovaSeq X Plus). The data comes in standard formats that this pipeline handles directly.
- FASTQ (.fastq.gz): Raw sequencing reads, paired-end (R1 + R2). Typically 60-90 GB compressed per sample.
- BAM (.bam): Aligned reads. 80-120 GB per sample. Already aligned to GRCh38.
- VCF (.vcf.gz): Variant calls. 80-200 MB per sample. ~4.5-5.5 million variants.

Where each format enters the pipeline:

- Path A (FASTQ): Place the files in ${GENOME_DIR}/${SAMPLE}/fastq/. Start with step 2 (alignment).
- Path B (BAM): Place the file at ${GENOME_DIR}/${SAMPLE}/aligned/${SAMPLE}_sorted.bam. Make sure the BAM index (.bai) is present. Start with step 3 (variant calling).
- Path C (VCF): Place the file at ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz. Make sure the index (.tbi) is present. Start with step 6 (ClinVar screen).

Nebula was acquired by ProPhase Labs and rebranded as DNA Complete. They use MGI/DNBSEQ sequencing (BGI technology), not Illumina.
Download your data from the DNA Complete portal. Both FASTQ and VCF are available.
Download Window Warning: Data access requires an active subscription. If your subscription lapses, you may lose access to your raw data. Download everything immediately after receiving results. Do not assume you can come back later.
Use Path A (FASTQ) for the most complete analysis, or Path C (VCF) if you only want annotation.
Dante Labs is an Italian company that sequences on Illumina NovaSeq, so its output is standard Illumina.
Download Window Warning: Dante Labs data downloads expire 30 days after your results are ready. After this window, the data may be archived or permanently deleted. Multiple users on Reddit have reported losing access to their raw data. Download within 30 days. If you miss the window, contact support. Some users have been able to get extensions, but it is not guaranteed.
All three formats (FASTQ, BAM, VCF) are typically provided. Use whichever entry point matches your goals.
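As a rough sketch of how an entry point could be chosen from a downloaded file name, the function below maps common extensions to the paths and step numbers quoted in this guide. The name `entry_path_for` is hypothetical and is not one of the pipeline's scripts. The extension list is an assumption about how vendors name their files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): map a file name to an entry path.
entry_path_for() {
  case "$1" in
    *.fastq.gz|*.fq.gz) echo "Path A: start at step 2 (alignment)" ;;
    *.bam)              echo "Path B: start at step 3 (variant calling)" ;;
    *.vcf.gz)           echo "Path C: start at step 6 (ClinVar screen)" ;;
    *.ora)              echo "Path D: start at step 1 (ORA decompression)" ;;
    *.cram)             echo "Convert CRAM to BAM first, then Path B" ;;
    *)                  echo "Unknown format: $1" ;;
  esac
}

entry_path_for "sample_R1.fastq.gz"
```

If a vendor gives you several formats, run the function on each file and pick the earliest step you are willing to wait for.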
Sequencing.com is a US-based service offering 30X WGS with a web-based analysis marketplace.
Download Window Warning: If you do not download your data for an extended period, Sequencing.com may archive it to cold storage. Retrieval from cold storage can take 1-3 business days. If you are planning to run this pipeline, request data unarchiving immediately after receiving results.
Use any path, depending on which format you download (FASTQ, BAM, or VCF).
Novogene and BGI are research-focused sequencing services, and the cheapest option listed here for 30X WGS (~$200-400).
Delivery is typically FASTQ only (paired-end, gzipped). BAM and VCF may be available at extra cost or on request.
Download Window Warning: Novogene provides data via their cloud portal for a limited time (typically 90 days). After that, data may be permanently deleted. BGI Direct has similar policies. Download your FASTQ files as soon as they are available.
BGI read names look different from Illumina:
# Illumina:
@A00123:456:HXXXXXXX:1:1101:12345:67890 1:N:0:ATCGATCG
# BGI/DNBSEQ:
@V350012345L1C001R00100000001/1
This does not affect any pipeline step. BWA, minimap2, DeepVariant, and all other tools only use the sequence and quality lines.
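If you want to confirm which naming scheme a FASTQ file uses, a header line can be classified with two regular expressions. This is a minimal sketch: `read_name_style` is a hypothetical helper, and the BGI pattern is inferred only from the single example above, so real files may use variants it does not recognize.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess the read-name style from one FASTQ header line.
read_name_style() {
  # Illumina: seven colon-separated fields (instrument:run:flowcell:lane:tile:x:y)
  illumina_re='^@[^:[:space:]]+:[0-9]+:[^:[:space:]]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+'
  # BGI/DNBSEQ (assumed from the example above): ...L<lane>C<col>R<row+id>/1 or /2
  bgi_re='^@[A-Za-z0-9]+L[0-9]+C[0-9]+R[0-9]+/[12]$'
  if printf '%s\n' "$1" | grep -Eq "$illumina_re"; then
    echo "illumina"
  elif printf '%s\n' "$1" | grep -Eq "$bgi_re"; then
    echo "bgi"
  else
    echo "unknown"
  fi
}

read_name_style '@V350012345L1C001R00100000001/1'
```

To check a real file, pass its first line: `read_name_style "$(gzip -dc ${SAMPLE}_R1.fastq.gz | head -1)"`.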
Use Path A (FASTQ). You’ll need to run the full pipeline from alignment.
If your WGS was done through a clinical lab or hospital, they likely used Illumina’s DRAGEN pipeline for processing.
Some labs deliver FASTQ files compressed in Illumina’s proprietary ORA format (~5x smaller than gzipped FASTQ). You need the orad decompressor:
# Step 1 in this pipeline handles ORA decompression
./scripts/01-ora-to-fastq.sh $SAMPLE
See docs/01-ora-to-fastq.md for details on obtaining the orad binary.
DRAGEN VCFs include non-standard annotations (e.g., DRAGEN: prefixed INFO fields). These are ignored by standard tools but may cause warnings. This is harmless.
If your lab provided a DRAGEN-called VCF, you can skip steps 2-3 and go directly to analysis (Path C). However, re-calling variants with DeepVariant (step 3) from the BAM may find additional variants that DRAGEN missed, especially in difficult regions.
Some providers deliver CRAM instead of BAM (40-60% smaller). Convert to BAM first:
docker run --rm \
-v ${GENOME_DIR}:/genome \
staphb/samtools:1.21 \
samtools view -b \
-T /genome/reference/Homo_sapiens_assembly38.fasta \
-o /genome/${SAMPLE}/aligned/${SAMPLE}_sorted.bam \
/genome/${SAMPLE}/aligned/${SAMPLE}.cram
# Index the BAM
docker run --rm \
-v ${GENOME_DIR}:/genome \
staphb/samtools:1.21 \
samtools index /genome/${SAMPLE}/aligned/${SAMPLE}_sorted.bam
Important: CRAM decoding requires the same reference genome used for encoding. This pipeline uses Homo_sapiens_assembly38.fasta (GRCh38). If your CRAM was encoded against a different reference, you’ll get errors.
Nanopore produces long reads (10-50 kb average) with different error profiles than Illumina. This pipeline’s tools are optimized for short reads and will produce incorrect results with nanopore data.
What you’d need instead:
- Alignment: minimap2 -ax map-ont (not the short-read preset, -ax sr)

PacBio HiFi reads are highly accurate (>Q20) and 10-20 kb long. Different tools are required:

- Alignment: pbmm2 or minimap2 -ax map-hifi
- Structural variant calling: pbsv or Sniffles2

A long-read pipeline branch may be added in the future. For now, these are the recommended tools.
These services use genotyping chips that test ~600,000-700,000 specific positions. This is not whole genome sequencing – it covers ~0.02% of your genome.
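The ~0.02% figure is simply the number of chip positions divided by the length of the genome. The sketch below assumes 650,000 positions (the middle of the range above) and a genome length of roughly 3.1 billion bases; both inputs are rounded assumptions, and `chip_coverage_percent` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Percentage of the genome directly tested by a genotyping chip.
# Arguments: number of chip positions, genome length in bases (both assumed values).
chip_coverage_percent() {
  awk -v snps="$1" -v genome="$2" 'BEGIN { printf "%.3f\n", snps / genome * 100 }'
}

chip_coverage_percent 650000 3100000000   # prints 0.021
```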
For detailed conversion instructions, which pipeline steps work, imputation guidance, and what to realistically expect from chip data, see the chip data guide.
This pipeline uses GRCh38 (hg38) exclusively. If your data may be on an older build, check first, then convert:
# For BAM files -- look at the reference in the header:
samtools view -H your_file.bam | grep "^@SQ" | head -3
# GRCh38 uses "chr" prefix: SN:chr1, SN:chr2, etc.
# GRCh37 may lack "chr" prefix: SN:1, SN:2, etc.
# (Some GRCh37 builds do use chr prefix -- check chromosome lengths to be sure)
# chr1 length: GRCh38 = 248,956,422; GRCh37 = 249,250,621
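The length comparison described in the comments above can be automated. The following sketch reads SAM header lines on standard input and reports the build from the length of chromosome 1, whether or not the "chr" prefix is used. `guess_build` is a hypothetical helper, and it only knows the two chr1 lengths quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read SAM header lines on stdin, guess the build from chr1's length.
guess_build() {
  awk -F'\t' '
    $1 == "@SQ" {
      name = ""; len = 0
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SN:/) name = substr($i, 4)
        if ($i ~ /^LN:/) len  = substr($i, 4) + 0
      }
      if (name == "chr1" || name == "1") {
        if (len == 248956422)      print "GRCh38"
        else if (len == 249250621) print "GRCh37"
        else                       print "unknown"
        found = 1
        exit
      }
    }
    END { if (!found) print "unknown" }
  '
}

# Example with a hand-written header line instead of a real BAM:
printf '@SQ\tSN:1\tLN:249250621\n' | guess_build   # prints GRCh37
```

With a real file, pipe in the header: `samtools view -H your_file.bam | guess_build`.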
Best approach (recommended): Extract FASTQ from BAM and re-align to GRCh38:
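# Extract paired-end FASTQ from BAM
# -0 and -s discard unpaired and singleton reads, which samtools would otherwise write to stdout.
# -n leaves read names unchanged (no /1 or /2 suffix added).
docker run --rm -v ${GENOME_DIR}:/genome staphb/samtools:1.21 \
bash -c "samtools sort -n /genome/${SAMPLE}/old_hg19.bam | \
samtools fastq -1 /genome/${SAMPLE}/fastq/${SAMPLE}_R1.fastq.gz \
-2 /genome/${SAMPLE}/fastq/${SAMPLE}_R2.fastq.gz \
-0 /dev/null -s /dev/null -n -"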
# Then run the pipeline from step 2 (alignment to GRCh38)
./scripts/02-alignment.sh $SAMPLE
Alternative (quicker but less accurate): Use Picard LiftoverVcf to convert VCF coordinates. This can introduce artifacts at complex regions and is not recommended for clinical use.
Regardless of vendor, follow this checklist before starting the pipeline:
Even if you only plan to use VCF, download FASTQ and BAM too. Storage is cheap; re-sequencing is not.
| Priority | File | Why |
|---|---|---|
| Critical | FASTQ (R1 + R2) | Raw data. Can regenerate everything else. |
| High | BAM + BAI | Skip alignment (saves 1-2 hours). Needed for SV calling. |
| Medium | VCF + TBI | Quick start for annotation steps. |
| Low | Lab report PDF | Reference for comparing your pipeline results |
| Low | QC metrics | Coverage stats, insert size distribution |
Large files can be silently truncated during download:
# Check FASTQ integrity
gzip -t ${SAMPLE}_R1.fastq.gz && echo "R1 OK" || echo "R1 CORRUPT"
gzip -t ${SAMPLE}_R2.fastq.gz && echo "R2 OK" || echo "R2 CORRUPT"
# Check BAM integrity
docker run --rm -v ${GENOME_DIR}:/genome staphb/samtools:1.21 \
samtools quickcheck /genome/${SAMPLE}/aligned/${SAMPLE}_sorted.bam \
&& echo "BAM OK" || echo "BAM CORRUPT"
# Check VCF integrity
docker run --rm -v ${GENOME_DIR}:/genome staphb/bcftools:1.21 \
bcftools view -h /genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz > /dev/null \
&& echo "VCF OK" || echo "VCF CORRUPT"
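If any file is corrupt, re-download it. wget -c resumes a partial download, which helps when the file was truncated. If a completed download still fails the check, delete it and download it again from scratch.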
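When a vendor delivers many gzip-compressed files, the same `gzip -t` test can be looped over all of them. A minimal sketch, with `check_gzip_files` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run gzip -t on every file given as an argument.
# Prints one status line per file and returns non-zero if any file fails.
check_gzip_files() {
  local status=0 f
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "OK      $f"
    else
      echo "CORRUPT $f"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_gzip_files ${GENOME_DIR}/${SAMPLE}/fastq/*.fastq.gz`. Because bgzip output is gzip-compatible, the same call also works on a .vcf.gz file and, unlike a header-only check, reads the file to the end.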
If a file is suspiciously small, the download likely failed:
| File | Expected Size (30X WGS) | Suspicious If |
|---|---|---|
| FASTQ (each, gzipped) | 30-45 GB | < 10 GB |
| BAM | 80-120 GB | < 40 GB |
| VCF (bgzipped) | 80-200 MB | < 10 MB |
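The "Suspicious If" column can be turned into a quick check. The sketch below compares a file's size with a minimum given in bytes; `check_min_size` is a hypothetical helper, and treating 1 GB as 1024^3 bytes is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flag a file whose size is below a minimum number of bytes.
# Returns 0 if the file is large enough, 1 if it is suspiciously small, 2 if it is missing.
check_min_size() {
  local file="$1" min_bytes="$2" size
  [ -f "$file" ] || { echo "MISSING $file"; return 2; }
  size=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "SUSPICIOUS $file ($size bytes)"
    return 1
  fi
  echo "OK $file ($size bytes)"
}
```

Using the thresholds from the table, a FASTQ file would be checked with `check_min_size "${SAMPLE}_R1.fastq.gz" $((10 * 1024 * 1024 * 1024))` and a VCF with `check_min_size "${SAMPLE}.vcf.gz" $((10 * 1024 * 1024))`.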
The pipeline creates output alongside your input data. Before running anything, copy your raw FASTQ/BAM/VCF to a separate backup location. If something goes wrong, you want to avoid re-downloading from a vendor whose download window may have expired.
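One way to make the backup and confirm it is intact is to copy each file and compare the copy byte for byte. This is a sketch only: `backup_and_verify` and the backup path in the usage example are hypothetical, and `cmp` re-reads both files, so expect it to take a while on a 100 GB BAM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a file into a backup directory and
# confirm the copy is byte-identical to the original.
backup_and_verify() {
  local src="$1" dest_dir="$2" dest
  mkdir -p "$dest_dir" || return 1
  dest="$dest_dir/$(basename "$src")"
  cp -p "$src" "$dest" || return 1
  if cmp -s "$src" "$dest"; then
    echo "BACKUP OK $dest"
  else
    echo "BACKUP MISMATCH $dest"
    return 1
  fi
}
```

For example: `backup_and_verify "${GENOME_DIR}/${SAMPLE}/fastq/${SAMPLE}_R1.fastq.gz" /mnt/backup/${SAMPLE}`, where /mnt/backup stands in for whatever separate disk you use.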
Know what to expect before downloading:
| File Type | Typical Size (30X WGS) | Notes |
|---|---|---|
| FASTQ (gzipped, paired) | 60-90 GB | Two files: R1 + R2 |
| ORA (Illumina compressed) | 15-20 GB | Same data as FASTQ, ~5x smaller |
| BAM (aligned) | 80-120 GB | Largest single file |
| CRAM (compressed aligned) | 40-60 GB | 40-60% smaller than BAM |
| VCF (variants) | 80-200 MB | Relatively small |
| gVCF (with reference blocks) | 3-10 GB | Much larger than VCF |
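Taking the top of the ranges in the table, downloading FASTQ, BAM, and VCF together needs about 90 + 120 + 0.2 ≈ 210 GB, before any backup copy or pipeline output. The sketch below checks whether a directory's filesystem has at least that much free; `has_free_space` is a hypothetical helper and it assumes the POSIX `df -Pk` output layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the filesystem holding a directory
# has at least the requested number of GB (1024^3 bytes) available.
has_free_space() {
  local dir="$1" needed_gb="$2" avail_kb
  # df -Pk prints a header line, then one line whose 4th field is available KB.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}
```

For example: `has_free_space "${GENOME_DIR}" 210 && echo "enough space" || echo "free up space first"`.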