genomics-pipeline

Vendor Compatibility Guide

Your genomics data can come from many different providers. This guide explains what format each vendor delivers, how to get it into the pipeline, and what to watch out for.

Quick Reference

| Vendor | Data Type | Typical Format | Genome Build | Pipeline Entry | Price (2025-2026) |
|---|---|---|---|---|---|
| Nebula / DNA Complete | 30X WGS | FASTQ + VCF | GRCh38 | Path A or C | $495 (30X) |
| Dante Labs | 30X WGS | FASTQ + BAM + VCF | GRCh38 | Any | $300-600 |
| Sequencing.com | 30X WGS | FASTQ + BAM + VCF | GRCh38 | Any | $399-799 |
| Novogene / BGI | 30X WGS | FASTQ | GRCh38 | Path A | $200-400 |
| Illumina DRAGEN (clinical) | 30X WGS | ORA / BAM + VCF | GRCh38 | Path D / B / C | $300-1000 |
| Full Genomes Corporation | 30X WGS | BAM + VCF | GRCh38 | Path B or C | ~$1000 |
| Oxford Nanopore | Long-read WGS | POD5 + BAM | GRCh38 | Not supported | $1000-3000 |
| PacBio HiFi | Long-read WGS | HiFi BAM | GRCh38 | Not supported | $1000-2000 |
| 23andMe | Genotyping array | TSV (~640K SNPs) | GRCh37 | Partial | $79-229 |
| AncestryDNA | Genotyping array | TSV (~700K SNPs) | GRCh37 | Partial | $99-199 |
| MyHeritage | Genotyping array | CSV (~643K SNPs) | GRCh37 | Partial | $79-199 |

Illumina Short-Read WGS (Most Vendors)

Most consumer WGS vendors use Illumina sequencing platforms (NovaSeq 6000, NovaSeq X Plus). The data comes in standard formats that this pipeline handles directly.

What You’ll Receive

Getting Started

  1. If you have FASTQ: Copy R1 and R2 files to ${GENOME_DIR}/${SAMPLE}/fastq/. Start with step 2 (alignment).
  2. If you have BAM: Copy to ${GENOME_DIR}/${SAMPLE}/aligned/${SAMPLE}_sorted.bam. Make sure the BAM index (.bai) is present. Start with step 3 (variant calling).
  3. If you have VCF: Copy to ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz. Make sure the index (.tbi) is present. Start with step 6 (ClinVar screen).
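
The three cases above can be sketched as a small helper that looks at what is on disk and suggests a starting step. This is only an illustration: the function name `suggest_entry_point`, the `*_R1*` / `*_R2*` file-name globs, and the FASTQ-first priority are assumptions, not part of the pipeline scripts.

```shell
# Print the first existing path among the arguments (nothing if none exist).
first_existing() {
    for f in "$@"; do
        if [ -e "$f" ]; then echo "$f"; return 0; fi
    done
    return 0
}

# Suggest a starting step from the files present for one sample.
# Assumed layout: <sample_dir>/fastq, <sample_dir>/aligned, <sample_dir>/vcf.
suggest_entry_point() {
    sample_dir=$1    # e.g. "${GENOME_DIR}/${SAMPLE}"
    sample=$2        # e.g. "${SAMPLE}"

    # The R1/R2 globs are a guess at vendor naming; adjust as needed.
    r1=$(first_existing "$sample_dir"/fastq/*_R1*.fastq.gz)
    r2=$(first_existing "$sample_dir"/fastq/*_R2*.fastq.gz)
    bam="$sample_dir/aligned/${sample}_sorted.bam"
    vcf="$sample_dir/vcf/${sample}.vcf.gz"

    if [ -n "$r1" ] && [ -n "$r2" ]; then
        echo "FASTQ found: start with step 2 (alignment)"
    elif [ -f "$bam" ]; then
        # Accept either sample_sorted.bam.bai or sample_sorted.bai.
        if [ -f "$bam.bai" ] || [ -f "${bam%.bam}.bai" ]; then
            echo "BAM found: start with step 3 (variant calling)"
        else
            echo "BAM found but its index (.bai) is missing"
            return 1
        fi
    elif [ -f "$vcf" ]; then
        if [ -f "$vcf.tbi" ]; then
            echo "VCF found: start with step 6 (ClinVar screen)"
        else
            echo "VCF found but its index (.tbi) is missing"
            return 1
        fi
    else
        echo "No FASTQ, BAM, or VCF found under $sample_dir"
        return 1
    fi
}

# Usage:
#   suggest_entry_point "${GENOME_DIR}/${SAMPLE}" "${SAMPLE}"
```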

Nebula Genomics / DNA Complete

Nebula was acquired by ProPhase Labs and rebranded as DNA Complete. They use MGI/DNBSEQ sequencing (BGI technology), not Illumina.

What’s Different

Data Access

Download your data from the DNA Complete portal. Both FASTQ and VCF are available.

Download Window Warning: Data access requires an active subscription. If your subscription lapses, you may lose access to your raw data. Download everything immediately after receiving results. Do not assume you can come back later.

Entry Point

Use Path A (FASTQ) for the most complete analysis, or Path C (VCF) if you only want annotation.


Dante Labs

Italian company using Illumina NovaSeq. Standard Illumina output.

Known Issues

Download Window Warning: Dante Labs data downloads expire 30 days after your results are ready. After this window, the data may be archived or permanently deleted. Multiple users on Reddit have reported losing access to their raw data. Download within 30 days. If you miss the window, contact support — some users have been able to get extensions, but it is not guaranteed.

Entry Point

All three formats (FASTQ, BAM, VCF) are typically provided. Use whichever entry point matches your goals.


Sequencing.com

US-based service offering 30X WGS with a web-based analysis marketplace.

Known Issues

Download Window Warning: If you do not download your data for an extended period, Sequencing.com may archive it to cold storage. Retrieval from cold storage can take 1-3 business days. If you are planning to run this pipeline, request data unarchiving immediately after receiving results.

Entry Point

Any path, depending on what format you download (FASTQ/BAM/VCF).


Novogene / BGI Direct

Research-focused sequencing service. Cheapest option for 30X WGS (~$200-400).

What You’ll Receive

Typically FASTQ only (paired-end, gzipped). BAM and VCF may be available at extra cost or on request.

Download Window Warning: Novogene provides data via their cloud portal for a limited time (typically 90 days). After that, data may be permanently deleted. BGI Direct has similar policies. Download your FASTQ files as soon as they are available.

BGI/MGI FASTQ Quirks

BGI read names look different from Illumina:

# Illumina:
@A00123:456:HXXXXXXX:1:1101:12345:67890 1:N:0:ATCGATCG

# BGI/DNBSEQ:
@V350012345L1C001R00100000001/1

This does not affect any pipeline step. BWA, minimap2, DeepVariant, and the other tools treat the read name as an opaque identifier; alignment and variant calling depend only on the sequence and quality lines.
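
If you are unsure which platform produced a FASTQ file, the first header line usually tells you. The sketch below is a heuristic: its two patterns are derived only from the example read names shown above, and the function name `fastq_name_style` is made up for illustration. Other instruments or vendor renaming may produce names that match neither pattern.

```shell
# Classify a FASTQ header line by read-name style (heuristic).
fastq_name_style() {
    header=$1
    # Illumina-style: seven colon-separated fields, then optional comment.
    if printf '%s\n' "$header" |
        grep -Eq '^@[^:[:space:]]+(:[^:[:space:]]+){6}([[:space:]]|$)'; then
        echo "illumina"
    # BGI/DNBSEQ-style: flowcell ID, then L<lane>C<col>R<row...>, optional /1 or /2.
    elif printf '%s\n' "$header" |
        grep -Eq '^@[A-Za-z0-9]+L[0-9]+C[0-9]+R[0-9]+(/[12])?$'; then
        echo "bgi"
    else
        echo "unknown"
    fi
}

# Usage:
#   fastq_name_style "$(gzip -dc "${SAMPLE}_R1.fastq.gz" | head -n 1)"
```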

Entry Point

Path A (FASTQ). You’ll need to run the full pipeline from alignment.


Illumina DRAGEN (Clinical/Hospital)

If your WGS was done through a clinical lab or hospital, they likely used Illumina’s DRAGEN pipeline for processing.

ORA Format

Some labs deliver FASTQ files compressed in Illumina’s proprietary ORA format (~5x smaller than gzipped FASTQ). You need the orad decompressor:

# Step 1 in this pipeline handles ORA decompression
./scripts/01-ora-to-fastq.sh $SAMPLE

See docs/01-ora-to-fastq.md for details on obtaining the orad binary.

DRAGEN VCF Notes

DRAGEN VCFs include non-standard annotations (e.g., DRAGEN: prefixed INFO fields). These are ignored by standard tools but may cause warnings. This is harmless.
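
To see which annotations a given VCF actually declares, you can list the INFO IDs from its header. A minimal sketch, assuming the header is piped in on stdin; `list_info_ids` is a hypothetical helper name, and the docker invocation in the usage comment simply mirrors the bcftools command used elsewhere in this guide.

```shell
# Print the ID of every ##INFO line in a VCF header read from stdin.
list_info_ids() {
    sed -n 's/^##INFO=<ID=\([^,>]*\).*/\1/p'
}

# Usage:
#   docker run --rm -v ${GENOME_DIR}:/genome staphb/bcftools:1.21 \
#     bcftools view -h /genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz | list_info_ids
```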

If your lab provided a DRAGEN-called VCF, you can skip steps 2-3 and go directly to analysis (Path C). However, re-calling variants with DeepVariant (step 3) from the BAM may find additional variants that DRAGEN missed, especially in difficult regions.

Entry Point

Path D (ORA), Path B (BAM), or Path C (VCF), depending on which files your lab delivered.


CRAM Files

Some providers deliver CRAM instead of BAM (40-60% smaller). Convert to BAM first:

docker run --rm \
  -v ${GENOME_DIR}:/genome \
  staphb/samtools:1.21 \
  samtools view -b \
    -T /genome/reference/Homo_sapiens_assembly38.fasta \
    -o /genome/${SAMPLE}/aligned/${SAMPLE}_sorted.bam \
    /genome/${SAMPLE}/aligned/${SAMPLE}.cram

# Index the BAM
docker run --rm \
  -v ${GENOME_DIR}:/genome \
  staphb/samtools:1.21 \
  samtools index /genome/${SAMPLE}/aligned/${SAMPLE}_sorted.bam

Important: CRAM decoding requires the same reference genome used for encoding. This pipeline uses Homo_sapiens_assembly38.fasta (GRCh38). If your CRAM was encoded against a different reference, you’ll get errors.
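
Before converting, it can help to look at what the CRAM header says about its reference. The `@SQ` header lines carry each sequence's name (`SN`), length (`LN`), and usually an MD5 checksum (`M5`), which you can compare with your reference's sequence dictionary if you have one. The helper below is a sketch; the name `sq_summary` is an assumption, and it only reformats header text piped in on stdin.

```shell
# Summarize @SQ lines from a SAM/BAM/CRAM header read on stdin.
# Output: name, length, and M5 checksum ("-" if absent), tab-separated.
sq_summary() {
    awk -F '\t' '
        $1 == "@SQ" {
            sn = ln = m5 = "-"
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) sn = substr($i, 4)
                else if ($i ~ /^LN:/) ln = substr($i, 4)
                else if ($i ~ /^M5:/) m5 = substr($i, 4)
            }
            print sn "\t" ln "\t" m5
        }'
}

# Usage:
#   docker run --rm -v ${GENOME_DIR}:/genome staphb/samtools:1.21 \
#     samtools view -H /genome/${SAMPLE}/aligned/${SAMPLE}.cram | sq_summary | head -3
```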


Long-Read Sequencing (Not Supported)

Oxford Nanopore (MinION / PromethION)

Nanopore produces long reads (10-50 kb average) with different error profiles than Illumina. This pipeline’s tools are optimized for short reads and will produce incorrect results with nanopore data.

What you’d need instead:

PacBio HiFi

PacBio HiFi reads are highly accurate (>Q20) and 10-20 kb long. Different tools required:

A long-read pipeline branch may be added in the future. For now, these are the recommended tools.


Genotyping Arrays (Partial Support)

23andMe, AncestryDNA, MyHeritage

These services use genotyping chips that test ~600,000-700,000 specific positions. This is not whole genome sequencing – it covers ~0.02% of your genome.
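
The ~0.02% figure is simply the number of tested positions divided by the genome size. A quick check, assuming roughly 3.1 billion bases for the human genome (an approximation not stated above) and 650,000 positions as a mid-range chip size:

```shell
# Percentage of the genome covered by N array positions.
array_coverage_pct() {
    awk -v n="$1" -v g="$2" 'BEGIN { printf "%.3f%%\n", n / g * 100 }'
}

array_coverage_pct 650000 3100000000   # prints 0.021%
```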

For detailed conversion instructions, which pipeline steps work, imputation guidance, and what to realistically expect from chip data, see the chip data guide.


Genome Build: GRCh37 (hg19) vs GRCh38 (hg38)

This pipeline uses GRCh38 (hg38) exclusively. If your data is on an older build:

How to Check Your Build

# For BAM files -- look at the reference in the header:
samtools view -H your_file.bam | grep "^@SQ" | head -3

# GRCh38 uses "chr" prefix: SN:chr1, SN:chr2, etc.
# GRCh37 may lack "chr" prefix: SN:1, SN:2, etc.
# (Some GRCh37 builds do use chr prefix -- check chromosome lengths to be sure)
# chr1 length: GRCh38 = 248,956,422; GRCh37 = 249,250,621
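
The chromosome-length check can be automated. The sketch below reads a header from stdin and reports the build from the chr1 length alone, using the two lengths listed above; `detect_build` is a hypothetical helper name, and anything that is not one of those two lengths is reported as unknown.

```shell
# Report the genome build from the chr1 (or "1") @SQ line of a header on stdin.
detect_build() {
    awk -F '\t' '
        $1 == "@SQ" {
            sn = ""; ln = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) sn = substr($i, 4)
                if ($i ~ /^LN:/) ln = substr($i, 4)
            }
            if (sn == "chr1" || sn == "1") {
                if (ln == "248956422") print "GRCh38"
                else if (ln == "249250621") print "GRCh37"
                else print "unknown"
                found = 1
                exit
            }
        }
        END { if (!found) print "unknown" }'
}

# Usage:
#   samtools view -H your_file.bam | detect_build
```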

Converting from GRCh37 to GRCh38

Recommended approach: extract FASTQ from the BAM and re-align to GRCh38:

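# Extract paired-end FASTQ from BAM
# (name-sort so mates are adjacent; -0/-s discard unpaired and singleton reads
#  instead of printing them to the terminal; -n keeps read names unchanged)
mkdir -p ${GENOME_DIR}/${SAMPLE}/fastq
docker run --rm -v ${GENOME_DIR}:/genome staphb/samtools:1.21 \
  bash -c "samtools sort -n /genome/${SAMPLE}/old_hg19.bam | \
           samtools fastq -1 /genome/${SAMPLE}/fastq/${SAMPLE}_R1.fastq.gz \
                          -2 /genome/${SAMPLE}/fastq/${SAMPLE}_R2.fastq.gz \
                          -0 /dev/null -s /dev/null -n -"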

# Then run the pipeline from step 2 (alignment to GRCh38)
./scripts/02-alignment.sh $SAMPLE

Alternative (quicker but less accurate): Use Picard LiftoverVcf to convert VCF coordinates. This requires a GRCh37-to-GRCh38 chain file and the GRCh38 reference FASTA. Liftover can introduce artifacts in complex regions and is not recommended for clinical use.


Download Checklist

Regardless of vendor, follow this checklist before starting the pipeline:

1. Download Everything Available

Even if you only plan to use VCF, download FASTQ and BAM too. Storage is cheap; re-sequencing is not.

| Priority | File | Why |
|---|---|---|
| Critical | FASTQ (R1 + R2) | Raw data. Can regenerate everything else. |
| High | BAM + BAI | Skip alignment (saves 1-2 hours). Needed for SV calling. |
| Medium | VCF + TBI | Quick start for annotation steps. |
| Low | Lab report PDF | Reference for comparing your pipeline results. |
| Low | QC metrics | Coverage stats, insert size distribution. |

2. Verify Download Integrity

Large files can be silently truncated during download:

# Check FASTQ integrity
gzip -t ${SAMPLE}_R1.fastq.gz && echo "R1 OK" || echo "R1 CORRUPT"
gzip -t ${SAMPLE}_R2.fastq.gz && echo "R2 OK" || echo "R2 CORRUPT"

# Check BAM integrity
docker run --rm -v ${GENOME_DIR}:/genome staphb/samtools:1.21 \
  samtools quickcheck /genome/${SAMPLE}/aligned/${SAMPLE}_sorted.bam \
  && echo "BAM OK" || echo "BAM CORRUPT"

# Check VCF integrity
docker run --rm -v ${GENOME_DIR}:/genome staphb/bcftools:1.21 \
  bcftools view -h /genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz > /dev/null \
  && echo "VCF OK" || echo "VCF CORRUPT"

If a file is truncated, resume the download with wget -c. If a fully downloaded file still fails its check, delete it and download it again from scratch.
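
When a vendor delivers many FASTQ files (for example, one pair per lane), checking them one by one gets tedious. A minimal sketch that runs the same gzip test over every .gz file in a directory; the function name `check_gzip_files` is made up for illustration.

```shell
# Run "gzip -t" on every .gz file in a directory and report each result.
# Returns non-zero if any file fails the test.
check_gzip_files() {
    dir=$1
    status=0
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue    # the glob matched nothing
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_gzip_files "${GENOME_DIR}/${SAMPLE}/fastq"
```

Testing a 30-45 GB file reads it end to end, so expect this to take several minutes per file.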

3. Check Expected File Sizes

If a file is suspiciously small, the download likely failed:

| File | Expected Size (30X WGS) | Suspicious If |
|---|---|---|
| FASTQ (each, gzipped) | 30-45 GB | < 10 GB |
| BAM | 80-120 GB | < 40 GB |
| VCF (bgzipped) | 80-200 MB | < 10 MB |
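
These thresholds are easy to check in a script. The sketch below compares a file's size with a minimum given in GB; `is_suspiciously_small` is a hypothetical helper, and the thresholds in the usage comments come from the "Suspicious If" values above (10 MB is written as roughly 0.01 GB).

```shell
# Succeed (exit 0) if the file is smaller than min_gb gigabytes.
is_suspiciously_small() {
    file=$1
    min_gb=$2
    size_bytes=$(wc -c < "$file")
    # awk handles the fractional GB values and the large byte counts.
    awk -v s="$size_bytes" -v g="$min_gb" \
        'BEGIN { exit !(s < g * 1024 * 1024 * 1024) }'
}

# Usage:
#   is_suspiciously_small "${SAMPLE}_R1.fastq.gz" 10   && echo "R1 looks too small"
#   is_suspiciously_small "${SAMPLE}_sorted.bam" 40    && echo "BAM looks too small"
#   is_suspiciously_small "${SAMPLE}.vcf.gz" 0.01      && echo "VCF looks too small"
```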

4. Back Up Before Running

The pipeline creates output alongside your input data. Before running anything, copy your raw FASTQ/BAM/VCF to a separate backup location. If something goes wrong, you want to avoid re-downloading from a vendor whose download window may have expired.
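
A backup is only useful if the copy is intact. One way to do this, sketched below, is to copy each file and then compare it byte for byte with the original. The function name `backup_and_verify` and the `/mnt/backup` destination in the usage comment are placeholders.

```shell
# Copy a file into dest_dir and confirm the copy is byte-identical.
backup_and_verify() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src")"
    cp -p "$src" "$dest" || return 1
    if cmp -s "$src" "$dest"; then
        echo "backed up: $dest"
    else
        echo "backup mismatch: $dest" >&2
        return 1
    fi
}

# Usage:
#   backup_and_verify "${GENOME_DIR}/${SAMPLE}/fastq/${SAMPLE}_R1.fastq.gz" /mnt/backup/${SAMPLE}
```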


File Size Reference

Know what to expect before downloading:

| File Type | Typical Size (30X WGS) | Notes |
|---|---|---|
| FASTQ (gzipped, paired) | 60-90 GB | Two files: R1 + R2 |
| ORA (Illumina compressed) | 15-20 GB | Same data as FASTQ, ~5x smaller |
| BAM (aligned) | 80-120 GB | Largest single file |
| CRAM (compressed aligned) | 40-60 GB | 40-60% smaller than BAM |
| VCF (variants) | 80-200 MB | Relatively small |
| gVCF (with reference blocks) | 3-10 GB | Much larger than VCF |
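
Given these sizes, it is worth confirming that the target disk has room before starting a download. A minimal sketch using `df`; the helper name `has_free_space` is an assumption, and the 250 GB figure in the usage comment is only an illustrative budget (upper-end FASTQ plus BAM plus VCF, with some headroom), not a pipeline requirement.

```shell
# Succeed (exit 0) if the filesystem holding dir has at least needed_gb free.
has_free_space() {
    dir=$1
    needed_gb=$2
    # df -Pk prints one POSIX-format line per filesystem; field 4 is free KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    awk -v a="$avail_kb" -v n="$needed_gb" \
        'BEGIN { exit !(a >= n * 1024 * 1024) }'
}

# Usage:
#   has_free_space "${GENOME_DIR}" 250 || echo "Not enough free space for FASTQ + BAM + VCF"
```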