Standard WGS (from Illumina, Novogene, Dante Labs, etc.) produces short reads — millions of DNA fragments typically 150 base pairs long. Two companies offer a fundamentally different approach: reading much longer stretches of DNA in a single pass.
ONT sequencers (MinION, PromethION, GridION) push a strand of DNA through a tiny protein pore. As each nucleotide passes through, it changes the electrical current. The device reads these current changes in real time to determine the sequence.
PacBio sequencers (Sequel II, Revio) use a different approach: a polymerase enzyme copies a circular DNA template while a camera records fluorescent signals from each incorporated nucleotide. By reading the same circle multiple times, errors cancel out.
Long reads solve problems that short reads cannot:
| Capability | Short Reads (150bp) | Long Reads (10-100Kbp) |
|---|---|---|
| SNPs and small indels | Excellent | Excellent (HiFi) / Good (ONT) |
| Structural variants (>50bp) | Limited | Excellent |
| Repeat expansions | Indirect (ExpansionHunter) | Direct measurement |
| Phasing (which allele is on which chromosome) | Statistical | Read-based, much more accurate |
| Segmental duplications / tandem repeats | Mostly invisible | Resolvable |
| Methylation | Requires bisulfite conversion | Native (ONT) |
Not every step in this pipeline works with long-read data. Here is the full breakdown.
These steps take a VCF or BAM and work identically regardless of read technology:
| Step | Tool | Notes |
|---|---|---|
| 5 | AnnotSV | Takes any SV VCF — point it at Sniffles2 output |
| 6 | ClinVar Screen | VCF-only, technology-independent |
| 7 | PharmCAT | VCF-only — requires VCF + .tbi symlinked to vcf/ (see below) |
| 11 | ROH Analysis | VCF-only via plink2 |
| 12 | Mito Haplogroup | VCF-only via haplogrep3 |
| 13 | VEP Annotation | VCF-only |
| 14 | Imputation Prep | VCF-only |
| 17 | CPSR | VCF-only |
| 22 | SURVIVOR Merge | Takes any SV VCF set |
| 23 | Clinical Filter | VCF-only |
| 24 | HTML Report | Aggregates existing outputs |
| 25 | PRS | VCF-only via plink2 |
| 26 | Ancestry | VCF-only via plink2 |
| 27 | CPIC Lookup | Reads PharmCAT JSON output |
| Step | Short-Read Tool | Long-Read Alternative | Script |
|---|---|---|---|
| 2 | minimap2 (-x sr) |
minimap2 (-x map-ont / -x map-hifi) |
scripts/02b-alignment-longread.sh |
| 3 | DeepVariant (WGS model) | Clair3 or DeepVariant (ONT/PACBIO model) | scripts/03e-clair3.sh |
| 4 | Manta | Sniffles2 | scripts/04c-sniffles2.sh |
These tools are specifically designed for short-read data and will produce incorrect results or crash with long-read input:
| Step | Tool | Why It Fails | Alternative |
|---|---|---|---|
| 1b | fastp | Expects paired-end Illumina reads; no long-read mode | NanoPlot, LongQC, or chopper for long-read QC |
| 2a | BWA-MEM2 | Short-read aligner only; cannot handle reads >500bp | Use minimap2 with long-read preset |
| 3a | GATK HaplotypeCaller | Active-region model assumes short reads; poor with long reads | Clair3 or DeepVariant long-read mode |
| 3b | FreeBayes | Bayesian model tuned for short reads | Clair3 or DeepVariant long-read mode |
| 3c | Strelka2 | Scoring model trained on BWA-MEM short-read data | Clair3 or DeepVariant long-read mode |
| 3d | Octopus | Haplotype model designed for short reads | Clair3 or DeepVariant long-read mode |
| 4 | Manta | Illumina-specific insert size model | Sniffles2 |
| 4a | TIDDIT | Short-read coverage/discordance model | Sniffles2 |
| 4b | GRIDSS | Assembly-based, short-read specific | Sniffles2 |
| 9 | ExpansionHunter | Illumina short-read graph model; expects paired-end data | TRGT (PacBio), STRique (ONT), or direct long-read spanning |
| 15 | duphold | Re-genotypes SVs using short-read depth models | Not needed — Sniffles2 QUAL scores are reliable |
| 18 | CNVnator | Read-depth model calibrated for short reads | Sniffles2 detects CNVs natively |
| 19 | Delly | Paired-end and split-read model | Sniffles2 |
| 21 | Cyrius (CYP2D6) | Short-read depth-based star allele caller | Paraphase (long-read CYP2D6 resolver) |
Same reference data as the short-read pipeline:
export GENOME_DIR=/path/to/your/data
# Reference genome and index should already exist from initial setup
# See docs/00-reference-setup.md
Place your long-read data in the sample directory:
# Create sample directory
mkdir -p ${GENOME_DIR}/your_sample/fastq/
# Copy or symlink your long-read FASTQ
# ONT: typically one FASTQ per run (basecalled from FAST5/POD5)
cp /path/to/ont_reads.fastq.gz ${GENOME_DIR}/your_sample/fastq/your_sample.fastq.gz
# PacBio: typically one FASTQ or unaligned BAM per cell
cp /path/to/hifi_reads.fastq.gz ${GENOME_DIR}/your_sample/fastq/your_sample.fastq.gz
# OR for unaligned BAM:
cp /path/to/hifi_reads.bam ${GENOME_DIR}/your_sample/fastq/your_sample.bam
Note: long-read data is single-end (no R1/R2 pairs). The alignment script expects a single file, not paired files.
# For Oxford Nanopore
PLATFORM=ont ./scripts/02b-alignment-longread.sh your_sample
# For PacBio HiFi
PLATFORM=hifi ./scripts/02b-alignment-longread.sh your_sample
Output: ${GENOME_DIR}/your_sample/aligned_longread/your_sample_sorted.bam
The script auto-detects input files in this order:
fastq/your_sample.fastq.gzfastq/your_sample_lr.fastq.gzfastq/your_sample.fq.gzfastq/your_sample.bam (unaligned BAM)Override with: INPUT=${GENOME_DIR}/your_sample/fastq/custom_reads.fastq.gz PLATFORM=ont ./scripts/02b-alignment-longread.sh your_sample
Note: The INPUT path must be physically inside
GENOME_DIRbecause Docker only mounts that directory. Symlinks that point outsideGENOME_DIRwill not work — Docker cannot follow them. Copy or move the file instead.
You have two options:
# For ONT
PLATFORM=ont ./scripts/03e-clair3.sh your_sample
# For PacBio HiFi
PLATFORM=hifi ./scripts/03e-clair3.sh your_sample
Output: ${GENOME_DIR}/your_sample/vcf_clair3/your_sample.vcf.gz
Clair3 uses deep-learning models trained specifically on long-read error profiles. It auto-selects the correct model based on your PLATFORM setting.
DeepVariant 1.6.0+ also supports long reads with dedicated models:
SAMPLE=your_sample
docker run --rm \
--cpus 8 --memory 32g \
-v "${GENOME_DIR}:/genome" \
google/deepvariant:1.6.0 \
/opt/deepvariant/bin/run_deepvariant \
--model_type=ONT_R104 \
--ref="/genome/reference/Homo_sapiens_assembly38.fasta" \
--reads="/genome/${SAMPLE}/aligned_longread/${SAMPLE}_sorted.bam" \
--output_vcf="/genome/${SAMPLE}/vcf/${SAMPLE}.vcf.gz" \
--num_shards=8
# For PacBio, use --model_type=PACBIO instead of ONT_R104
Clair3 vs DeepVariant for long reads:
| Aspect | Clair3 | DeepVariant |
|---|---|---|
| ONT accuracy | Excellent (purpose-built) | Good (newer models improving) |
| PacBio HiFi accuracy | Excellent | Excellent |
| Speed | Generally faster | Slower (more compute-heavy) |
| Indel calling | Very good | Very good |
| Community support | Active (HKU) | Active (Google) |
| Model updates | Frequent (matches new chemistries) | Periodic |
For PacBio HiFi data, both tools perform comparably. For ONT data, Clair3 tends to have an edge because its models are updated more frequently to match new ONT basecalling improvements.
# Default: uses aligned_longread/ directory
./scripts/04c-sniffles2.sh your_sample
# Or explicitly:
ALIGN_DIR=aligned_longread ./scripts/04c-sniffles2.sh your_sample
Output: ${GENOME_DIR}/your_sample/sv_sniffles/your_sample_sv.vcf.gz
Sniffles2 excels at detecting all SV types from long reads. Typical output for a 30X long-read genome:
After alignment and variant calling, most downstream steps work normally. Point them at the long-read outputs:
SAMPLE=your_sample
# ClinVar screen (uses VCF from Clair3)
VCF_DIR=vcf_clair3 ./scripts/06-clinvar-screen.sh "$SAMPLE"
# PharmCAT (uses VCF from Clair3)
# PharmCAT needs the VCF + .tbi index at the default vcf/ location.
# Symlink or copy both files:
mkdir -p "${GENOME_DIR}/${SAMPLE}/vcf"
ln -sf "../vcf_clair3/${SAMPLE}.vcf.gz" "${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz"
ln -sf "../vcf_clair3/${SAMPLE}.vcf.gz.tbi" "${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz.tbi"
./scripts/07-pharmacogenomics.sh "$SAMPLE"
# VEP annotation (uses VCF from Clair3)
VCF_DIR=vcf_clair3 ./scripts/13-vep-annotation.sh "$SAMPLE"
# Annotate Sniffles2 SVs
SV_VCF="${GENOME_DIR}/${SAMPLE}/sv_sniffles/${SAMPLE}_sv.vcf.gz" ./scripts/05-annotsv.sh "$SAMPLE"
| Tool | Image | Size |
|---|---|---|
| minimap2 | quay.io/biocontainers/minimap2:2.28--he4a0461_0 |
~30 MB |
| samtools | staphb/samtools:1.20 |
~200 MB |
| Clair3 | hkubal/clair3:v2.0.0 |
~3 GB (includes all models) |
| Sniffles2 | quay.io/biocontainers/sniffles:2.4--pyhdfd78af_0 |
~200 MB |
| DeepVariant | google/deepvariant:1.6.0 |
~5 GB |
Pre-pull images before your first run:
docker pull quay.io/biocontainers/minimap2:2.28--he4a0461_0
docker pull staphb/samtools:1.20
docker pull hkubal/clair3:v2.0.0
docker pull quay.io/biocontainers/sniffles:2.4--pyhdfd78af_0
ONT raw data (FAST5/POD5) must be basecalled before entering this pipeline. Use the latest Dorado basecaller with the super accuracy (SUP) model for best variant calling:
# This happens BEFORE the pipeline — on a GPU machine
dorado basecaller sup pod5_dir/ > reads.bam
dorado demux reads.bam # if barcoded
The Clair3 model (r1041_e82_400bps_sup_v500) assumes R10.4.1 chemistry with SUP basecalling. Using a different chemistry or basecaller may reduce accuracy.
ONT generates a wide range of read lengths. Very short reads (<1kb) add noise. If your N50 is below 5kb, consider filtering:
# Filter reads shorter than 1kb (optional, before alignment)
# Use chopper or NanoFilt
ONT natively detects methylation (5mC, 6mA) during basecalling with Dorado. Methylation tags are stored in the BAM as MM/ML tags. This pipeline does not currently process methylation data, but the aligned BAM preserves these tags for future use.
This pipeline supports HiFi (CCS) reads only, not older CLR (continuous long reads). HiFi reads are generated by PacBio Sequel II, Sequel IIe, and Revio systems using circular consensus sequencing.
If you received CLR data, you’ll need to generate CCS reads first using ccs or pbccs (PacBio tool, outside this pipeline).
PacBio HiFi BAMs may contain kinetic modification tags (fi, fp, ri, rp) for methylation detection. Like ONT methylation, these are preserved in alignment but not processed by this pipeline.
If you have both short-read and long-read data for the same sample, you can run both pipelines and combine results:
02-alignment.sh + 03-deepvariant.sh + 04-manta.sh02b-alignment-longread.sh + 03e-clair3.sh + 04c-sniffles2.shbenchmark-variants.sh to compare, then take the intersection for high-confidence calls22-survivor-merge.sh with both Manta and Sniffles2 VCFs for consensus SVsThis is the gold standard approach when both data types are available.
Long-read alignment uses more memory than short-read because the index is loaded into RAM. Increase Docker memory limit:
# In the script, change --memory 16g to --memory 24g or higher
The Docker image hkubal/clair3:v2.0.0 bundles models at /opt/models/. If you get a model-not-found error:
docker images | grep clair3docker run --rm hkubal/clair3:v2.0.0 ls /opt/models/samtools stats your.bam | grep "average length"--model_type=ONT_R104 for ONT (not ONT alone)--model_type=PACBIO for PacBio HiFiSteps 6 (ClinVar), 13 (VEP), and 17 (CPSR) accept a VCF_DIR env var:
VCF_DIR=vcf_clair3 ./scripts/06-clinvar-screen.sh your_sample
VCF_DIR=vcf_clair3 ./scripts/13-vep-annotation.sh your_sample
For steps that don’t support VCF_DIR (e.g., PharmCAT), symlink the Clair3 VCF:
ln -s ../vcf_clair3/${SAMPLE}.vcf.gz ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz
ln -s ../vcf_clair3/${SAMPLE}.vcf.gz.tbi ${GENOME_DIR}/${SAMPLE}/vcf/${SAMPLE}.vcf.gz.tbi
Rough numbers for a 30X human genome:
| Metric | Short-Read (Illumina) | ONT (R10.4.1 SUP) | PacBio HiFi (Revio) |
|---|---|---|---|
| SNPs | ~4.5M | ~4.3M | ~4.5M |
| Indels | ~1.0M | ~0.8M | ~1.0M |
| PASS rate | ~93% | ~85-90% | ~93% |
| SVs (total) | ~7-9K (Manta) | ~20-25K (Sniffles2) | ~18-22K (Sniffles2) |
| SV insertions | ~500 (poor) | ~8-10K (excellent) | ~7-9K (excellent) |
| Alignment time | 1-2 hours | 1-3 hours | 1-2 hours |
| Variant calling time | 2-4 hours (DV) | 2-4 hours (Clair3) | 2-4 hours (Clair3) |
The biggest advantage of long reads is SV detection — especially insertions, which short reads systematically miss.