Personal-Genome-Pipeline

Step 1b: fastp QC + Adapter Trimming

Pre-alignment quality control: removes adapter sequences, trims low-quality bases, filters short reads, and removes polyG tails. Produces trimmed FASTQs plus JSON and HTML QC reports.


What It Does

  1. Adapter removal: auto-detects and trims Illumina TruSeq, Nextera, BGI/MGI, and other adapters from a built-in list of 170+ sequences, using both sequence matching and paired-end (PE) overlap analysis
  2. Quality trimming: slides a 4 bp window inward from each end of the read and drops leading/trailing windows whose mean quality is below Q20
  3. PolyG tail removal: strips artificial polyG tails generated by NovaSeq/NextSeq two-color chemistry
  4. Length filtering: discards reads shorter than 36 bp after trimming
  5. QC reporting: generates per-base quality curves, adapter content, GC distribution, duplication estimates, and insert size distribution (before and after filtering)
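
The quality-trimming and length-filtering steps can be sketched in a few lines of Python. This is a simplified illustration of the sliding-window idea, not fastp's actual implementation; the function names are hypothetical, and the defaults (4 bp window, Q20, 36 bp) are taken from the list above.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def trim_front(quals, window=4, min_mean=20):
    """Index of the first base to keep: start of the first window that passes."""
    start = 0
    while start + window <= len(quals):
        if sum(quals[start:start + window]) / window >= min_mean:
            return start
        start += 1
    return len(quals)  # no window passed: nothing is kept


def trim_tail(quals, window=4, min_mean=20):
    """Exclusive end index: end of the last window that passes."""
    end = len(quals)
    while end - window >= 0:
        if sum(quals[end - window:end]) / window >= min_mean:
            return end
        end -= 1
    return 0  # no window passed: nothing is kept


def quality_trim(seq, quals, window=4, min_mean=20, min_len=36):
    """Trim both ends, then apply the length filter. Returns None if discarded."""
    start = trim_front(quals, window, min_mean)
    end = trim_tail(quals, window, min_mean)
    if end - start < min_len:
        return None
    return seq[start:end], quals[start:end]


# Synthetic read: 3 low-quality bases, 40 good bases (Q30), 2 low-quality bases.
seq = "NNN" + "ACGT" * 10 + "NN"
quals = phred33("###" + "?" * 40 + "##")
trimmed = quality_trim(seq, quals)
```

Because the decision is made per window, a single low-quality base next to good bases can survive trimming when the window containing it still averages Q20 or better.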

Why

Raw FASTQ reads contain adapter contamination and low-quality bases that can cause:

Trimming is particularly important for BGI/MGI (DNBseq) and Nebula Genomics data, where adapter contamination rates are higher than in typical Illumina libraries. Aligners like minimap2 and BWA-MEM2 do not detect or remove adapters; they only soft-clip the unaligned portions of a read as a side effect of local alignment.

DeepVariant uses base quality as one of its 6 input channels. Adapter bases carry inflated quality scores that DeepVariant cannot distinguish from real sequence.

Tool

fastp v1.3.1 — all-in-one FASTQ preprocessor with SIMD acceleration and parallel compression.

Docker Image

quay.io/biocontainers/fastp:1.3.1--h43da1c4_0

Command

export GENOME_DIR=/path/to/data
./scripts/01b-fastp-qc.sh <sample_name>

Skip trimming

SKIP_TRIM=true ./scripts/run-all.sh <sample_name> <sex>

When skipped, the alignment step reads raw FASTQs directly (existing behavior).

What Happens Inside

# Flags are collected in a bash array so that each one can carry a comment;
# a comment placed after a trailing backslash would break the line continuation.
fastp_args=(
  -i input_R1.fastq.gz -I input_R2.fastq.gz
  -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz
  --detect_adapter_for_pe        # PE overlap-based adapter detection
  --qualified_quality_phred 20   # Bases below Q20 are "unqualified"
  --cut_front --cut_tail         # Sliding window trim from both ends
  --cut_mean_quality 20          # Window mean quality threshold
  --length_required 36           # Discard reads < 36bp
  -g                             # PolyG tail trimming (NovaSeq/NextSeq)
  -R "sample_name"               # Report title (used by MultiQC)
  -j report.json -h report.html  # QC reports
  -w 8                           # Worker threads
)
fastp "${fastp_args[@]}"
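
If the step is driven from Python rather than from the shell script, the same invocation can be assembled as an argument list. The sketch below is an assumption about how one might wrap it; `build_fastp_cmd` and its parameter names are hypothetical, while the flags are the ones shown in the command above.

```python
import shlex


def build_fastp_cmd(r1_in, r2_in, r1_out, r2_out, sample,
                    json_report, html_report, threads=8):
    """Return the fastp argument list used by this step (hypothetical helper)."""
    return [
        "fastp",
        "-i", r1_in, "-I", r2_in,
        "-o", r1_out, "-O", r2_out,
        "--detect_adapter_for_pe",          # PE overlap-based adapter detection
        "--qualified_quality_phred", "20",  # bases below Q20 are "unqualified"
        "--cut_front", "--cut_tail",        # sliding-window trim from both ends
        "--cut_mean_quality", "20",         # window mean quality threshold
        "--length_required", "36",          # discard reads shorter than 36 bp
        "-g",                               # polyG tail trimming
        "-R", sample,                       # report title
        "-j", json_report, "-h", html_report,
        "-w", str(threads),                 # worker threads
    ]


cmd = build_fastp_cmd(
    "input_R1.fastq.gz", "input_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
    "sample_name", "report.json", "report.html",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To execute (requires fastp on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids shell quoting problems when sample names or paths contain spaces.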

Adapter detection

For paired-end data, fastp detects adapters by overlap analysis. When the insert is shorter than the read length, R1 and the reverse complement of R2 overlap across the whole insert, and any bases that extend past the overlap (the 3' tail of each read) are adapter sequence. This works for any adapter type without needing to specify sequences.
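
The overlap idea can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not fastp's algorithm: the function names are hypothetical, and the minimum overlap of 30 bases and tolerance of 2 mismatches are illustrative values chosen for this example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def detect_insert_length(r1, r2, min_overlap=30, max_mismatches=2):
    """Return the insert length if it is shorter than the reads, else None.

    With a short insert, the first `ins` bases of R1 equal the last `ins`
    bases of revcomp(R2); everything after them in each read is adapter.
    """
    rc2 = revcomp(r2)
    n = min(len(r1), len(rc2))
    for ins in range(n - 1, min_overlap - 1, -1):
        if hamming(r1[:ins], rc2[len(rc2) - ins:]) <= max_mismatches:
            return ins
    return None


def trim_by_overlap(r1, r2, **kwargs):
    """Clip both reads to the detected insert; leave them alone otherwise."""
    ins = detect_insert_length(r1, r2, **kwargs)
    if ins is None:
        return r1, r2
    return r1[:ins], r2[:ins]


# Synthetic pair: a 40 bp insert read with 50 bp reads, so each read runs
# 10 bases into an adapter (the adapter fragments here are illustrative).
insert = "ACGTTGCAAGGCTTACCGATAGCTGGATCCTAAGCGTTCA"
r1 = insert + "AAGTCGGAGG"
r2 = revcomp(insert) + "AAGTCGGATC"
t1, t2 = trim_by_overlap(r1, r2)
```

Note that the adapter sequence itself is never consulted: the read-through is located purely from how the two mates overlap.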

Additionally, fastp has 170+ built-in adapter sequences including both Illumina TruSeq and BGI/MGI adapters in its known adapters database. No platform-specific flags are needed.

BGI/MGI note

fastp’s known adapters include the MGI/BGI forward (AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA) and reverse (AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG) sequences. Auto-detection handles DNBseq data without additional configuration.
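
Sequence-based matching against a known adapter can be sketched as follows. This is an exact-match simplification for illustration (real trimmers tolerate mismatches); `clip_adapter` and `min_match` are hypothetical names, and the adapter is the MGI/BGI forward sequence quoted above.

```python
MGI_FORWARD = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"


def clip_adapter(read, adapter, min_match=8):
    """Clip the read at the first position where the rest of the read
    matches the start of the adapter (full adapter, or a partial one
    that runs off the 3' end of the read)."""
    for pos in range(len(read) - min_match + 1):
        tail = read[pos:]
        k = min(len(tail), len(adapter))
        if tail[:k] == adapter[:k]:
            return read[:pos]
    return read


# Synthetic read: 20 bases of insert followed by the first 12 adapter bases.
read = "ACGTTGCAAGGCTTACCGAT" + MGI_FORWARD[:12]
clipped = clip_adapter(read, MGI_FORWARD)
```

Requiring at least `min_match` bases guards against clipping reads whose last few bases happen to resemble the adapter start by chance.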

Output

File         Location                            Description
Trimmed R1   fastq_trimmed/<sample>_R1.fastq.gz  Quality-filtered, adapter-trimmed read 1
Trimmed R2   fastq_trimmed/<sample>_R2.fastq.gz  Quality-filtered, adapter-trimmed read 2
JSON report  fastq_trimmed/<sample>_fastp.json   Machine-readable QC (consumed by MultiQC)
HTML report  fastq_trimmed/<sample>_fastp.html   Visual QC report (open in browser)

Runtime

Dataset                 Threads  Time        Memory
30X WGS (~60 GB FASTQ)  8        ~10-20 min  < 2 GB
chr22 test data         4        < 1 min     < 1 GB

fastp is I/O-bound, not CPU-bound. Threads beyond 8-16 provide minimal improvement.

Interpreting the Reports

HTML report

Open <sample>_fastp.html in a browser to see:

What to look for

MultiQC Integration

fastp’s JSON report is automatically detected by MultiQC. The -R flag sets the sample name shown in the aggregated report. The JSON file must contain "before_filtering": { to be recognized.
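
A quick sanity check of a report before handing it to MultiQC might look like the sketch below. The key names (`summary`, `before_filtering`, `after_filtering`, `total_reads`, `q30_rate`) reflect the fastp JSON layout as I understand it and should be verified against a real report; the helper names are hypothetical and the numbers are made up for the example.

```python
import json

# Marker string quoted in the text above.
MULTIQC_MARKER = '"before_filtering": {'


def has_multiqc_marker(report_text):
    """True if the raw JSON text contains the marker MultiQC looks for."""
    return MULTIQC_MARKER in report_text


def summarize_fastp(report):
    """Pull a few headline numbers out of a parsed fastp report."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "fraction_kept": after["total_reads"] / before["total_reads"],
        "q30_rate_after": after["q30_rate"],
    }


# Synthetic stand-in for <sample>_fastp.json (values are invented).
example_text = json.dumps({
    "summary": {
        "before_filtering": {"total_reads": 1000, "q30_rate": 0.90},
        "after_filtering": {"total_reads": 950, "q30_rate": 0.93},
    }
})

ok = has_multiqc_marker(example_text)
stats = summarize_fastp(json.loads(example_text))
```

For a real run, replace `example_text` with the contents of `fastq_trimmed/<sample>_fastp.json`.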

Notes