Personal-Genome-Pipeline

Step 1b: fastp QC + Adapter Trimming

Pre-alignment quality control: removes adapter sequences, trims low-quality bases, filters short reads, and removes polyG tails. Produces trimmed FASTQs plus JSON and HTML QC reports.

What It Does

Adapter removal: auto-detects and trims Illumina TruSeq, Nextera, BGI/MGI, and 170+ other adapter sequences using both sequence matching and PE overlap analysis
Quality trimming: slides a 4bp window inward from each end of the read, dropping bases until the window's mean quality reaches Q20
PolyG tail removal: strips artificial polyG tails generated by NovaSeq/NextSeq two-color chemistry, where a cycle with no signal is called as G
Length filtering: discards reads shorter than 36bp after trimming
QC reporting: generates per-base quality curves, adapter content, GC distribution, duplication estimates, and insert size distribution (before + after filtering)
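
The quality-trimming and length-filtering rules above can be illustrated with a short Python sketch. This is a simplified model, not fastp's actual implementation: the function name `sliding_window_trim` and its parameters are made up for illustration, qualities are assumed to be Phred+33 encoded, and the defaults mirror the 4bp window, Q20 mean, and 36bp minimum used in this step.

```python
from typing import Optional, Tuple


def sliding_window_trim(
    seq: str,
    qual: str,
    window: int = 4,
    mean_q: float = 20.0,
    min_len: int = 36,
) -> Optional[Tuple[str, str]]:
    """Trim both ends of one read; return None if the result is too short."""
    scores = [ord(c) - 33 for c in qual]  # assumes Phred+33 encoding
    start, end = 0, len(scores)

    def window_mean(lo: int, hi: int) -> float:
        return sum(scores[lo:hi]) / (hi - lo)

    # Front: drop one base at a time while the leading window is below threshold.
    while end - start >= window and window_mean(start, start + window) < mean_q:
        start += 1
    # Tail: same idea, moving inward from the 3' end.
    while end - start >= window and window_mean(end - window, end) < mean_q:
        end -= 1

    # Length filter: reads that end up shorter than min_len are discarded.
    if end - start < min_len:
        return None
    return seq[start:end], qual[start:end]
```

Because the test is on the window mean, a few low-quality bases can survive next to high-quality neighbours: a 50bp read with five Q2 bases on each side of forty Q40 bases keeps 44 bases under this model, not 40.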

Why

Raw FASTQ reads contain adapter contamination and low-quality bases that can cause:

False variant calls — adapter sequences misalign to the reference and look like mutations
Reduced mapping rates — reads with adapter tails may not align or align incorrectly
Inflated duplication — adapter-contaminated reads cluster together artificially

This is particularly important for BGI/MGI (DNBseq) and Nebula Genomics data, where adapter contamination rates are higher than in typical Illumina libraries. Aligners like minimap2 and BWA-MEM2 do not detect or remove adapters; they only soft-clip read ends that fail to align, as a side effect of local alignment.

DeepVariant uses base quality as one of its 6 input channels. Adapter bases are sequenced as confidently as genomic bases, so they carry high quality scores and DeepVariant cannot tell them apart from real sequence on quality alone.

Tool

fastp v1.3.1 — all-in-one FASTQ preprocessor with SIMD acceleration and parallel compression.

Paper: Chen et al., Bioinformatics 2018 (doi:10.1093/bioinformatics/bty560)
Source: github.com/OpenGene/fastp

Docker Image

quay.io/biocontainers/fastp:1.3.1--h43da1c4_0

Command

export GENOME_DIR=/path/to/data
./scripts/01b-fastp-qc.sh <sample_name>

Skip trimming

SKIP_TRIM=true ./scripts/run-all.sh <sample_name> <sex>

When skipped, the alignment step reads raw FASTQs directly (existing behavior).

What Happens Inside

# Arguments live in a bash array so that each one can carry an inline comment.
fastp_args=(
  -i input_R1.fastq.gz -I input_R2.fastq.gz
  -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz
  --detect_adapter_for_pe        # Adapter sequence auto-detection for PE (overlap trimming is already on by default)
  --qualified_quality_phred 20   # Bases below Q20 are "unqualified"
  --cut_front --cut_tail         # Sliding window trim from both ends
  --cut_mean_quality 20          # Window mean quality threshold
  --length_required 36           # Discard reads < 36bp
  -g                             # PolyG tail trimming (NovaSeq/NextSeq)
  -R "sample_name"               # Report title (used by MultiQC)
  -j report.json -h report.html  # QC reports
  -w 8                           # Worker threads
)
fastp "${fastp_args[@]}"
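
For readers who drive the container from a script, the sketch below assembles the same fastp arguments as a `docker run` argument list in Python. It is only an illustration and not what `01b-fastp-qc.sh` necessarily does: the `/data` mount point, the raw FASTQ naming (`<sample>_R1.fastq.gz` directly under the data directory), and the function name `fastp_docker_argv` are assumptions. The image tag, flags, and output file names are taken from this page.

```python
import shlex
from typing import List

FASTP_IMAGE = "quay.io/biocontainers/fastp:1.3.1--h43da1c4_0"


def fastp_docker_argv(genome_dir: str, sample: str, threads: int = 8) -> List[str]:
    """Build a `docker run` argument list for the fastp call shown above."""
    raw = f"/data/{sample}"                # hypothetical raw FASTQ prefix
    out = f"/data/fastq_trimmed/{sample}"  # matches the output table on this page
    return [
        "docker", "run", "--rm",
        "-v", f"{genome_dir}:/data",       # hypothetical mount point
        FASTP_IMAGE,
        "fastp",
        "-i", f"{raw}_R1.fastq.gz", "-I", f"{raw}_R2.fastq.gz",
        "-o", f"{out}_R1.fastq.gz", "-O", f"{out}_R2.fastq.gz",
        "--detect_adapter_for_pe",
        "--qualified_quality_phred", "20",
        "--cut_front", "--cut_tail",
        "--cut_mean_quality", "20",
        "--length_required", "36",
        "-g",
        "-R", sample,
        "-j", f"{out}_fastp.json", "-h", f"{out}_fastp.html",
        "-w", str(threads),
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection instead of running it.
    print(shlex.join(fastp_docker_argv("/path/to/data", "sample1")))
```

Passing a list to `subprocess.run` (rather than a single string) avoids shell quoting problems with sample names that contain spaces or special characters.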

Adapter detection

For paired-end data, fastp detects adapters by overlap analysis: when the insert is shorter than the read length, R1 and the reverse complement of R2 overlap across the whole insert, and any bases that extend past the overlap at the 3' end of either read are adapter sequence. This works for any adapter type without needing to specify sequences.

Additionally, fastp has 170+ built-in adapter sequences including both Illumina TruSeq and BGI/MGI adapters in its known adapters database. No platform-specific flags are needed.
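
The paired-end overlap idea described in this section can be sketched in a few lines of Python. This is a toy model under stated assumptions, not fastp's algorithm: the function names, the 30bp minimum overlap, and the 2-mismatch tolerance are all hypothetical choices made for illustration.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def insert_length_from_overlap(
    r1: str, r2: str, min_overlap: int = 30, max_mismatches: int = 2
) -> Optional[int]:
    """Return the insert length if it is no longer than the reads, else None.

    When the insert is shorter than the read, the first n bases of R1 equal
    the last n bases of the reverse complement of R2, where n is the insert
    length. Everything after position n in either read is adapter.
    """
    rc2 = reverse_complement(r2)
    longest = min(len(r1), len(rc2))
    for n in range(longest, min_overlap - 1, -1):
        head_of_r1 = r1[:n]
        tail_of_rc2 = rc2[len(rc2) - n:]
        mismatches = sum(a != b for a, b in zip(head_of_r1, tail_of_rc2))
        if mismatches <= max_mismatches:
            return n
    return None


def trim_by_overlap(r1: str, r2: str) -> Tuple[str, str]:
    """Cut both reads back to the insert when an overlap is found."""
    n = insert_length_from_overlap(r1, r2)
    if n is None:
        return r1, r2  # no qualifying overlap: leave the pair untouched
    return r1[:n], r2[:n]
```

No adapter sequence appears anywhere in the sketch, which is why overlap-based trimming is independent of the library kit. Its limitation is equally visible: pairs with inserts longer than the read length are returned unchanged, which is where a known-adapter database helps.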

BGI/MGI note

fastp’s known adapters include the MGI/BGI forward (AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA) and reverse (AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG) sequences. Auto-detection handles DNBseq data without additional configuration.

Output

File	Location	Description
Trimmed R1	`fastq_trimmed/<sample>_R1.fastq.gz`	Quality-filtered, adapter-trimmed read 1
Trimmed R2	`fastq_trimmed/<sample>_R2.fastq.gz`	Quality-filtered, adapter-trimmed read 2
JSON report	`fastq_trimmed/<sample>_fastp.json`	Machine-readable QC (consumed by MultiQC)
HTML report	`fastq_trimmed/<sample>_fastp.html`	Visual QC report (open in browser)

Runtime

Dataset	Threads	Time	Memory
30X WGS (~60 GB FASTQ)	8	~10-20 min	< 2 GB
chr22 test data	4	< 1 min	< 1 GB

fastp is I/O-bound, not CPU-bound. Threads beyond 8-16 provide minimal improvement.

Interpreting the Reports

HTML report

Open <sample>_fastp.html in a browser to see:

Before/after filtering summary: total reads, bases, Q20/Q30 rates
Quality score distribution: per-base quality across read positions
Base content: A/T/G/C/N proportions (for human WGS, roughly 29-30% each for A and T, 20-21% each for G and C, and N near 0%, flat across positions)
Adapter content: fraction of reads with detected adapters
Insert size distribution: fragment length peak (typically 300-500bp for WGS); fastp estimates insert size from read-pair overlap, so pairs whose insert is too long for the reads to overlap are counted as unknown
Duplication rate: estimated from the first 1M reads

What to look for

Adapter rate > 5%: Normal for some library preps, but indicates trimming is important
Q30 rate < 80% before filtering: Below-average sequencing quality
PolyG spike at read ends: Common on NovaSeq — fastp removes these automatically
Uneven base content at read starts: First 10-15bp often show bias from the sequence preference of fragmentation and adapter ligation; random hexamer priming causes a similar pattern in RNA-seq (normal, not actionable)
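
The two numeric checks in this list (adapter rate above 5%, Q30 rate below 80% before filtering) can be automated against the JSON report. The Python sketch below assumes the report exposes `summary.before_filtering.total_reads`, `summary.before_filtering.q30_rate`, and `adapter_cutting.adapter_trimmed_reads`; those key names reflect my understanding of fastp's JSON layout and should be checked against a real report. The function names are hypothetical.

```python
import json
from typing import Dict, List


def qc_flags(
    report: Dict, max_adapter_rate: float = 0.05, min_q30_rate: float = 0.80
) -> List[str]:
    """Return human-readable warnings for one parsed fastp JSON report."""
    before = report["summary"]["before_filtering"]  # assumed key layout
    total_reads = before["total_reads"]
    trimmed = report.get("adapter_cutting", {}).get("adapter_trimmed_reads", 0)
    adapter_rate = trimmed / total_reads if total_reads else 0.0
    q30_rate = before["q30_rate"]

    flags = []
    if adapter_rate > max_adapter_rate:
        flags.append(f"adapter rate {adapter_rate:.1%} > {max_adapter_rate:.0%}")
    if q30_rate < min_q30_rate:
        flags.append(f"Q30 rate {q30_rate:.1%} < {min_q30_rate:.0%} before filtering")
    return flags


def qc_flags_from_file(path: str) -> List[str]:
    """Load a report such as fastq_trimmed/<sample>_fastp.json and check it."""
    with open(path) as handle:
        return qc_flags(json.load(handle))
```

An empty list means neither threshold was crossed. The polyG and base-content checks are visual and are easier to judge in the HTML report.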

MultiQC Integration

fastp’s JSON report is automatically detected by MultiQC. The -R flag sets the sample name shown in the aggregated report. The JSON file must contain "before_filtering": { to be recognized.
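
If a report fails to appear in MultiQC, a quick check for the marker string mentioned above can rule out a reformatted or minified JSON file. The Python sketch below is a plain substring test over the whole text; MultiQC's own search may be stricter (for example, limited to the first lines of the file), and the function names are hypothetical.

```python
MULTIQC_FASTP_MARKER = '"before_filtering": {'


def looks_like_fastp_json(text: str) -> bool:
    """True if the text contains the marker string quoted on this page."""
    return MULTIQC_FASTP_MARKER in text


def file_looks_like_fastp_json(path: str) -> bool:
    """Apply the same check to a report file on disk."""
    with open(path) as handle:
        return looks_like_fastp_json(handle.read())
```

Note that the marker includes the space after the colon, so a report re-serialized without whitespace would fail this check even though its content is unchanged.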

Notes

fastp outputs gzip-compressed FASTQs when the output filename ends in .gz
Adapter sequence auto-detection (--detect_adapter_for_pe) is off by default in fastp for PE data, because PE adapters are already trimmed by overlap analysis; this script enables it explicitly for more thorough adapter removal
If you need to re-run fastp, delete the fastq_trimmed/ directory first
fastp is designed for short reads (Illumina, BGI/MGI). For long reads (ONT/PacBio), use platform-specific QC tools instead
