Pre-alignment quality control: removes adapter sequences, trims low-quality bases, filters short reads, and removes polyG tails. Produces trimmed FASTQs plus JSON and HTML QC reports.
Raw FASTQ reads contain adapter contamination and low-quality bases that can cause:
This is particularly important for BGI/MGI (DNBseq) and Nebula Genomics data, where adapter contamination rates are higher than typical Illumina libraries. Aligners like minimap2 and BWA-MEM2 do not detect or remove adapters — they only soft-clip unmapped portions as a side effect of local alignment.
DeepVariant uses base quality as one of its 6 input channels. Adapter bases carry inflated quality scores that DeepVariant cannot distinguish from real sequence.
fastp v1.3.1 — all-in-one FASTQ preprocessor with SIMD acceleration and parallel compression.
quay.io/biocontainers/fastp:1.3.1--h43da1c4_0
export GENOME_DIR=/path/to/data
./scripts/01b-fastp-qc.sh <sample_name>
SKIP_TRIM=true ./scripts/run-all.sh <sample_name> <sex>
When skipped, the alignment step reads raw FASTQs directly (existing behavior).
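The fallback described above can be sketched as a small helper. This is only an illustration: the function name `fastq_for_alignment`, the raw-read directory `fastq/`, and the raw file naming are assumptions, not taken from the pipeline scripts.

```shell
# Hypothetical helper: print the FASTQ path the alignment step would read.
# Assumed layout (not confirmed by the pipeline):
#   <sample_dir>/fastq/<sample>_R1.fastq.gz          raw reads
#   <sample_dir>/fastq_trimmed/<sample>_R1.fastq.gz  fastp output
# Usage: fastq_for_alignment <sample_dir> <sample> <R1|R2>
fastq_for_alignment() (
  sample_dir=$1
  sample=$2
  read_id=$3
  trimmed="$sample_dir/fastq_trimmed/${sample}_${read_id}.fastq.gz"
  raw="$sample_dir/fastq/${sample}_${read_id}.fastq.gz"
  # Use trimmed reads only when trimming was not skipped and the file is non-empty.
  if [ "${SKIP_TRIM:-false}" != "true" ] && [ -s "$trimmed" ]; then
    printf '%s\n' "$trimmed"
  else
    printf '%s\n' "$raw"
  fi
)
```

With `SKIP_TRIM=true`, or when no trimmed file exists, the helper prints the raw path, which mirrors the behavior described for the skipped case.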
# Options are collected in an array so each one can carry a comment;
# a comment placed after a trailing backslash would end the command early.
fastp_args=(
  -i input_R1.fastq.gz -I input_R2.fastq.gz      # Paired-end input
  -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz  # Trimmed output
  --detect_adapter_for_pe       # Adapter sequence auto-detection for PE (off by default)
  --qualified_quality_phred 20  # Bases below Q20 are "unqualified"
  --cut_front --cut_tail        # Sliding window trim from both ends
  --cut_mean_quality 20         # Window mean quality threshold
  --length_required 36          # Discard reads < 36bp
  -g                            # PolyG tail trimming (NovaSeq/NextSeq)
  -R "sample_name"              # Report title (used by MultiQC)
  -j report.json -h report.html # QC reports
  -w 8                          # Worker threads
)
fastp "${fastp_args[@]}"
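For paired-end data, fastp uses overlap analysis to detect adapters: it aligns R1 against the reverse complement of R2. When the insert is shorter than the read length, each read runs past the end of the insert into the adapter, so the bases that overhang the overlapping region are adapter sequence and are trimmed. This works for any adapter type without needing to specify sequences.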
Additionally, fastp has 170+ built-in adapter sequences including both Illumina TruSeq and BGI/MGI adapters in its known adapters database. No platform-specific flags are needed.
fastp’s known adapters include the MGI/BGI forward (AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA) and reverse (AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG) sequences. Auto-detection handles DNBseq data without additional configuration.
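To make the overlap idea concrete, here is a toy sketch in POSIX shell. It is not fastp's implementation, which tolerates mismatches and is considerably more involved; this version uses exact matching only, assumes equal-length reads, and the function names, the 20 bp insert, and the 30 bp read length are made up for illustration. The adapter tails are the first 10 bases of the MGI/BGI sequences quoted above.

```shell
# Toy model of overlap-based adapter detection (exact matches only).

# Reverse-complement a DNA string using POSIX parameter expansion and tr.
revcomp() (
  s=$1
  out=
  while [ -n "$s" ]; do
    last=${s#"${s%?}"}   # last base of s
    out=$out$last
    s=${s%?}             # drop the last base
  done
  printf '%s\n' "$out" | tr 'ACGT' 'TGCA'
)

# Print the insert length implied by the R1/R2 overlap, assuming the insert
# is no longer than the read. Exit status 1 if no overlap of at least
# min_overlap bases (default 8) is found.
infer_insert_len() (
  r1=$1
  rc2=$(revcomp "$2")
  min_overlap=${3:-8}
  n=${#r1}
  len=$n
  while [ "$len" -ge "$min_overlap" ]; do
    r1_head=$(printf '%s\n' "$r1" | cut -c "1-$len")
    rc2_tail=$(printf '%s\n' "$rc2" | cut -c "$((n - len + 1))-")
    if [ "$r1_head" = "$rc2_tail" ]; then
      echo "$len"
      exit 0
    fi
    len=$((len - 1))
  done
  exit 1
)

# Made-up 20 bp insert sequenced with 30 bp reads: each read runs
# 10 bases into an adapter.
insert=ACGTTGCAAGGCTTACCGAT
r1="${insert}AAGTCGGAGG"
r2="$(revcomp "$insert")AAGTCGGATC"

len=$(infer_insert_len "$r1" "$r2")
echo "insert length: $len"
echo "R1 adapter tail: $(printf '%s\n' "$r1" | cut -c "$((len + 1))-")"
```

Everything in R1 past the inferred insert length is adapter, and the same holds for R2, which is why no adapter sequence has to be supplied for this kind of trimming.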
| File | Location | Description |
|---|---|---|
| Trimmed R1 | fastq_trimmed/<sample>_R1.fastq.gz | Quality-filtered, adapter-trimmed read 1 |
| Trimmed R2 | fastq_trimmed/<sample>_R2.fastq.gz | Quality-filtered, adapter-trimmed read 2 |
| JSON report | fastq_trimmed/<sample>_fastp.json | Machine-readable QC (consumed by MultiQC) |
| HTML report | fastq_trimmed/<sample>_fastp.html | Visual QC report (open in browser) |
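A quick way to confirm that a run produced all four files listed above is a check like the following. The helper name `check_fastp_outputs` and the idea of passing the parent directory of `fastq_trimmed/` as the first argument are assumptions for this sketch; the file names come from the table.

```shell
# Hypothetical helper: verify that the four expected outputs exist and are non-empty.
# Usage: check_fastp_outputs <directory containing fastq_trimmed/> <sample>
check_fastp_outputs() (
  base=$1
  sample=$2
  status=0
  for suffix in _R1.fastq.gz _R2.fastq.gz _fastp.json _fastp.html; do
    f="$base/fastq_trimmed/${sample}${suffix}"
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  exit "$status"
)
```

The function returns non-zero if any file is missing or empty, so it can gate the alignment step, for example `check_fastp_outputs "$GENOME_DIR/<sample_name>" <sample_name> || exit 1` if that is where your outputs live.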
| Dataset | Threads | Time | Memory |
|---|---|---|---|
| 30X WGS (~60 GB FASTQ) | 8 | ~10-20 min | < 2 GB |
| chr22 test data | 4 | < 1 min | < 1 GB |
fastp is I/O-bound, not CPU-bound. Threads beyond 8-16 provide minimal improvement.
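Given that guidance, a wrapper script might cap the thread count instead of passing every available core to `-w`. The sketch below is one way to do that; the helper name `fastp_threads` and the default cap of 8 are assumptions chosen to match the table above.

```shell
# Hypothetical helper: use the available cores, but never more than a cap (default 8).
# Usage: fastp_threads <cores> [cap]
fastp_threads() (
  cores=$1
  cap=${2:-8}
  if [ "$cores" -lt "$cap" ]; then
    echo "$cores"
  else
    echo "$cap"
  fi
)

# nproc is part of GNU coreutils; getconf is the fallback elsewhere.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null)
cores=${cores:-1}
echo "fastp -w $(fastp_threads "$cores")"
```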
Open <sample>_fastp.html in a browser to see:
fastp’s JSON report is automatically detected by MultiQC. The -R flag sets the sample name shown in the aggregated report. The JSON file must contain "before_filtering": { to be recognized.
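If a report does not show up in MultiQC, a simple first check is whether the JSON file contains the marker string mentioned above. The helper name `is_fastp_json` is made up for this sketch, and it tests only for the marker, not for a valid or complete report.

```shell
# Hypothetical helper: succeed if the file contains the marker string
# that identifies a fastp JSON report.
# Usage: is_fastp_json <path to JSON report>
is_fastp_json() {
  grep -qF '"before_filtering": {' "$1"
}

# Example (path is illustrative):
#   is_fastp_json fastq_trimmed/sample_fastp.json && echo "looks like a fastp report"
```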
[...] .gz
Adapter sequence auto-detection (--detect_adapter_for_pe) is disabled by default in fastp for PE data; this script enables it explicitly for thorough adapter removal.
[...] fastq_trimmed/ directory first