EXPERIMENTAL: This step uses a heuristic position-binning approach that may over-count calls from the same caller. Results should be treated as a rough intersection, not a true consensus merge. For production use, consider SURVIVOR or Jasmine with proper multi-sample VCF merging.
Performs a rough intersection of structural variant (SV) calls from multiple independent callers: Manta (step 4), Delly (step 19), and CNVnator (step 18). SVs are binned by chromosome, position (1 kb windows), and SV type; bins with calls from two or more callers are retained. This is an approximation, not the breakpoint-aware merge that SURVIVOR or Jasmine would produce.
Individual SV callers each have distinct biases and false-positive profiles:
Taking the intersection across callers reduces false positives. An SV seen by two independent algorithms using different signal types is more likely to be real. Note that dedicated SV comparison tools (SURVIVOR, Jasmine) use breakpoint distance, size similarity, and strand matching for more accurate merging than the position-binning heuristic used here.
The script uses a position-binning approach with bcftools rather than SURVIVOR, because the SURVIVOR Docker image is not reliably available. SVs are grouped by chromosome, binned position (1 kb windows), and SV type, and bins with calls from two or more callers are kept.
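The binning idea can be illustrated with a small, self-contained sketch. This is not the script's actual code: the toy records, the variable names, and the choice to count distinct callers per bin are assumptions made for illustration, and integer division of the position by 1000 is assumed for the 1 kb window.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy records (made up for illustration):
# caller, chromosome, position, SV type, tab-separated.
records=$(printf '%s\t%s\t%s\t%s\n' \
  manta chr1 10250 DEL \
  delly chr1 10900 DEL \
  manta chr1 52000 DUP \
  delly chr2 999   INV \
  manta chr2 1001  INV)

# Bin key = chromosome + floor(position / 1000) + SVTYPE.
# A bin is kept when at least two distinct callers contribute to it.
consensus_bins=$(printf '%s\n' "$records" | awk -F'\t' '
  {
    key = $2 ":" int($3 / 1000) ":" $4
    # Count each caller at most once per bin.
    if (!((key, $1) in seen)) { seen[key, $1] = 1; ncallers[key]++ }
  }
  END { for (k in ncallers) if (ncallers[k] >= 2) print k }
' | sort)

echo "$consensus_bins"
```

In this toy input only the chr1 deletion bin survives. The two chr2 inversions are 2 bp apart but fall on opposite sides of a window boundary (bins 0 and 1), so they are not matched; this is one way a fixed-window heuristic differs from a breakpoint-distance merge.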
staphb/bcftools:1.21
At least two of the following (the script auto-detects which are available):
| Caller | Expected path |
|---|---|
| Manta (step 4) | ${GENOME_DIR}/${SAMPLE}/manta/results/variants/diploidSV.vcf.gz |
| Delly (step 19) | ${GENOME_DIR}/${SAMPLE}/delly/${SAMPLE}_sv.vcf.gz |
| CNVnator (step 18) | ${GENOME_DIR}/${SAMPLE}/cnvnator/${SAMPLE}_cnvs.vcf.gz or ${GENOME_DIR}/${SAMPLE}/cnvnator/${SAMPLE}_cnvs.txt |
If CNVnator output is in TXT format (its native output), the script automatically converts it to VCF before merging.
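A conversion of this kind might look roughly like the sketch below. It is not the script's own implementation: the function name `cnv_to_vcf`, the sample rows, and the choice of INFO fields are hypothetical. It assumes CNVnator's tab-separated call format, where the first column is `deletion` or `duplication` and the second is `chr:start-end`, and it emits VCF record lines only (a real VCF also needs header lines before bcftools will accept it).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical CNVnator-style rows (made up for illustration):
# type, chr:start-end, size, normalized read depth.
cnv_txt=$(printf '%s\t%s\t%s\t%s\n' \
  deletion    chr1:10001-12000 2000 0.05 \
  duplication chr2:50001-58000 8000 1.62)

# Convert each row to a minimal symbolic-allele VCF record.
# Assumes contig names contain no ":" or "-" characters.
cnv_to_vcf() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    split($2, loc, /[-:]/)                      # loc[1]=chrom, loc[2]=start, loc[3]=end
    svtype = ($1 == "deletion") ? "DEL" : "DUP"
    info = "SVTYPE=" svtype ";END=" loc[3]
    print loc[1], loc[2], ".", "N", "<" svtype ">", ".", "PASS", info
  }'
}

vcf_body=$(printf '%s\n' "$cnv_txt" | cnv_to_vcf)
echo "$vcf_body"
```

Using symbolic alleles (`<DEL>`, `<DUP>`) with `SVTYPE` and `END` in INFO keeps the records comparable to the other callers' output on the fields the binning step needs.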
./scripts/22-survivor-merge.sh your_name
Bin key: chromosome + position/1000 + SVTYPE

| File | Contents |
|---|---|
| ${SAMPLE}_sv_consensus.vcf.gz | Consensus SVs called by 2+ callers |
| ${SAMPLE}_sv_consensus.vcf.gz.tbi | Tabix index |
| sv_files.txt | List of input VCFs used |
| consensus_raw.txt | Intermediate merged records |
All output is written to ${GENOME_DIR}/${SAMPLE}/sv_merged/.
~5-15 minutes (mostly I/O reading the input VCFs).
A typical 30X WGS genome produces:
After consensus filtering, expect 200-1,000 multi-caller SVs. These have lower false-positive rates than single-caller calls, though the 1 kb binning heuristic is less precise than dedicated tools like SURVIVOR or Jasmine.
SV types in the output:
# Count consensus SVs by type
docker run --rm -v "${GENOME_DIR}:/genome" staphb/bcftools:1.21 \
bcftools query -f '%INFO/SVTYPE\n' \
/genome/${SAMPLE}/sv_merged/${SAMPLE}_sv_consensus.vcf.gz | sort | uniq -c | sort -rn