1000G enquiry – Breakpoints File Interpretation

Q1:
I’m trying to interpret your breakpoints file at
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/breakpoints/1KG_phase3_all_bkpts.v5.txt.gz.

Is this file the same as Supplementary Table 3 in the SV map paper?

A1:
Yes, they are the same.

Q2:
What VCF should be used to interpret this file? I’m having difficulty
finding a VCF that has all the IDs accounted for.

Does the breakpoints file contain information that is meant to
override that in the VCF? So if the VCF and the breakpoints file
disagree on the position of a variant, the breakpoints file should be
considered correct?

A2:
The SV events in the VCF file are the SVs that remain after taking the union of the call sets, among other processing steps. The breakpoint file contains only the SVs that each variant caller identified with breakpoint-level resolution. Neither file overrides the other; they should be treated as separate datasets. The breakpoint file can be considered to give more detailed information about the SV regions in the union call file.

Q3:
It looks like the breakpoints file contains an INSSEQ column, giving
(anchored) sequences that are inserted at the same time as deletion
events. That makes the deletion into a substitution of the shorter
sequence for the longer sequence, right?

A3:
Yes. The inserted sequences at these deletion sites are mostly micro-insertions (1-20 bp).
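
To make the "deletion plus inserted sequence equals substitution" reading concrete, here is a minimal Python sketch. The function name, the toy reference string and the 0-based half-open coordinates are assumptions for illustration only; they are not taken from the breakpoints file, whose own coordinate convention should be checked before reuse.

```python
def apply_deletion_with_insertion(reference, start, end, insseq=""):
    """Build the alternate haplotype for a deletion with an inserted sequence.

    Assumption: `start` and `end` are 0-based, half-open coordinates of the
    deleted bases, i.e. reference[start:end] is removed.
    """
    # Keep everything left of the deletion, add the micro-insertion,
    # then keep everything right of the deletion.
    return reference[:start] + insseq + reference[end:]


ref = "ACGTACGTACGT"          # toy reference, 12 bp
alt = apply_deletion_with_insertion(ref, 4, 10, insseq="TT")

deleted = ref[4:10]           # the 6 bp that are removed
print(deleted, "->", "TT")    # the event read as a substitution
print(alt)
```

Here the 6 bp deleted segment is replaced by the 2 bp inserted sequence, so the net effect is a substitution of a shorter sequence for a longer one.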

Q4:
It would be ideal for my application if I could get a VCF containing
the information from this file. Is that already available? Have the
more precise breakpoint calls been rolled into e.g.
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz
already? If not, do you have advice on how to cram this information
into a VCF while preserving its semantics?

A4:
I am not aware of a breakpoint file in VCF format. As a starting point, consider including just the chromosome, start, end and type information.
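
As a rough sketch of that starting point, the Python below writes one VCF-style record from a chromosome, start, end and type. It is not an official conversion. It assumes that start and end are 1-based inclusive coordinates of the affected bases (verify this against the breakpoints file), uses "N" as a placeholder REF base that should be replaced by the real reference base at POS, and uses a symbolic ALT allele with the SVTYPE and END INFO keys from the VCF specification. The function name and example values are hypothetical.

```python
def breakpoint_to_vcf_line(chrom, start, end, svtype, ref_base="N", sv_id="."):
    """Format a minimal VCF data line for one SV.

    Assumptions: `start`/`end` are 1-based inclusive; `ref_base` is a
    placeholder for the reference base at POS.
    """
    # For symbolic alleles such as <DEL>, POS is the base before the event.
    pos = start - 1
    info = f"SVTYPE={svtype};END={end}"
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO
    fields = [chrom, str(pos), sv_id, ref_base, f"<{svtype}>", ".", ".", info]
    return "\t".join(fields)


# Hypothetical record: a deletion of bases 1001-1500 on chromosome 1.
line = breakpoint_to_vcf_line("1", 1001, 1500, "DEL")
print(line)
```

A symbolic allele keeps the record compact but drops the inserted sequence; to preserve INSSEQ, the event would need explicit REF and ALT sequences or a custom INFO field declared in the header.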

Question regarding paper “Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library”

Q:

I read your excellent BreakSeq paper, "Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library", and I now have some whole-genome sequencing data to analyze. The breakpoint library you provide (http://sv.gersteinlab.org/breakseq/) is based on human genome NCBI build 36, but I am now using NCBI build 37. Should I lift over the coordinates to NCBI build 37, or realign the junction sequences to NCBI build 37 myself? Or is there a pre-compiled breakpoint junction library for NCBI build 37? Also, do you have any suggestions for adding the SVs identified in the 1000 Genomes Project to the breakpoint junction library?

A:
There are two sets of SV breakpoints that should be relevant to you:

The published 1000 Genomes pilot data in Mills et al., Nature 2011: http://www.nature.com/nature/journal/v470/n7332/extref/nature09708-s9.xls
The 1000 Genomes phase I data that is going to be published soon: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/integrated_call_sets/

The published pilot data is on NCBI build 36. Using liftover to convert the genomic coordinates to NCBI build 37 should suffice. You may want to double-check that the SV sizes and the junction sequences are consistent before and after the liftover.
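
The consistency check can be sketched in a few lines of Python. It assumes the lifted coordinates and the junction sequences have already been obtained by other means; the function name and the example coordinates are hypothetical, and the comparison ignores the possibility that a region maps to the opposite strand in the new build.

```python
def liftover_is_consistent(old_start, old_end, new_start, new_end,
                           old_junction=None, new_junction=None):
    """Return True if an SV kept its size (and junction, if given) after liftover."""
    # The SV size should not change between builds.
    same_size = (old_end - old_start) == (new_end - new_start)

    # Compare junction sequences only when both are supplied.
    same_junction = True
    if old_junction is not None and new_junction is not None:
        same_junction = old_junction.upper() == new_junction.upper()

    return same_size and same_junction


# Hypothetical coordinates: the SV simply shifted by 500 bp in the new build.
ok = liftover_is_consistent(10_000, 12_000, 10_500, 12_500, "acgtAC", "ACGTAC")
# Hypothetical coordinates: the SV changed size, so it needs manual review.
bad = liftover_is_consistent(10_000, 12_000, 10_500, 12_900)
print(ok, bad)
```

SVs that fail the check are candidates for realigning the junction sequence to build 37 instead of relying on the lifted coordinates.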

The phase I data is on NCBI build 37. You may simply take the junction sequences at the breakpoints to add to the library.
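
One way to picture "taking the junction sequences at the breakpoints" for a deletion is the Python sketch below. It is illustrative only and is not the BreakSeq implementation: the flank length, the function name, the 0-based half-open coordinates and the toy reference are all assumptions, and in practice the reference would come from a build 37 FASTA file.

```python
def deletion_junction(reference, start, end, flank=30, insseq=""):
    """Return the sequence spanning a deletion breakpoint on the alternate allele.

    Assumptions: reference[start:end] is the deleted segment (0-based,
    half-open); `flank` bases are taken on each side of the breakpoint.
    """
    left = reference[max(0, start - flank):start]   # bases just before the deletion
    right = reference[end:end + flank]              # bases just after the deletion
    # Any micro-insertion sits between the two flanks.
    return left + insseq + right


toy_ref = "AAAAACCCCCGGGGGTTTTT"
junction = deletion_junction(toy_ref, 5, 15, flank=3)
junction_with_ins = deletion_junction(toy_ref, 5, 15, flank=3, insseq="G")
print(junction, junction_with_ins)
```

Reads that align across such a junction, but not across the intact reference, support the deletion allele.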