Running SVFX

I would like to run your new SVFX method on some structural variants. For full disclosure, I’m working on a method to assess the pathogenicity of germline SVs, and would like to compare with yours. Based on reading your preprint, I believe our methods are quite distinct in terms of training data. I think it’s great you’ve already put code on github, but I’m not sure what data files are needed to run the code. Could you put me in touch with one of your students to help me run SVFX locally?

Thanks for your interest in SVFX. We have reported our feature list in Supplementary Table 1.

Overall, our features are extracted from genomic annotations and various functional genomics and epigenomics signal files.

You can download the signal files from the IHEC or Roadmap Epigenomics data portals. As you might have noticed, we created multiple tissue-specific models for our analysis.

For the germline model, we also built a feature matrix based on the H1-hESC cell line, which performed quite well. On the SVFX GitHub page, we have uploaded the BED files for the different annotations used in our study (under the data folder).
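As a rough illustration of how an annotation BED file can be turned into a per-SV feature, the sketch below computes the fraction of an SV that is covered by annotation intervals. This is only one plausible kind of feature and is not necessarily how SVFX computes its feature matrix; the function name and the tuple layout (chromosome, start, end, with 0-based half-open BED-style coordinates) are assumptions made for the example.

```python
def overlap_fraction(sv, intervals):
    """Fraction of an SV covered by annotation intervals (hypothetical feature).

    `sv` is (chrom, start, end) with end > start; `intervals` is an iterable
    of (chrom, start, end). Overlapping annotation intervals are merged so
    that no base is counted twice.
    """
    chrom, start, end = sv
    # Keep only intervals that overlap the SV, clipped to the SV boundaries.
    clipped = sorted(
        (max(start, s), min(end, e))
        for c, s, e in intervals
        if c == chrom and s < end and e > start
    )
    covered = 0
    cur_start, cur_end = None, None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Extend the current merged block.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (end - start)


sv = ("chr1", 100, 200)
annotation = [
    ("chr1", 50, 120),
    ("chr1", 110, 150),
    ("chr2", 100, 200),
    ("chr1", 190, 300),
]
print(overlap_fraction(sv, annotation))
```

In practice the intervals would be read from the BED files in the data folder, and one such value would be computed per annotation and per SV.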

1000G enquiry – Breakpoints File Interpretation

I’m trying to interpret your breakpoints file at

Is this file the same as Supplementary Table 3 in the SV map paper?

Yes, they are the same.

What VCF should be used to interpret this file? I’m having difficulty
finding a VCF that has all the IDs accounted for.

Does the breakpoints file contain information that is meant to
override that in the VCF? So if the VCF and the breakpoints file
disagree on the position of a variant, the breakpoints file should be
considered correct?

The SV events in the VCF file are all the SVs identified after taking the union of the call sets, among other steps. The breakpoints file contains only the SVs that each variant caller identified at breakpoint-level resolution. Neither file overrides the other; they should be treated as separate datasets. The breakpoints file can be regarded as giving more detailed information about the SV regions in the union call file.

It looks like the breakpoints file contains an INSSEQ column, giving
(anchored) sequences that are inserted at the same time as deletion
events. That makes the deletion into a substitution of the shorter
sequence for the longer sequence, right?

Yes, these deletions contain mostly micro-insertions (1-20bp) at the deletion site.

It would be ideal for my application if I could get a VCF containing
the information from this file. Is that already available? Have the
more precise breakpoint calls been rolled into e.g.
already? If not, do you have advice on how to cram this information
into a VCF while preserving its semantics?

I am not aware of a breakpoints file in VCF format. You could start by including just the chromosome, start, end and type information.
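A minimal sketch of that suggestion is shown below: it writes one VCF line per breakpoint record using only the chromosome, start, end and type. The record layout (a dict with `id`, `chrom`, `pos`, `end` and `svtype` keys) and the function name are assumptions for illustration and do not reflect the actual column names of the breakpoints file. It also assumes `pos` is already the 1-based position of the base preceding the event, and it writes `N` as a placeholder REF base, which a real conversion would look up in the reference. Symbolic alleles such as `<DEL>` with `SVTYPE` and `END` in the INFO column follow the VCF specification.

```python
VCF_HEADER = "\n".join([
    "##fileformat=VCFv4.2",
    '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">',
    '##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">',
    '##ALT=<ID=DEL,Description="Deletion">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
])


def breakpoint_to_vcf_line(rec):
    """Format one (hypothetical) breakpoint record as a symbolic-allele VCF line."""
    info = "SVTYPE={};END={}".format(rec["svtype"], rec["end"])
    fields = [
        rec["chrom"],
        str(rec["pos"]),
        rec["id"],
        "N",                            # placeholder REF base (assumption)
        "<{}>".format(rec["svtype"]),   # symbolic ALT allele, e.g. <DEL>
        ".",                            # QUAL not available
        ".",                            # FILTER not available
        info,
    ]
    return "\t".join(fields)


example = {"id": "sv1", "chrom": "1", "pos": 1000, "end": 1500, "svtype": "DEL"}
print(VCF_HEADER)
print(breakpoint_to_vcf_line(example))
```

Other SV types would need their own `##ALT` header lines, and columns such as INSSEQ would need a custom INFO field if they are to be preserved.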

Using the current 1000 Genomes reference (Phase4 reference) with BreakSeq2 for SV calling

We have started a Cloud-based Cancer SV Calling project and would like to use BreakSeq2 to perform SV calling, but with the current 1000 Genomes reference (Phase4 reference). Because BreakSeq2 relies on the coordinates in the breakpoint library GFF, we were hoping that we could either obtain an updated breakpoint library or get some advice on the feasibility of using coordinate liftover (via the available hg19 to hg38 UCSC chain files) to update the coordinates in the GFF inside the latest library hosted on your lab website at:

We are under a time constraint with regard to the Cloud Compute funding, so we would be very grateful if you could reply soon.

I think the best option right now would be to lift over the coordinates to hg38. Both the GFF and the INS files need to be lifted over (you can use CrossMap, which supports GFF). After the liftover, check that the SV lengths were lifted correctly; it might be good to ignore SVs whose lengths changed after the liftover. Note that for the INS file, you will need to write a script to lift over the coordinates in the read name. You can check out the example on the BreakSeq2 page ( for how to run from GFF (you will need both the GFF and the INS file). Hope that helps.
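The length check after liftover could look roughly like the sketch below. It assumes the GFF records before and after the liftover have already been parsed into dictionaries keyed by an SV identifier; the identifiers, the tuple layout and the function name are hypothetical. SVs that failed to map, or whose length changed, are dropped.

```python
def keep_consistent(before, after):
    """Return the IDs of SVs whose length is unchanged by the liftover.

    `before` and `after` map a (hypothetical) SV ID to (chrom, start, end).
    SVs missing from `after` (unmapped by the liftover) are dropped too.
    """
    kept = []
    for sv_id, (_, start, end) in before.items():
        if sv_id not in after:
            continue  # unmapped during liftover
        _, new_start, new_end = after[sv_id]
        if new_end - new_start == end - start:
            kept.append(sv_id)
    return kept


before = {
    "a": ("chr1", 100, 600),
    "b": ("chr1", 2000, 2300),
    "c": ("chr2", 50, 80),
}
after = {
    "a": ("chr1", 150, 650),    # same length: kept
    "b": ("chr1", 2100, 2450),  # length changed: dropped
}
print(keep_consistent(before, after))  # ['a']
```

A stricter version could also require that the chromosome is unchanged after the liftover.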

Source code for paper “MSB: A mean shift based approach for the analysis of structural variation in the genome”

I have recently read your paper "MSB: A mean shift based approach for the analysis of structural variation in the genome", but it was hard for me to implement your method. Could you please send me your source code for reference? Thanks for your kind consideration.

The mean-shift algorithm is very similar to that of CNVnator, for which we distribute code. I suggest you use that.

Inferring the ancestral state of inversions using BreakSeq

I am writing to you because I am working with the BreakSeq software
which was developed by your team and I am having some troubles.

The work of our group is focused on inversions, and recently we have
started using BreakSeq for the annotation of the breakpoint features.
BreakSeq seems to work fine for all its steps except when inferring the
ancestral state of the inversions. I have successfully installed Blat on our
server and also opened the server connection for the three primate genomes.
In addition, I updated the paths of the BreakSeq configuration file which
allows the correct execution of BreakSeq.

However, if I check the ancestral state of some validated inversions from
our database (, which we know have different
orientations (standard or inverted) in the primate genomes, BreakSeq
annotates ALL of them as Rect "0:0:0", which I understand to mean that the
inversion has the same orientation in all 3 primate genomes.

Here is an example of what I am trying to explain. If I run BreakSeq
for the annotation of the inversion HsInv0501
(, whose orientation is
standard for chimpanzee but inverted for orangutan and macaque, I would
expect the following output: Rect "0:1:1". However, BreakSeq output is Rect

In conclusion, my main question is the following: can BreakSeq predict
the ancestral state in the case of inversions? If it can, what do you think
I am doing wrong to obtain Rect "0:0:0" as output every time?

I am attaching the GFF input file containing the inversions that are
oriented differently in at least one of the three primates, the BreakSeq
configuration file which I am using, and also the resulting output folder after
running BreakSeq.

BreakSeq was not initially intended to look at inversions, but I suspect it should be usable with some modifications. Alternatively, you could reproduce the way BreakSeq interprets alignments to primate genomes to infer ancestry.

Size of SV in BreakSeq output

I have been using BreakSeq for identification of SVs along with BreakDancer, CNVnator and Pindel. I was able to run BreakSeq and get SVs. However, while recently submitting data to dbVar, I came to know that I should also provide the SIZE of each SV. As the BreakSeq output does not report the SIZE of each SV, it has become a bit difficult to provide this information to dbVar. However, I do find POS and END positions in the output. Can I take the difference between POS and END as the SIZE of an SV?

For deletions you can use POS and END to get the size. For insertions, the current version does not give you the size. We are planning a next version that should report it. If you need it now, you can get the size from the insertion FASTA distributed along with BreakSeq.
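A minimal sketch of both cases follows: the deletion size is taken as END minus POS, and insertion sizes are read as sequence lengths from FASTA text. The function names are made up for this example, and it is an assumption that the record names in the insertion FASTA can be matched to the insertions in your BreakSeq output.

```python
def deletion_size(pos, end):
    """Size of a deletion taken as END minus POS."""
    return end - pos


def insertion_sizes(fasta_text):
    """Map each FASTA record name to the length of its sequence."""
    sizes = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


example_fasta = ">ins1\nACGTAC\nGT\n>ins2 some description\nTTTT\n"
print(deletion_size(1000, 1500))       # 500
print(insertion_sizes(example_fasta))  # {'ins1': 8, 'ins2': 4}
```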

Qs about breakseq tool

I have just installed the BreakSeq tool developed by your lab to analyse structural variants in a pancreatic cancer genome.

All the required modules have been downloaded; however, I could not find documentation on how to run the tool.

I was wondering whether there is a manual or an example showing how to run the tool.

Or could I contact someone in the lab who is familiar with BreakSeq?

everything we have is at

query regarding Breakseq usage


I am using BreakSeq to find the mechanism of structural variations (SVs) mapped using different packages. I got stuck while using the svMech module, probably because it lacks a user manual.
I only want to find the mechanism of the SVs, so I have commented out the ancestral state and feature analysis steps in the annotate script under the bin directory of BreakSeq.
It works fine if I give only deletions in the GFF file, but when I give insertions in the GFF file, it exits with the following error:

********** Creating standard breakpoint library **********
Traceback (most recent call last):
File "/home/pankaj/breakseq/breakseq-1.3/bin/svUtil/", line 20, in <module>
File "/home/pankaj/breakseq/breakseq-1.3/lib/biopy/io/", line 103, in get_sequence
return self.base.get_sequence(, self.start, self.end)
AttributeError: 'NoneType' object has no attribute 'get_sequence'
Command exited with non-zero status 1
0.13user 0.04system 0:00.21elapsed 83%CPU (0avgtext+0avgdata 60800maxresident)k
0inputs+8outputs (0major+4306minor)pagefaults 0swaps

Could you please answer the following questions regarding BreakSeq?

(1) For insertions, do I need to provide the inserted sequence explicitly, or does the package find it internally?

(2) Does this package also find the mechanism of translocations? If yes, which keyword should I use in the 3rd column of the GFF file?

1) You have to provide the inserted sequence. (see as an example)

2) It does not currently support translocations (they are not mentioned in our paper).

Question regarding paper “Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library”


I read your excellent BreakSeq paper "Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library", and now I have some whole genome sequencing data to be analyzed. The breakpoint library you apply ( is based on human genome NCBI build 36, but I use NCBI build 37 now. So should I lift over the coordinates to NCBI build 37, or realign the junction sequences to NCBI build 37 myself first? Or is there a pre-compiled breakpoint junction library for NCBI build 37? By the way, do you have any suggestions about adding the SVs identified in the 1000 Genomes Project to the breakpoint junction library?

There are two sets of SV breakpoints that should be relevant to you:

The published 1000 Genomes pilot data in Mills et al Nature 2010:
The 1000 Genomes phase I data that is going to be published soon:

The published pilot data is on NCBI build 36. Using liftover to convert the genomic coordinates to NCBI build 37 should suffice. You might want to double-check that the SV sizes and the junction sequences are consistent before and after the liftover.

The phase I data is on NCBI build 37. You may simply take the junction sequences at the breakpoints to add to the library.
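As a rough sketch of what taking the junction sequence at a breakpoint can mean for a deletion, the example below joins the reference sequence flanking the deleted segment on each side. The function name, the 0-based half-open coordinates and the default flank length of 30 bp are assumptions for illustration, not the settings used to build the BreakSeq library.

```python
def deletion_junction(ref_seq, start, end, flank=30):
    """Junction sequence spanning a deletion of ref_seq[start:end].

    Coordinates are 0-based and half-open. The junction is the `flank`
    bases upstream of the deletion joined directly to the `flank` bases
    downstream of it, i.e. the sequence of the allele carrying the deletion.
    """
    left = ref_seq[max(0, start - flank):start]
    right = ref_seq[end:end + flank]
    return left + right


ref = "AAAAACCCCCGGGGGTTTTT"
# Delete ref[5:15] ("CCCCCGGGGG") and keep 3 flanking bases on each side.
print(deletion_junction(ref, 5, 15, flank=3))  # AAATTT
```

With a real reference, `ref_seq` would be the chromosome sequence loaded from the build 37 FASTA, and the flank length would be chosen to suit the read length.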