ASE analysis within your article (A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals)

I am writing to your regarding the code to analyse ASE in RNA-seq data present within this Article, specifically the beta diversity evaluation and application test. I was wondering if the code is available, I would like to apply it compare samples.

The code for calling allele-specific sites is available at

The specific scripts for the beta-binomial test are

Question about liftover from pat/mat to ref

I’ve used your AlleleSeq package to get NA12878 diploid genome alignment. I have two questions:

1. How can I liftover from paternal/maternal alignments to hg19 coordinates? Based on the documents, the map files generated by vcf2diploid are used to convert from hg19 to pat/mat, not the other direction. Last year I found an R script "" from your website and used that to convert from pat/mat to hg19. However I cannot find it anymore. Do you have an updated script for this liftover? If "" is still valid to use, what reference should we cite in our manuscripts?

2. I noticed that both the "" script in AlleleSeq and the R script "" assume bowtie format in input alignment files. Do you have upgraded versions that are compatible with SAM format?

1. We use the chainSwap tool ( to flip the maternal.chain and paternal.chain files (generated by vcf2diploid) to the other direction. Then a .bed file in the parental coordinates can be converted into a .bed file in the reference coordinates using the UCSC liftOver tool and the swapped chain files.

This approach is straightforward for any arbitrary interval, but if you are converting alignments, it may require additional scripting or using other tools. We’ve received a few issues/questions with the script you are mentioning and since we didn’t develop it and aren’t currently planning to maintain, we removed the link from the web-site for now.

2. We do not have updated released versions yet, but this is one of the things that will be introduced in future versions and thus I might be able to help. What exactly are you interested in: is it just compatibility with SAM-formatted bowtie1 (i.e. ungapped alignments) output or are you trying to use another aligner?

Thanks for your detailed answers. They are very helpful.
We used bowtie2 or tophat2 as the aligners for DNA or RNA.
Currently we hope to liftover the pat/may alignments to hg19 for downstream analyses (e.g. HiC or DEG). We can’t compare results if they are not liftovered to the same reference.
Could I take the mapped position in sam, convert it to bed, liftover to hg19, and then chip this new position back to the sam? Do you have a better solution?

The start coordinates of the reads can be transferred this way, but I don’t think inserting them back to the sam file will produce a correct sam/bam file overall: some of the other fields (CIGAR string, mismatches, etc) will need to be adjusted as well. Depending on the further analysis this may be an issue. CrossMap seems to work with sam/bam and chain files. I have never used it though.

Maybe it is easier to do the analysis on the parental alignments? Say, for DEG in order to generate read counts table, one might consider transferring the annotation to mat and pat coordinates. Then for every exon/gene extract the reads mapped to it in both alignments and use the number of unique read-ids as the gene read count.

Where does the merge script come in your approach? Do you want to merge the alignments before transferring them to hg19 and then do further analyses (which I am guessing, do not involve looking into allele-specific counts)?

AllelSeq Info

I am trying to find Allel specific expression from a hybrid mouse RNA-Seq. I
have the complete genome of the two parental strains and the corresponding
SNPs and Indels as vcf format. I don’t have a complete genome and genotype
for the hybrid mouse.

I am looking to your AllelSeq tools you have published and attempted to
create a custom genome using vcf2diploid_v0.2.6a. Would you please guild me
OR forward to someone in your group who will guide me if the tools will be
useful in such context? I see the tool was developed with Trios.

AlleleSeq/vcf2diploid can be used both with trios and with single samples.
Either way, you would need to have a .vcf for the sample (hybrid mouse) in order to use the tools though.

Illumina/PlatinumGenomes — Can’t access NA12878 files on ftp site (#4)

The map files I’m looking at are in NA12878_diploid_2017_jan7 and the VCF I’m looking at is IPG_2016_1.0_callset/NA12878.vcf.gz.

Just to make sure I’m not misunderstanding, does the map file contain variant coordinates, or is it another form of chain file that gives the offsets of match blocks? I’ve been assuming they’re coordinates of variants listed in the VCF file. Also, I’m assuming the lines with one or more zeroes are indels?

Here are some examples of my attempts to match the coordinates between the map files and the VCF.

First, these three lines are taken from

249172383 249098765 249117098
249193024 249119406 249137737
249193998 249120380 249138710

When I take the first number from each of these lines and grep it out of the VCF file, I get no hits, even if I add 1 or subtract 1 to account for differences in the use of 0-based versus 1-based coordinates:

cat NA12878.vcf.gz | gunzip | grep 249172383
cat NA12878.vcf.gz | gunzip | grep 249172382
cat NA12878.vcf.gz | gunzip | grep 249172384

cat NA12878.vcf.gz | gunzip | grep 249193024
cat NA12878.vcf.gz | gunzip | grep 249193023
cat NA12878.vcf.gz | gunzip | grep 249193025

cat NA12878.vcf.gz | gunzip | grep 249193998
cat NA12878.vcf.gz | gunzip | grep 249193997
cat NA12878.vcf.gz | gunzip | grep 249193999

I then tried the opposite: I took coordinates from the VCF and grepped those from the map files, but also with no hits. Here are three lines from the VCF file:

chr19 59097308 . G A . PASS MTD=cgi,bwa_freebayes,bwa_platypus,isaac_strelka,bw
a_gatk;KM=8.86;KFP=0;KFF=0 GT 0|1
chr19 59097977 . T C . PASS MTD=isaac_strelka;KM=7.00;KFP=0;KFF=0 GT 0|1
chr19 59098781 . G T . PASS MTD=bwa_gatk;KM=5.14;KFP=0;KFF=0 GT 0|1

When I grep these coordinates (plus or minus 1) from I get no hits:

cat | grep 59097308
cat | grep 59097307
cat | grep 59097309

cat | grep 59097977
cat | grep 59097976
cat | grep 59097978

cat | grep 59098781
cat | grep 59098780
cat | grep 59098782

Could it be that this newest diploid genome was generated from hg38 rather than hg19?

The problem isn’t with the link. We were able to download the files. The problem is that the coordinates in the downloaded VCF and map files don’t seem to match up. Please scroll down for my original description of the problem.

The .map files are actually another form of the .chain files.
The first coordinate you looked at (249172383) seems to be the start coordinate of the block that accounts for the coordinate shift due to an indel preceding it:

$grep -w ^249172383 -B 3
249112564 249038946 0
249112565 249038947 249057281
249172382 249098764 0
249172383 249098765 249117098

$grep -w 249172381 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249172381 . CT C . PASS MTD=bwa_gatk;KM=3.12;KFP=1;KFF=1 GT 0|1

There is a ‘T’ in the reference at 249172382 that is present in PAT, but not in MAT (thus, ‘0’ in the third column). This corresponds to a 1 bp block in REF (249172382) and PAT and a 0 bp block in MAT.
The next block starts after the deletion in MAT and the corresponding REF coordinate is 249172383.

Similarly, for your second example: it is a 2bp shift this time and so the new block starts at +3 bp in REF with respect to the corresponding indel coordinate:

$grep -w ^249193024 -B 3
249172382 249098764 0
249172383 249098765 249117098
249193022 249119404 0
249193024 249119406 249137737

$grep -w 249193021 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193021 . TAG T . PASS MTD=isaac_strelka,bwa_freebayes,bwa_platypus,bwa_gatk;KM=12.40;KFP=0;KFF=0 GT 0|1

The third example is similar to the first one:

$grep -w ^249193998 -B 3
249193022 249119404 0
249193024 249119406 249137737
249193997 249120379 0
249193998 249120380 249138710

$grep -w 249193996 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193996 . GA G . PASS MTD=bwa_freebayes,bwa_platypus,bwa_gatk;KM=5.81;KFP=0;KFF=0 GT 0|1

The opposite examples are all SNVs and do not cause coordinate shifts and thus aren’t present in the .map files.

The reference sequence used to create the personal genome is hg19 and the .fasta file can be found in the same archive: reference_callsets_files/reference_annotation/female.hg19.fasta.gz

I am also attaching a README file containing a brief description of the file format (from vcf2diploid,

README file for vcf2diploid tool distribution v_0.2.6a (19th Sept 2014)

This version of vcf2diploid integrates vcf2diploid_v_0.2.6 with generation of read depth for AlleleSeq filtering of potential SNPs residing in CNV locations.
There is also an additional option on vcf2diploid for an output folder.

The read-depth-file generator calculates a normalized read depth for a SNP in a 2000-bp window (+- 1000bp) around the SNP against an average depth computed
from the entire genome. The output generated can be directly used for the AlleleSeq pipeline.

Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M.
AlleleSeq: analysis of allele-specific expression and binding in a network framework.
Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54.

After installation/compilation of vcf2diploid, modify the file to run the following
1) vcf2diploid
2) vcf2snp
3) read-depth-file generator

USAGE: make -f makePersonalGenome DATA_DIR=/path/to/your/VCFdir OUTPUT_SAMPLE_NAME=NA12878
–currently DATA_DIR is set as the directory where the BAM and VCF files are kept and where the output directory is going to be.
–options can be modified in the file

Acknowledgement: The modifications in this version are created by R. Kitchen of the Gerstein Lab@Yale.

he following section describes vcf2diploid_v_0.2.6

1. Compilation
$ make

2. Running
java -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

where sample_id is the ID of individual whose genome is being constructed
(e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and
file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and
VCF files at a time. Splitting the whole genome in multiple files (e.g., with
one FASTA file per chromosome) reduces memory usage.
Amount of memory used by Java can be increased as follows

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

You can try the program by running ‘’ script in the ‘example’
directory. See also "Important notes" below.

3. Constructing personal annotation and splice-junction library

* Using chain file one can lift over annotation of reference genome to personal
haplotpes. This can be done with the liftOver tools

For example, to lift over Gencode annotation once can do

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt

* To construct personal splice-junction library(s) for RNAseq analysis one can
use RSEQtools (

Important notes

All characters between ‘>’ and first white space in FASTA header are used
internally as chromosome/sequence names. For instance, for the header

>chr1 human

vcf2diploid will upload the corresponding sequence into the memory under the
name ‘chr1’.
Chromosome/sequence names should be consistent between FASTA and VCF files but
omission of ‘chr’ at the beginning is allows, i.e. ‘chr1’ and ‘1’ are treated as
the same name.

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating paternal/maternal haplotype to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference

File formats:
* FASTA — see
* MAP file represents block with equivalent bases in all three haplotypes
(paternal, maternal and reference) by one record with indices of the first
bases in each haplotype. Non-equivalent bases are represented as separate
records with ‘0’ for haplotypes having non-equivalent base (see
clarification below).

Pat Mat Ref MAP format
X X X \
X X X > P1 M1 R1
X X – > P4 M4 0
X X – , > 0 M6 R4
– X X ‘ ,-> P6 M7 R5
X X X ‘

For question and comments contact: Alexej Abyzov (alexej.abyzov) and
Mark Gerstein (mark.gerstein)

Data from 1000-Genomes Allele-Specific Binding Paper

I came across your lab’s paper Chen et al, 2015 in Nature Communications on allele-specific binding and expression in 1000-Genomes-Project individuals, and was hoping to integrate that data with some analysis on DHSs.

I found the data available on the site, and found the list of SNPs with significant ASB and ASE, but was wondering if you had the total list of SNPs queried in a format similar to the ASB and ASE tables. If not, do you know what the easiest way to assemble that from the data available on the site is?

All heterozygous SNVs of the individuals were queried for ASB/ASE and are available in the VCF format from the 1000 GP site. SNVs with high enough read coverage to be able to detect ASB and ASE events (‘accessible SNVs’) are available from under (3) and (4), respectively. These files are in a similar format to the tables with significant ASB/ASE events.

AlleleSeq CNV file from CNVnator

I have enjoyed your papers on allele specificity and I have a question about
using AlleleSeq. I understand time is short and valuable and would very much
appreciate it. I am making the CNV files with format:

chrm snppos rd
1 52066 0.902113
1 695745 0.909802
1 742429 0.976435

for input to the AlleleSeq pipeline.

I am using the alleleSeq_cnvScript tool to convert the output from CNVnator
v0.3, into the required CNV file, and it appears to work. However, my
problem is that it is running prohibitively slowly.

My alignment BAM is from 1000 genomes Phase III low coverage WGS, and is
24GB in size.

If I process only Chromosome 1, I have a ROOT file of 156 MB. I have six
million SNPs in a SNV file of 171 MB.

The addRD program runs using only 4% of the 16GB of RAM I have available,
but will take many weeks to complete at the current rate.

The rate at which addRD runs slows down dramatically with time. Though I am
not proficient in c++, I examined the code to see if I could identify why it
is slowing with time. I guess it is due to the search through the ROOT file
for each window around each SNP. The search restarts from the beginning for
each SNP, and so as the SNP locations become further along the chromosome
this search takes longer. I imagine that a great deal of time could be saved
by initialising each search based on the previous search?

If this approach is not possible for me, please could you advise on whether
the following algorithm would be appropriate for input to AlleleSeq:

1) divide the BAM file into windows of with W and count the number of reads
in each window.

2) Calculate the mean Read Depth (perhaps as a function of GC content): mu

1) for each SNP in my SNV file:

Use bamtools to select the reads in the window of size W centred on
the SNP location and calculate the Read Depth

(perhaps correct RD for GC content)

Calculate the normalised read depth = RD/(2 mu L/W)

output the SNP location and normalised read depth to file.

Have you tried using the latest version of Personal Genome Constructor?

When generating the .cnv file suitable for AlleleSeq, the pipeline uses bedtools to get read depth around each hetSNP instead of CNVnator. From my experience, this doesn’t take more than a few hours on a ~100GB WGS .bam file (single thread, all chromosomes).

AlleleSeq question

I would like to use AlleleSeq with exome data but am unsure how to generate the .cnv file. If I supply a BAM from my exome sequencing experiment then won’t this produce some very low average coverage if CNVnator examines coverage across the genome? This will mean that my exonic variants will be excluded because AlleleSeq interprets them as being in a CNV… I assume I have to supply a FASTA containing only the target regions of the exome enrichment – is this true?

I haven’t used AlleleSeq with exome data (I will ask around some people in the lab and will probably write to you again next week), but I would try using the latest version of the pipeline: the 0.2.6a version of Personal Genome constructor ( doesn’t use CNVnator, instead the pipeline uses bedtools and calculates median read depth across +/-1000bp regions around snps. So if all snps are exonic, the read depth around each snp will be compared to the median value across all the other ones. This might bias snps located close to exon/intron junctions, if that happens, I would consider decreasing the window size.

…We discussed your question in the lab and believe that it is better not to perform the cnv identification step at all: there should not be many of them in the exome, and the edge effect could be avoided as well. The latest version of the alleleseq pipeline still requires a .cnv file to run. So, the easiest solution is to create the .cnv file with the list of all the snps (as in the .snp file) with rd=1.0:

chrm snppos rd
1 52066 1.0
1 695745 1.0
1 742429 1.0

questions about AlleleDB and genotyping

I am using your lab database AlleleDB for allelic imbalance SNPs of ChIP-seq data and I really find it a formidable resource for further research.

I noticed that many of the significant SNPs for allelic imbalance are SNPs that have only the reference allele in the ChIP-seq data. See the first 10 lines as an example from your file that can be downloaded from AlleleDB database :

chr start end TF_indiv_ASB ref alt cA cC cG cT p.binomial p.betabinomial
chr1 91604 91605 SA1_NA19099_ASB C T 0 45 0 0 5.68E-14 0.000376021311925
chr1 714018 714019 POL2_NA18505_ASB A G 2 0 19 0 0.000221252441406 0.003114747121148
chr1 714018 714019 POL2_NA19099_ASB A G 2 0 19 0 0.000221252441406 0.002559182427128
chr1 762484 762485 RPB2_NA11894_ASB C A 0 11 0 0 0.0009765625 0.004566586027278
chr1 762600 762601 RPB2_NA11894_ASB T C 0 12 0 0 0.00048828125 0.002968280917731
chr1 840035 840036 SA1_NA18951_ASB T G 0 0 1 168 2.44E-37 0.001745366786663
chr1 937397 937398 SA1_NA19099_ASB G A 0 0 81 0 8.27E-25 5.03E-05
chr1 1167795 1167796 RPB2_NA11894_ASB C T 0 26 0 0 2.98E-08 2.29E-05
chr1 1407371 1407372 RPB2_NA11831_ASB T C 0 0 0 41 9.09E-13 0.000260638330353
chr1 1551110 1551111 POL2_NA19099_ASB C G 0 30 0 0 1.86E-09 1.67E-06

If I understand well allelic imbalance, only SNPs with genotypes that are heterozygous are tested for allelic imbalance.
In this case, the first SNP in the above table (for instance) has a genotype C/T, but shows 45 reads with C and 0 reads with T. This is indeed a strong allelic imbalance. In your data, there are many examples like that.

However, I heard many biologists saying that cell lines are not stable and mutations accumulate over time.
Is it possible that the first SNP is actually homozygous C/C in the cells used for ChIP-seq experiment because of mutations that occurred during cell culture, given that the genotyping by 1000 Genome project and the ChIP-seq experiment might have been done at different times or by different labs?

What do you think?

Thank you for your interest in AlleleDB!
Right, only heterozygous SNVs are tested for the imbalance. We used the 1000G Project variants for these analyses, did not call them ourselves. Indeed, allelic calculations are very sensitive to genotype calls. Thus, a false-positive heterozygous SNV call, as well as a mutation in the cell-line as you described (not sure if there can be many of those), could lead to a false allelic imbalance call. If this is crucial for your analyses, a simple strategy would probably be filtering the call-set to keep only the SNVs where the ‘weaker’ allele is supported by at least one read, though this would remove some true strong imbalance calls as well.