ASE analysis within your article (A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals)

I am writing to your regarding the code to analyse ASE in RNA-seq data present within this Article, specifically the beta diversity evaluation and application test. I was wondering if the code is available, I would like to apply it compare samples.

The code for calling allele-specific sites is available at

The specific scripts for the beta-binomial test are

Question about liftover from pat/mat to ref

I’ve used your AlleleSeq package to get NA12878 diploid genome alignment. I have two questions:

1. How can I liftover from paternal/maternal alignments to hg19 coordinates? Based on the documents, the map files generated by vcf2diploid are used to convert from hg19 to pat/mat, not the other direction. Last year I found an R script "" from your website and used that to convert from pat/mat to hg19. However I cannot find it anymore. Do you have an updated script for this liftover? If "" is still valid to use, what reference should we cite in our manuscripts?

2. I noticed that both the "" script in AlleleSeq and the R script "" assume bowtie format in input alignment files. Do you have upgraded versions that are compatible with SAM format?

1. We use the chainSwap tool ( to flip the maternal.chain and paternal.chain files (generated by vcf2diploid) to the other direction. Then a .bed file in the parental coordinates can be converted into a .bed file in the reference coordinates using the UCSC liftOver tool and the swapped chain files.

This approach is straightforward for any arbitrary interval, but if you are converting alignments, it may require additional scripting or using other tools. We’ve received a few issues/questions with the script you are mentioning and since we didn’t develop it and aren’t currently planning to maintain, we removed the link from the web-site for now.

2. We do not have updated released versions yet, but this is one of the things that will be introduced in future versions and thus I might be able to help. What exactly are you interested in: is it just compatibility with SAM-formatted bowtie1 (i.e. ungapped alignments) output or are you trying to use another aligner?

Thanks for your detailed answers. They are very helpful.
We used bowtie2 or tophat2 as the aligners for DNA or RNA.
Currently we hope to liftover the pat/may alignments to hg19 for downstream analyses (e.g. HiC or DEG). We can’t compare results if they are not liftovered to the same reference.
Could I take the mapped position in sam, convert it to bed, liftover to hg19, and then chip this new position back to the sam? Do you have a better solution?

The start coordinates of the reads can be transferred this way, but I don’t think inserting them back to the sam file will produce a correct sam/bam file overall: some of the other fields (CIGAR string, mismatches, etc) will need to be adjusted as well. Depending on the further analysis this may be an issue. CrossMap seems to work with sam/bam and chain files. I have never used it though.

Maybe it is easier to do the analysis on the parental alignments? Say, for DEG in order to generate read counts table, one might consider transferring the annotation to mat and pat coordinates. Then for every exon/gene extract the reads mapped to it in both alignments and use the number of unique read-ids as the gene read count.

Where does the merge script come in your approach? Do you want to merge the alignments before transferring them to hg19 and then do further analyses (which I am guessing, do not involve looking into allele-specific counts)?

AllelSeq Info

I am trying to find Allel specific expression from a hybrid mouse RNA-Seq. I
have the complete genome of the two parental strains and the corresponding
SNPs and Indels as vcf format. I don’t have a complete genome and genotype
for the hybrid mouse.

I am looking to your AllelSeq tools you have published and attempted to
create a custom genome using vcf2diploid_v0.2.6a. Would you please guild me
OR forward to someone in your group who will guide me if the tools will be
useful in such context? I see the tool was developed with Trios.

AlleleSeq/vcf2diploid can be used both with trios and with single samples.
Either way, you would need to have a .vcf for the sample (hybrid mouse) in order to use the tools though.

ENCODE analysis with NA12878 genome

I seem to remember that you were doing some ENCODE analysis using an actual
NA12878 genome instead of the human reference GRCh37 (or whatever what the
current version at the time). Am I remembering this correctly? Was there
ever a comparison published that showed the benefits of using the actual
NA12878 genome versus the reference?

yes to all above!

this is was in the encode phase 2 manuscript (we did the fig.)
Also, there were detailed follow up analyses in

Data from 1000-Genomes Allele-Specific Binding Paper

I came across your lab’s paper Chen et al, 2015 in Nature Communications on allele-specific binding and expression in 1000-Genomes-Project individuals, and was hoping to integrate that data with some analysis on DHSs.

I found the data available on the site, and found the list of SNPs with significant ASB and ASE, but was wondering if you had the total list of SNPs queried in a format similar to the ASB and ASE tables. If not, do you know what the easiest way to assemble that from the data available on the site is?

All heterozygous SNVs of the individuals were queried for ASB/ASE and are available in the VCF format from the 1000 GP site. SNVs with high enough read coverage to be able to detect ASB and ASE events (‘accessible SNVs’) are available from under (3) and (4), respectively. These files are in a similar format to the tables with significant ASB/ASE events.

AlleleDB: ChrX ASB data for CTCF?

I’m looking for data to tell me
whether there is ASB for CTCF at a handful of common SNPs on ChrX, but don’t
see ChrX in your database. Any chance you have this data anyway and could
look up a few common SNPs? Or know of another place I could find this info?

Thank you for your interest in our AlleleDB resource! I am afraid the current AlleleDB version focuses on the autosomes only, and I haven’t worked on chr X data when I left. I have CCed Timur who has taken over the resource and might be able to provide more information/updates.

AlleleSeq CNV file from CNVnator

I have enjoyed your papers on allele specificity and I have a question about
using AlleleSeq. I understand time is short and valuable and would very much
appreciate it. I am making the CNV files with format:

chrm snppos rd
1 52066 0.902113
1 695745 0.909802
1 742429 0.976435

for input to the AlleleSeq pipeline.

I am using the alleleSeq_cnvScript tool to convert the output from CNVnator
v0.3, into the required CNV file, and it appears to work. However, my
problem is that it is running prohibitively slowly.

My alignment BAM is from 1000 genomes Phase III low coverage WGS, and is
24GB in size.

If I process only Chromosome 1, I have a ROOT file of 156 MB. I have six
million SNPs in a SNV file of 171 MB.

The addRD program runs using only 4% of the 16GB of RAM I have available,
but will take many weeks to complete at the current rate.

The rate at which addRD runs slows down dramatically with time. Though I am
not proficient in c++, I examined the code to see if I could identify why it
is slowing with time. I guess it is due to the search through the ROOT file
for each window around each SNP. The search restarts from the beginning for
each SNP, and so as the SNP locations become further along the chromosome
this search takes longer. I imagine that a great deal of time could be saved
by initialising each search based on the previous search?

If this approach is not possible for me, please could you advise on whether
the following algorithm would be appropriate for input to AlleleSeq:

1) divide the BAM file into windows of with W and count the number of reads
in each window.

2) Calculate the mean Read Depth (perhaps as a function of GC content): mu

1) for each SNP in my SNV file:

Use bamtools to select the reads in the window of size W centred on
the SNP location and calculate the Read Depth

(perhaps correct RD for GC content)

Calculate the normalised read depth = RD/(2 mu L/W)

output the SNP location and normalised read depth to file.

Have you tried using the latest version of Personal Genome Constructor?

When generating the .cnv file suitable for AlleleSeq, the pipeline uses bedtools to get read depth around each hetSNP instead of CNVnator. From my experience, this doesn’t take more than a few hours on a ~100GB WGS .bam file (single thread, all chromosomes).

1000genomes allele proves japanese ancestry?

I’ve been messaging the
list of contributors to the 1000genomes project to help me confirm the data
from you project is the proof I need in my search for japanese ancestry, and
would really appreciate your approval on whether what I’ve found is the
breakthrough I’ve been searching for.

Thanks to the help of a population geneticist from china who has a good set
of japanese allele at their disposal, she was able to find two japanese
specific allele. One of which is rs184214090 AG, i carry this one. And if
you take a look at ensembl/1000genomes population frequency you will see
that the AG allele is only found in the japanese sample.

I also spoke with Mr saitou naruya who is a very well cited japanese
geneticist and he told me that if I can find an allele that is specific
restricted to japan this is a good indication which is known as "private
polymorphism". Can you please give a quick confirmation to this whether I’ve
got this all right?

I agree with what Mr Saitou said about private polymorphism.