questions about AlleleDB and genotyping

Q:
I am using your lab database AlleleDB for allelic imbalance SNPs of ChIP-seq data and I really find it a formidable resource for further research.

I noticed that many of the significant SNPs for allelic imbalance are SNPs that have only the reference allele in the ChIP-seq data. See the first 10 lines as an example from your file ASB.auto.v2.1.aug16.txt.tgz that can be downloaded from AlleleDB database :

chr start end TF_indiv_ASB ref alt cA cC cG cT p.binomial p.betabinomial
chr1 91604 91605 SA1_NA19099_ASB C T 0 45 0 0 5.68E-14 0.000376021311925
chr1 714018 714019 POL2_NA18505_ASB A G 2 0 19 0 0.000221252441406 0.003114747121148
chr1 714018 714019 POL2_NA19099_ASB A G 2 0 19 0 0.000221252441406 0.002559182427128
chr1 762484 762485 RPB2_NA11894_ASB C A 0 11 0 0 0.0009765625 0.004566586027278
chr1 762600 762601 RPB2_NA11894_ASB T C 0 12 0 0 0.00048828125 0.002968280917731
chr1 840035 840036 SA1_NA18951_ASB T G 0 0 1 168 2.44E-37 0.001745366786663
chr1 937397 937398 SA1_NA19099_ASB G A 0 0 81 0 8.27E-25 5.03E-05
chr1 1167795 1167796 RPB2_NA11894_ASB C T 0 26 0 0 2.98E-08 2.29E-05
chr1 1407371 1407372 RPB2_NA11831_ASB T C 0 0 0 41 9.09E-13 0.000260638330353
chr1 1551110 1551111 POL2_NA19099_ASB C G 0 30 0 0 1.86E-09 1.67E-06

If I understand well allelic imbalance, only SNPs with genotypes that are heterozygous are tested for allelic imbalance.
In this case, the first SNP in the above table (for instance) has a genotype C/T, but shows 45 reads with C and 0 reads with T. This is indeed a strong allelic imbalance. In your data, there are many examples like that.

However, I heard many biologists saying that cell lines are not stable and mutations accumulate over time.
Is it possible that the first SNP is actually homozygous C/C in the cells used for ChIP-seq experiment because of mutations that occurred during cell culture, given that the genotyping by 1000 Genome project and the ChIP-seq experiment might have been done at different times or by different labs?

What do you think?

A:
Thank you for your interest in AlleleDB!
Right, only heterozygous SNVs are tested for the imbalance. We used the 1000G Project variants for these analyses, did not call them ourselves. Indeed, allelic calculations are very sensitive to genotype calls. Thus, a false-positive heterozygous SNV call, as well as a mutation in the cell-line as you described (not sure if there can be many of those), could lead to a false allelic imbalance call. If this is crucial for your analyses, a simple strategy would probably be filtering the call-set to keep only the SNVs where the ‘weaker’ allele is supported by at least one read, though this would remove some true strong imbalance calls as well.

help in alleleseq – with data

Q1:
I am writing to you for asking help on AlleleSeq pipeline

We are very interested in using AlleleSeq in our analysis as part
of ENCODE, and has done the analysis on many ENCODE ChIP-seq data
sets.

However, in our recent small scale simulation study (we simulated
reads from a few dozen SNPs locations), we have consistently got
very unexpected results, which makes us think maybe we did
something
wrong in the installation and configuration of AlleleSeq.

I am wondering whether you or anyone else in the lab can help us
run AlleleSeq on our small toy data, so that we can compare the
results and find out what is going on.

It will just take you a few minutes, at most 30 minutes running
time, since we use the same SNP file and diploid sequence of
NA12878
in our simulation and no peak file is involved.

We will really appreciate your help. Thank you very much.

I will send you the simulated data in a separate email without cc
to so many people. Thanks.

To everyone else in the email list: please do not hesitate if you
have any comment on this issue. Thanks.

A1:
I took a look at the gz file you sent. It appears to be just the
reads. In order to run AlleleSeq, I would need:
– the set of snps you are using
– the personal genome you are using

Also, can you confirm which version of AlleleSeq you are using, and
where you obtained it (just so I’m sure I use the same code)

Can you tell me exactly how you invoked it?

What results did you expect, and what did you see?

Q2:
I double-checked the input data we used with the one who downloaded
it and what we used are in

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

[1]

and

http://alleleseq.gersteinlab.org/NA12878_diploid_dec16.2012.alleleseq.input.zip

[2]

Please let me know if it clarifies things. Thanks.

A2:
I’ve spent some time looking this over. Can you explain exactly how
the toy reads were generated? It appears to me that the snps in the
reads are in the wrong place, i.e. off by one.
For example, if I blat read number 59 , the mismatch appears at
reference position 671855, rather than the expected snp location
671856. See the attached jpeg. All of the coordinates used in
AlleleSeq are one based, EXCEPT for the output from bowtie, which
internally convert to 1 based.

Q3:
The sequences are extracted from a BSgenome object I created from the .fa
files of paternal and maternal sequences. The .fa files are the same as that
used in buidling the Bowtie indices.

The sequences are extracted without error, and the 3rd base-pair at 5′ end
is supposed to be the snp location.

I do not fully understand what you mean. Do you mean the SNPs are mis-placed
in the reads?

I am wondering whether one based/zero based coordinate could be a problem.
As long as the reads are identical to the sequence in the .fa file, the
coordinate difference shall not be a problem, right?

As a note, previously, another member in my group found out another issue in
AlleleSeq about the ordering in Bowtie output if multi cores are used for
alignment. We have fixed it in our own AlleleSeq pipeline and suggested the
change to Gerstein privately. I guess what you found is different. Thank you
again for helping us in this issue.

A3:
Can you check something for me? See if the definition for MAPS in
PIPELINE.mk you used to run matches the actual file names of the map files.
I don’t believe you ever sent your PIPELINE.mk to me. If you wouldn’t mind,
please send me the PIPELINE.mk that you are using, and a listing of the
contents of the directory where the genomes and map files are. It turns out
that, unfortunately, the current code fails silently (and badly) if the
names don’t match. I will need to fix that, and I’m hoping that is the
problem you ran into.

Secondly, I noticed that the snp calls from
NA12878_diploid_dec16.2012.alleleseq.input (NA12878.hiseq.snp.call) contain
Y snps. NA12878 is female, so that seems odd. It also causes the code to
break, because there is no map file for Y (reasonably enough).

When I addressed both of these issues, I got sensible output from your toy
input set (which I will attach).

Q4:
Thank you very much for bringing up the issue of Y snps breaking down the
pipeline.

Another member in my group was trying to install AlleleSeq, and failed to
run the make file. Then she worked out a .sh file to run AlleleSeq without
changing any code, and that is what I am using. Please see the attached for
the script, and the following for the file names in the folder of input
data.

On the other hand, I do not understand what you mean by failing when the
names do not match. I am not sure if it is the problem we have. Please see
the attached for our output of the .sh version of AlleleSeq on the toy
example. Thanks.

Qi

ls -R
.:
fa/ index/ map/ NA12878.hiseq.snp.call* NA12878.hiseq.snp.call.RDATA
rd_4327183snps_na12878_hg19.txt*

./fa:
maternal/ paternal/

./fa/maternal:
chr10_NA12878_maternal.fa chr16_NA12878_maternal.fa
chr21_NA12878_maternal.fa chr6_NA12878_maternal.fa maternal.ref.1.ebwt
chr11_NA12878_maternal.fa chr17_NA12878_maternal.fa
chr22_NA12878_maternal.fa chr7_NA12878_maternal.fa maternal.ref.2.ebwt
chr12_NA12878_maternal.fa chr18_NA12878_maternal.fa
chr2_NA12878_maternal.fa chr8_NA12878_maternal.fa maternal.ref.3.ebwt
chr13_NA12878_maternal.fa chr19_NA12878_maternal.fa
chr3_NA12878_maternal.fa chr9_NA12878_maternal.fa maternal.ref.4.ebwt
chr14_NA12878_maternal.fa chr1_NA12878_maternal.fa
chr4_NA12878_maternal.fa chrX_NA12878_maternal.fa maternal.ref.rev.1.ebwt
chr15_NA12878_maternal.fa chr20_NA12878_maternal.fa
chr5_NA12878_maternal.fa maternal.chain maternal.ref.rev.2.ebwt

./fa/paternal:
chr10_NA12878_paternal.fa chr16_NA12878_paternal.fa
chr21_NA12878_paternal.fa chr6_NA12878_paternal.fa paternal.ref.1.ebwt
chr11_NA12878_paternal.fa chr17_NA12878_paternal.fa
chr22_NA12878_paternal.fa chr7_NA12878_paternal.fa paternal.ref.2.ebwt
chr12_NA12878_paternal.fa chr18_NA12878_paternal.fa
chr2_NA12878_paternal.fa chr8_NA12878_paternal.fa paternal.ref.3.ebwt
chr13_NA12878_paternal.fa chr19_NA12878_paternal.fa
chr3_NA12878_paternal.fa chr9_NA12878_paternal.fa paternal.ref.4.ebwt
chr14_NA12878_paternal.fa chr1_NA12878_paternal.fa
chr4_NA12878_paternal.fa chrX_NA12878_paternal.fa paternal.ref.rev.1.ebwt
chr15_NA12878_paternal.fa chr20_NA12878_paternal.fa
chr5_NA12878_paternal.fa paternal.chain paternal.ref.rev.2.ebwt

./index:
maternal.ref.1.ebwt* maternal.ref.3.ebwt* maternal.ref.rev.1.ebwt*
paternal.ref.1.ebwt* paternal.ref.3.ebwt* paternal.ref.rev.1.ebwt*
maternal.ref.2.ebwt* maternal.ref.4.ebwt* maternal.ref.rev.2.ebwt*
paternal.ref.2.ebwt* paternal.ref.4.ebwt* paternal.ref.rev.2.ebwt*

./map:
chr10_NA12878.map* chr14_NA12878.map* chr18_NA12878.map*
chr21_NA12878.map* chr4_NA12878.map* chr8_NA12878.map*
chr11_NA12878.map* chr15_NA12878.map* chr19_NA12878.map*
chr22_NA12878.map* chr5_NA12878.map* chr9_NA12878.map*
chr12_NA12878.map* chr16_NA12878.map* chr1_NA12878.map*
chr2_NA12878.map* chr6_NA12878.map* chrX_NA12878.map*
chr13_NA12878.map* chr17_NA12878.map* chr20_NA12878.map*
chr3_NA12878.map* chr7_NA12878.map*

A4:
Thanks, that was very helpful. If you look at the shell script, you will
see this line:

MAPS=$BASE/inputdata/map/%s_NA12878.map

MAPS should be a filename pattern that matches the names of the map files.
In your case those files are named chr1_NA12878.map (notice the extra
"chr").

If you change the MAPS= line to:
MAPS=$BASE/inputdata/map/chr%s_NA12878.map

that will fix that problem. I really have to apologize for this; the code
should not silently ignore this mismatch.

You’ll then run into the problem I mentioned wrt the Y snps. If you filter
out the Y snps from the snp file NA12878.hiseq.snp.call

$ grep -v Y NA12878.hiseq.snp.call NA12878.hiseq.snp.call.NO_Y

And then use that for your snp file throughout, I think you will be all set.
Please let me know if that is the case.

Question about using AlleleSeq tool

Q:

I am a student doing a research project in allele-specific expression,
and am planning to use your lab’s AlleleSeq tool.

I am trying to use the YRI Population HapMap data. I went and tried to
find phased YRI trio data (from 1000 Genomes) to input into the
vcf2diploid tool. Unfortunately, I found data that includes only the
parent ID’s, but not the child’s. Since I don’t have the child data, I
am unable to use the AlleleSeq pipeline.

I was wondering if you could give me some suggestions on how to do ASE
given only the parental data.

A:
Thank you for your interest in the AlleleSeq pipeline.

The AlleleSeq pipeline assumes that the ‘child’ is the subject in which you
are trying to find ASE. Hence the genotypes of the subject are required.

GM12878 Updated Phased Trancriptome

Q:
I noticed that you have an updated version of the GM12878 diploid genome
on the AlleleSeq webpage, but not an updated version of the transcriptome.
Would you be willing to upload an update for the transcriptome as well?
Also, are there major differences between the versions?

I actually now see how to generate the transcriptome myself using
vcf2diploid and the vcf file mentioned in the README (which I hadn’t noticed
before). Do you believe this vcf
(ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.2/hg19/CEUTrio.HiSeq.WGS.b37.bestPractices.phased.hg19.vcf.gz)
is the most accurate?

I’m actually trying to use liftover with the chain files from the newest diploid genotypes you created to update an annotation. However, the annotation ends up being empty. Have you experienced a similar problem? I’m using the hg19 UCSC annotation. Any help you can provide would be MUCH appreciated!

A:
Your chain file does not include ‘chr’ in front of the reference names.

Single Cell Allele Specific Expression and AlleleSeq

Q:
We have developed a method to perform single cell allele specific expression by labeling individual mRNA molecules with single base specificity in situ. In other words, we can differentially label and detect transcripts with a single SNP difference with our fluorescent probes.

As part of our validation, we were happy to find your published diploid genome of GM12878, and used it to design probes for various candidate genes to see what patterns of allelic imbalance we could see at a single cell level. We were hoping that you might be able to provide us with some form of quantification for the allelic imbalance (such as # of reads that aligned to the maternal versus paternal allele) for some of the genes on your list of "Genes with allele specific expression" that you published on your website, such as SKA and SUZ12. We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.

I look forward to hearing from you as being able to relate our single cell measurements with the genomic measurements is an exciting prospect, and I thank you again for having made available such a useful resource.

A:
thanks for interest to AlleleSeq.
Allelic imbalances can be inferred for each SNP from NA12878_AS_SNPs.vcf file. The file has counts for ref and alt alleles. Is it what you are asking?

We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.
If you point us to few such cases we would be happy to look and resolve the inconsistencies. However, my feeling is that, it is probably reflects differences in gene annotation in different databases.

Your paper sounds very cool. I would be happy to read it when it comes out or even before that (if you consider that possible).