AlleleDB: ChrX ASB data for CTCF?

Q:
I’m looking for data to tell me
whether there is ASB for CTCF at a handful of common SNPs on ChrX, but don’t
see ChrX in your database. Any chance you have this data anyway and could
look up a few common SNPs? Or know of another place I could find this info?

A:
Thank you for your interest in our AlleleDB resource! I am afraid the current AlleleDB version focuses on the autosomes only, and I haven’t worked on chr X data when I left. I have CCed Timur who has taken over the resource and might be able to provide more information/updates.

AlleleSeq CNV file from CNVnator

Q:
I have enjoyed your papers on allele specificity and I have a question about
using AlleleSeq. I understand time is short and valuable and would very much
appreciate it. I am making the CNV files with format:

chrm snppos rd
1 52066 0.902113
1 695745 0.909802
1 742429 0.976435

for input to the AlleleSeq pipeline.

I am using the alleleSeq_cnvScript tool to convert the output from CNVnator
v0.3, into the required CNV file, and it appears to work. However, my
problem is that it is running prohibitively slowly.

My alignment BAM is from 1000 genomes Phase III low coverage WGS, and is
24GB in size.

If I process only Chromosome 1, I have a ROOT file of 156 MB. I have six
million SNPs in a SNV file of 171 MB.

The addRD program runs using only 4% of the 16GB of RAM I have available,
but will take many weeks to complete at the current rate.

The rate at which addRD runs slows down dramatically with time. Though I am
not proficient in c++, I examined the code to see if I could identify why it
is slowing with time. I guess it is due to the search through the ROOT file
for each window around each SNP. The search restarts from the beginning for
each SNP, and so as the SNP locations become further along the chromosome
this search takes longer. I imagine that a great deal of time could be saved
by initialising each search based on the previous search?

If this approach is not possible for me, please could you advise on whether
the following algorithm would be appropriate for input to AlleleSeq:

1) divide the BAM file into windows of with W and count the number of reads
in each window.

2) Calculate the mean Read Depth (perhaps as a function of GC content): mu

1) for each SNP in my SNV file:

Use bamtools to select the reads in the window of size W centred on
the SNP location and calculate the Read Depth

(perhaps correct RD for GC content)

Calculate the normalised read depth = RD/(2 mu L/W)

output the SNP location and normalised read depth to file.

A:
Have you tried using the latest version of Personal Genome Constructor?

http://alleleseq.gersteinlab.org/vcf2diploid_v0.2.6a.zip

When generating the .cnv file suitable for AlleleSeq, the pipeline uses bedtools to get read depth around each hetSNP instead of CNVnator. From my experience, this doesn’t take more than a few hours on a ~100GB WGS .bam file (single thread, all chromosomes).

1000genomes allele proves japanese ancestry?

Q:
I’ve been messaging the
list of contributors to the 1000genomes project to help me confirm the data
from you project is the proof I need in my search for japanese ancestry, and
would really appreciate your approval on whether what I’ve found is the
breakthrough I’ve been searching for.

Thanks to the help of a population geneticist from china who has a good set
of japanese allele at their disposal, she was able to find two japanese
specific allele. One of which is rs184214090 AG, i carry this one. And if
you take a look at ensembl/1000genomes population frequency you will see
that the AG allele is only found in the japanese sample.

https://urldefense.proofpoint.com/v2/url?u=http-3A__www.ensembl.org_Homo-5Fsapiens_Variation_Population-3Fr-3D19-3A43873182-2D43874182-3Bv-3Drs184214090-3Bvdb-3Dvariation-3Bvf-3D44631311&d=AwIDAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=k13szPaSoN-FV15yTT5Guyx7KbJMywtSANe0quSPJ-Q&m=Ayzbfe4LKLVqd9jn_XPrPiw1lN7agPrK9CcA_FutJec&s=9eV37Lanwkxb0uLLB7rJVO9yKv7dDES_gDLtTxjjL68&e=

I also spoke with Mr saitou naruya who is a very well cited japanese
geneticist and he told me that if I can find an allele that is specific
restricted to japan this is a good indication which is known as "private
polymorphism". Can you please give a quick confirmation to this whether I’ve
got this all right?

A:
I agree with what Mr Saitou said about private polymorphism.

AlleleSeq question

Q:
I would like to use AlleleSeq with exome data but am unsure how to generate the .cnv file. If I supply a BAM from my exome sequencing experiment then won’t this produce some very low average coverage if CNVnator examines coverage across the genome? This will mean that my exonic variants will be excluded because AlleleSeq interprets them as being in a CNV… I assume I have to supply a FASTA containing only the target regions of the exome enrichment – is this true?

A:
I haven’t used AlleleSeq with exome data (I will ask around some people in the lab and will probably write to you again next week), but I would try using the latest version of the pipeline: the 0.2.6a version of Personal Genome constructor (http://alleleseq.gersteinlab.org/tools.html) doesn’t use CNVnator, instead the pipeline uses bedtools and calculates median read depth across +/-1000bp regions around snps. So if all snps are exonic, the read depth around each snp will be compared to the median value across all the other ones. This might bias snps located close to exon/intron junctions, if that happens, I would consider decreasing the window size.

…We discussed your question in the lab and believe that it is better not to perform the cnv identification step at all: there should not be many of them in the exome, and the edge effect could be avoided as well. The latest version of the alleleseq pipeline still requires a .cnv file to run. So, the easiest solution is to create the .cnv file with the list of all the snps (as in the .snp file) with rd=1.0:

chrm snppos rd
1 52066 1.0
1 695745 1.0
1 742429 1.0
….

questions about AlleleDB and genotyping

Q:
I am using your lab database AlleleDB for allelic imbalance SNPs of ChIP-seq data and I really find it a formidable resource for further research.

I noticed that many of the significant SNPs for allelic imbalance are SNPs that have only the reference allele in the ChIP-seq data. See the first 10 lines as an example from your file ASB.auto.v2.1.aug16.txt.tgz that can be downloaded from AlleleDB database :

chr start end TF_indiv_ASB ref alt cA cC cG cT p.binomial p.betabinomial
chr1 91604 91605 SA1_NA19099_ASB C T 0 45 0 0 5.68E-14 0.000376021311925
chr1 714018 714019 POL2_NA18505_ASB A G 2 0 19 0 0.000221252441406 0.003114747121148
chr1 714018 714019 POL2_NA19099_ASB A G 2 0 19 0 0.000221252441406 0.002559182427128
chr1 762484 762485 RPB2_NA11894_ASB C A 0 11 0 0 0.0009765625 0.004566586027278
chr1 762600 762601 RPB2_NA11894_ASB T C 0 12 0 0 0.00048828125 0.002968280917731
chr1 840035 840036 SA1_NA18951_ASB T G 0 0 1 168 2.44E-37 0.001745366786663
chr1 937397 937398 SA1_NA19099_ASB G A 0 0 81 0 8.27E-25 5.03E-05
chr1 1167795 1167796 RPB2_NA11894_ASB C T 0 26 0 0 2.98E-08 2.29E-05
chr1 1407371 1407372 RPB2_NA11831_ASB T C 0 0 0 41 9.09E-13 0.000260638330353
chr1 1551110 1551111 POL2_NA19099_ASB C G 0 30 0 0 1.86E-09 1.67E-06

If I understand well allelic imbalance, only SNPs with genotypes that are heterozygous are tested for allelic imbalance.
In this case, the first SNP in the above table (for instance) has a genotype C/T, but shows 45 reads with C and 0 reads with T. This is indeed a strong allelic imbalance. In your data, there are many examples like that.

However, I heard many biologists saying that cell lines are not stable and mutations accumulate over time.
Is it possible that the first SNP is actually homozygous C/C in the cells used for ChIP-seq experiment because of mutations that occurred during cell culture, given that the genotyping by 1000 Genome project and the ChIP-seq experiment might have been done at different times or by different labs?

What do you think?

A:
Thank you for your interest in AlleleDB!
Right, only heterozygous SNVs are tested for the imbalance. We used the 1000G Project variants for these analyses, did not call them ourselves. Indeed, allelic calculations are very sensitive to genotype calls. Thus, a false-positive heterozygous SNV call, as well as a mutation in the cell-line as you described (not sure if there can be many of those), could lead to a false allelic imbalance call. If this is crucial for your analyses, a simple strategy would probably be filtering the call-set to keep only the SNVs where the ‘weaker’ allele is supported by at least one read, though this would remove some true strong imbalance calls as well.

help in alleleseq – with data

Q1:
I am writing to you for asking help on AlleleSeq pipeline

We are very interested in using AlleleSeq in our analysis as part
of ENCODE, and has done the analysis on many ENCODE ChIP-seq data
sets.

However, in our recent small scale simulation study (we simulated
reads from a few dozen SNPs locations), we have consistently got
very unexpected results, which makes us think maybe we did
something
wrong in the installation and configuration of AlleleSeq.

I am wondering whether you or anyone else in the lab can help us
run AlleleSeq on our small toy data, so that we can compare the
results and find out what is going on.

It will just take you a few minutes, at most 30 minutes running
time, since we use the same SNP file and diploid sequence of
NA12878
in our simulation and no peak file is involved.

We will really appreciate your help. Thank you very much.

I will send you the simulated data in a separate email without cc
to so many people. Thanks.

To everyone else in the email list: please do not hesitate if you
have any comment on this issue. Thanks.

A1:
I took a look at the gz file you sent. It appears to be just the
reads. In order to run AlleleSeq, I would need:
– the set of snps you are using
– the personal genome you are using

Also, can you confirm which version of AlleleSeq you are using, and
where you obtained it (just so I’m sure I use the same code)

Can you tell me exactly how you invoked it?

What results did you expect, and what did you see?

Q2:
I double-checked the input data we used with the one who downloaded
it and what we used are in

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

[1]

and

http://alleleseq.gersteinlab.org/NA12878_diploid_dec16.2012.alleleseq.input.zip

[2]

Please let me know if it clarifies things. Thanks.

A2:
I’ve spent some time looking this over. Can you explain exactly how
the toy reads were generated? It appears to me that the snps in the
reads are in the wrong place, i.e. off by one.
For example, if I blat read number 59 , the mismatch appears at
reference position 671855, rather than the expected snp location
671856. See the attached jpeg. All of the coordinates used in
AlleleSeq are one based, EXCEPT for the output from bowtie, which
internally convert to 1 based.

Q3:
The sequences are extracted from a BSgenome object I created from the .fa
files of paternal and maternal sequences. The .fa files are the same as that
used in buidling the Bowtie indices.

The sequences are extracted without error, and the 3rd base-pair at 5′ end
is supposed to be the snp location.

I do not fully understand what you mean. Do you mean the SNPs are mis-placed
in the reads?

I am wondering whether one based/zero based coordinate could be a problem.
As long as the reads are identical to the sequence in the .fa file, the
coordinate difference shall not be a problem, right?

As a note, previously, another member in my group found out another issue in
AlleleSeq about the ordering in Bowtie output if multi cores are used for
alignment. We have fixed it in our own AlleleSeq pipeline and suggested the
change to Gerstein privately. I guess what you found is different. Thank you
again for helping us in this issue.

A3:
Can you check something for me? See if the definition for MAPS in
PIPELINE.mk you used to run matches the actual file names of the map files.
I don’t believe you ever sent your PIPELINE.mk to me. If you wouldn’t mind,
please send me the PIPELINE.mk that you are using, and a listing of the
contents of the directory where the genomes and map files are. It turns out
that, unfortunately, the current code fails silently (and badly) if the
names don’t match. I will need to fix that, and I’m hoping that is the
problem you ran into.

Secondly, I noticed that the snp calls from
NA12878_diploid_dec16.2012.alleleseq.input (NA12878.hiseq.snp.call) contain
Y snps. NA12878 is female, so that seems odd. It also causes the code to
break, because there is no map file for Y (reasonably enough).

When I addressed both of these issues, I got sensible output from your toy
input set (which I will attach).

Q4:
Thank you very much for bringing up the issue of Y snps breaking down the
pipeline.

Another member in my group was trying to install AlleleSeq, and failed to
run the make file. Then she worked out a .sh file to run AlleleSeq without
changing any code, and that is what I am using. Please see the attached for
the script, and the following for the file names in the folder of input
data.

On the other hand, I do not understand what you mean by failing when the
names do not match. I am not sure if it is the problem we have. Please see
the attached for our output of the .sh version of AlleleSeq on the toy
example. Thanks.

Qi

ls -R
.:
fa/ index/ map/ NA12878.hiseq.snp.call* NA12878.hiseq.snp.call.RDATA
rd_4327183snps_na12878_hg19.txt*

./fa:
maternal/ paternal/

./fa/maternal:
chr10_NA12878_maternal.fa chr16_NA12878_maternal.fa
chr21_NA12878_maternal.fa chr6_NA12878_maternal.fa maternal.ref.1.ebwt
chr11_NA12878_maternal.fa chr17_NA12878_maternal.fa
chr22_NA12878_maternal.fa chr7_NA12878_maternal.fa maternal.ref.2.ebwt
chr12_NA12878_maternal.fa chr18_NA12878_maternal.fa
chr2_NA12878_maternal.fa chr8_NA12878_maternal.fa maternal.ref.3.ebwt
chr13_NA12878_maternal.fa chr19_NA12878_maternal.fa
chr3_NA12878_maternal.fa chr9_NA12878_maternal.fa maternal.ref.4.ebwt
chr14_NA12878_maternal.fa chr1_NA12878_maternal.fa
chr4_NA12878_maternal.fa chrX_NA12878_maternal.fa maternal.ref.rev.1.ebwt
chr15_NA12878_maternal.fa chr20_NA12878_maternal.fa
chr5_NA12878_maternal.fa maternal.chain maternal.ref.rev.2.ebwt

./fa/paternal:
chr10_NA12878_paternal.fa chr16_NA12878_paternal.fa
chr21_NA12878_paternal.fa chr6_NA12878_paternal.fa paternal.ref.1.ebwt
chr11_NA12878_paternal.fa chr17_NA12878_paternal.fa
chr22_NA12878_paternal.fa chr7_NA12878_paternal.fa paternal.ref.2.ebwt
chr12_NA12878_paternal.fa chr18_NA12878_paternal.fa
chr2_NA12878_paternal.fa chr8_NA12878_paternal.fa paternal.ref.3.ebwt
chr13_NA12878_paternal.fa chr19_NA12878_paternal.fa
chr3_NA12878_paternal.fa chr9_NA12878_paternal.fa paternal.ref.4.ebwt
chr14_NA12878_paternal.fa chr1_NA12878_paternal.fa
chr4_NA12878_paternal.fa chrX_NA12878_paternal.fa paternal.ref.rev.1.ebwt
chr15_NA12878_paternal.fa chr20_NA12878_paternal.fa
chr5_NA12878_paternal.fa paternal.chain paternal.ref.rev.2.ebwt

./index:
maternal.ref.1.ebwt* maternal.ref.3.ebwt* maternal.ref.rev.1.ebwt*
paternal.ref.1.ebwt* paternal.ref.3.ebwt* paternal.ref.rev.1.ebwt*
maternal.ref.2.ebwt* maternal.ref.4.ebwt* maternal.ref.rev.2.ebwt*
paternal.ref.2.ebwt* paternal.ref.4.ebwt* paternal.ref.rev.2.ebwt*

./map:
chr10_NA12878.map* chr14_NA12878.map* chr18_NA12878.map*
chr21_NA12878.map* chr4_NA12878.map* chr8_NA12878.map*
chr11_NA12878.map* chr15_NA12878.map* chr19_NA12878.map*
chr22_NA12878.map* chr5_NA12878.map* chr9_NA12878.map*
chr12_NA12878.map* chr16_NA12878.map* chr1_NA12878.map*
chr2_NA12878.map* chr6_NA12878.map* chrX_NA12878.map*
chr13_NA12878.map* chr17_NA12878.map* chr20_NA12878.map*
chr3_NA12878.map* chr7_NA12878.map*

A4:
Thanks, that was very helpful. If you look at the shell script, you will
see this line:

MAPS=$BASE/inputdata/map/%s_NA12878.map

MAPS should be a filename pattern that matches the names of the map files.
In your case those files are named chr1_NA12878.map (notice the extra
"chr").

If you change the MAPS= line to:
MAPS=$BASE/inputdata/map/chr%s_NA12878.map

that will fix that problem. I really have to apologize for this; the code
should not silently ignore this mismatch.

You’ll then run into the problem I mentioned wrt the Y snps. If you filter
out the Y snps from the snp file NA12878.hiseq.snp.call

$ grep -v Y NA12878.hiseq.snp.call NA12878.hiseq.snp.call.NO_Y

And then use that for your snp file throughout, I think you will be all set.
Please let me know if that is the case.

Question about using AlleleSeq tool

Q:

I am a student doing a research project in allele-specific expression,
and am planning to use your lab’s AlleleSeq tool.

I am trying to use the YRI Population HapMap data. I went and tried to
find phased YRI trio data (from 1000 Genomes) to input into the
vcf2diploid tool. Unfortunately, I found data that includes only the
parent ID’s, but not the child’s. Since I don’t have the child data, I
am unable to use the AlleleSeq pipeline.

I was wondering if you could give me some suggestions on how to do ASE
given only the parental data.

A:
Thank you for your interest in the AlleleSeq pipeline.

The AlleleSeq pipeline assumes that the ‘child’ is the subject in which you
are trying to find ASE. Hence the genotypes of the subject are required.