help in alleleseq – with data

Q1:
I am writing to you for asking help on AlleleSeq pipeline

We are very interested in using AlleleSeq in our analysis as part
of ENCODE, and has done the analysis on many ENCODE ChIP-seq data
sets.

However, in our recent small scale simulation study (we simulated
reads from a few dozen SNPs locations), we have consistently got
very unexpected results, which makes us think maybe we did
something
wrong in the installation and configuration of AlleleSeq.

I am wondering whether you or anyone else in the lab can help us
run AlleleSeq on our small toy data, so that we can compare the
results and find out what is going on.

It will just take you a few minutes, at most 30 minutes running
time, since we use the same SNP file and diploid sequence of
NA12878
in our simulation and no peak file is involved.

We will really appreciate your help. Thank you very much.

I will send you the simulated data in a separate email without cc
to so many people. Thanks.

To everyone else in the email list: please do not hesitate if you
have any comment on this issue. Thanks.

A1:
I took a look at the gz file you sent. It appears to be just the
reads. In order to run AlleleSeq, I would need:
– the set of snps you are using
– the personal genome you are using

Also, can you confirm which version of AlleleSeq you are using, and
where you obtained it (just so I’m sure I use the same code)

Can you tell me exactly how you invoked it?

What results did you expect, and what did you see?

Q2:
I double-checked the input data we used with the one who downloaded
it and what we used are in

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

[1]

and

http://alleleseq.gersteinlab.org/NA12878_diploid_dec16.2012.alleleseq.input.zip

[2]

Please let me know if it clarifies things. Thanks.

A2:
I’ve spent some time looking this over. Can you explain exactly how
the toy reads were generated? It appears to me that the snps in the
reads are in the wrong place, i.e. off by one.
For example, if I blat read number 59 , the mismatch appears at
reference position 671855, rather than the expected snp location
671856. See the attached jpeg. All of the coordinates used in
AlleleSeq are one based, EXCEPT for the output from bowtie, which
internally convert to 1 based.

Q3:
The sequences are extracted from a BSgenome object I created from the .fa
files of paternal and maternal sequences. The .fa files are the same as that
used in buidling the Bowtie indices.

The sequences are extracted without error, and the 3rd base-pair at 5′ end
is supposed to be the snp location.

I do not fully understand what you mean. Do you mean the SNPs are mis-placed
in the reads?

I am wondering whether one based/zero based coordinate could be a problem.
As long as the reads are identical to the sequence in the .fa file, the
coordinate difference shall not be a problem, right?

As a note, previously, another member in my group found out another issue in
AlleleSeq about the ordering in Bowtie output if multi cores are used for
alignment. We have fixed it in our own AlleleSeq pipeline and suggested the
change to Gerstein privately. I guess what you found is different. Thank you
again for helping us in this issue.

A3:
Can you check something for me? See if the definition for MAPS in
PIPELINE.mk you used to run matches the actual file names of the map files.
I don’t believe you ever sent your PIPELINE.mk to me. If you wouldn’t mind,
please send me the PIPELINE.mk that you are using, and a listing of the
contents of the directory where the genomes and map files are. It turns out
that, unfortunately, the current code fails silently (and badly) if the
names don’t match. I will need to fix that, and I’m hoping that is the
problem you ran into.

Secondly, I noticed that the snp calls from
NA12878_diploid_dec16.2012.alleleseq.input (NA12878.hiseq.snp.call) contain
Y snps. NA12878 is female, so that seems odd. It also causes the code to
break, because there is no map file for Y (reasonably enough).

When I addressed both of these issues, I got sensible output from your toy
input set (which I will attach).

Q4:
Thank you very much for bringing up the issue of Y snps breaking down the
pipeline.

Another member in my group was trying to install AlleleSeq, and failed to
run the make file. Then she worked out a .sh file to run AlleleSeq without
changing any code, and that is what I am using. Please see the attached for
the script, and the following for the file names in the folder of input
data.

On the other hand, I do not understand what you mean by failing when the
names do not match. I am not sure if it is the problem we have. Please see
the attached for our output of the .sh version of AlleleSeq on the toy
example. Thanks.

Qi

ls -R
.:
fa/ index/ map/ NA12878.hiseq.snp.call* NA12878.hiseq.snp.call.RDATA
rd_4327183snps_na12878_hg19.txt*

./fa:
maternal/ paternal/

./fa/maternal:
chr10_NA12878_maternal.fa chr16_NA12878_maternal.fa
chr21_NA12878_maternal.fa chr6_NA12878_maternal.fa maternal.ref.1.ebwt
chr11_NA12878_maternal.fa chr17_NA12878_maternal.fa
chr22_NA12878_maternal.fa chr7_NA12878_maternal.fa maternal.ref.2.ebwt
chr12_NA12878_maternal.fa chr18_NA12878_maternal.fa
chr2_NA12878_maternal.fa chr8_NA12878_maternal.fa maternal.ref.3.ebwt
chr13_NA12878_maternal.fa chr19_NA12878_maternal.fa
chr3_NA12878_maternal.fa chr9_NA12878_maternal.fa maternal.ref.4.ebwt
chr14_NA12878_maternal.fa chr1_NA12878_maternal.fa
chr4_NA12878_maternal.fa chrX_NA12878_maternal.fa maternal.ref.rev.1.ebwt
chr15_NA12878_maternal.fa chr20_NA12878_maternal.fa
chr5_NA12878_maternal.fa maternal.chain maternal.ref.rev.2.ebwt

./fa/paternal:
chr10_NA12878_paternal.fa chr16_NA12878_paternal.fa
chr21_NA12878_paternal.fa chr6_NA12878_paternal.fa paternal.ref.1.ebwt
chr11_NA12878_paternal.fa chr17_NA12878_paternal.fa
chr22_NA12878_paternal.fa chr7_NA12878_paternal.fa paternal.ref.2.ebwt
chr12_NA12878_paternal.fa chr18_NA12878_paternal.fa
chr2_NA12878_paternal.fa chr8_NA12878_paternal.fa paternal.ref.3.ebwt
chr13_NA12878_paternal.fa chr19_NA12878_paternal.fa
chr3_NA12878_paternal.fa chr9_NA12878_paternal.fa paternal.ref.4.ebwt
chr14_NA12878_paternal.fa chr1_NA12878_paternal.fa
chr4_NA12878_paternal.fa chrX_NA12878_paternal.fa paternal.ref.rev.1.ebwt
chr15_NA12878_paternal.fa chr20_NA12878_paternal.fa
chr5_NA12878_paternal.fa paternal.chain paternal.ref.rev.2.ebwt

./index:
maternal.ref.1.ebwt* maternal.ref.3.ebwt* maternal.ref.rev.1.ebwt*
paternal.ref.1.ebwt* paternal.ref.3.ebwt* paternal.ref.rev.1.ebwt*
maternal.ref.2.ebwt* maternal.ref.4.ebwt* maternal.ref.rev.2.ebwt*
paternal.ref.2.ebwt* paternal.ref.4.ebwt* paternal.ref.rev.2.ebwt*

./map:
chr10_NA12878.map* chr14_NA12878.map* chr18_NA12878.map*
chr21_NA12878.map* chr4_NA12878.map* chr8_NA12878.map*
chr11_NA12878.map* chr15_NA12878.map* chr19_NA12878.map*
chr22_NA12878.map* chr5_NA12878.map* chr9_NA12878.map*
chr12_NA12878.map* chr16_NA12878.map* chr1_NA12878.map*
chr2_NA12878.map* chr6_NA12878.map* chrX_NA12878.map*
chr13_NA12878.map* chr17_NA12878.map* chr20_NA12878.map*
chr3_NA12878.map* chr7_NA12878.map*

A4:
Thanks, that was very helpful. If you look at the shell script, you will
see this line:

MAPS=$BASE/inputdata/map/%s_NA12878.map

MAPS should be a filename pattern that matches the names of the map files.
In your case those files are named chr1_NA12878.map (notice the extra
"chr").

If you change the MAPS= line to:
MAPS=$BASE/inputdata/map/chr%s_NA12878.map

that will fix that problem. I really have to apologize for this; the code
should not silently ignore this mismatch.

You’ll then run into the problem I mentioned wrt the Y snps. If you filter
out the Y snps from the snp file NA12878.hiseq.snp.call

$ grep -v Y NA12878.hiseq.snp.call NA12878.hiseq.snp.call.NO_Y

And then use that for your snp file throughout, I think you will be all set.
Please let me know if that is the case.

Question about using AlleleSeq tool

Q:

I am a student doing a research project in allele-specific expression,
and am planning to use your lab’s AlleleSeq tool.

I am trying to use the YRI Population HapMap data. I went and tried to
find phased YRI trio data (from 1000 Genomes) to input into the
vcf2diploid tool. Unfortunately, I found data that includes only the
parent ID’s, but not the child’s. Since I don’t have the child data, I
am unable to use the AlleleSeq pipeline.

I was wondering if you could give me some suggestions on how to do ASE
given only the parental data.

A:
Thank you for your interest in the AlleleSeq pipeline.

The AlleleSeq pipeline assumes that the ‘child’ is the subject in which you
are trying to find ASE. Hence the genotypes of the subject are required.

Chain files in AlleleSeq

Q1:
We are very interested in your AlleleSeq package. I have downloaded the maternal.chain and paternal.chain from http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip<http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip.

From the documentation @http://info.gersteinlab.org/AlleleSeq:
Chain files
Using the chain file, one can use the LifeOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.

[liuh@helix NA12878_diploid_genome_dec16_2013]$ more maternal.chain
chain 249198044 1 249250621 + 0 249250621 1_maternal 249242013 + 0 249242013 1
10329 1 0
109 1 0
30199 3 0
43187 4 0
40 1 0

My bed file:
[liuh@helix bcf3]$ awk ‘{print $1,$2,$3}’ cyto.vcf.bed |head
chr1 14541 14542
chr1 14652 14653
chr1 14676 14677
chr1 14906 14907
chr1 14929 14930
chr1 15014 15015
chr1 16287 16288
chr1 16297 16298
chr1 16377 16378
chr1 16494 16495

My script:
module load ucsc; liftOver cyto.vcf.bed maternal.chain cyto.liftover unMapped.cyto

I tried to liftOver my bed from hg19 to hg18 without any problem. It means
that the bed file format should not have any issue.

A1:
thanks for interest to our software.
I may be wrong but it looks to me that the liftOver failed because of using different chromosome naming convention in .bed and .chain files.
In .bed file chromosomes are named with prefix ‘chr’, while in chain files they don’t have such prefix.

Q2:
Thanks so much for your time and kind help! It works well after I removed prefix ‘chr’ in bed file.

Do you have chain file(s) which can convert the annotation coordinates from maternal or paternal genome to reference genome?

I have mpileup and VarScan outputs with the annotation coordinates from maternal and paternal genome, and want to annotate it using annovar (but the annotation coordinates from the reference is required).

A2:
Incidentally, a previous user of AlleleSeq looking for conversion of mat/pat
to ref genome has given us an R code for public consumption. The only gripe
is we have not had time to test it extensively, hence we did not provide it
on the website previously.

I have the link here: http://alleleseq.gersteinlab.org/tools.html. The R
script converts mat/pat back to ref genome using the chain files provided by
AlleleSeq. Please see if it suits your purpose.

GM12878 Updated Phased Trancriptome

Q:
I noticed that you have an updated version of the GM12878 diploid genome
on the AlleleSeq webpage, but not an updated version of the transcriptome.
Would you be willing to upload an update for the transcriptome as well?
Also, are there major differences between the versions?

I actually now see how to generate the transcriptome myself using
vcf2diploid and the vcf file mentioned in the README (which I hadn’t noticed
before). Do you believe this vcf
(ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.2/hg19/CEUTrio.HiSeq.WGS.b37.bestPractices.phased.hg19.vcf.gz)
is the most accurate?

I’m actually trying to use liftover with the chain files from the newest diploid genotypes you created to update an annotation. However, the annotation ends up being empty. Have you experienced a similar problem? I’m using the hg19 UCSC annotation. Any help you can provide would be MUCH appreciated!

A:
Your chain file does not include ‘chr’ in front of the reference names.