help in alleleseq – with data

Q1:
I am writing to you for asking help on AlleleSeq pipeline

We are very interested in using AlleleSeq in our analysis as part
of ENCODE, and has done the analysis on many ENCODE ChIP-seq data
sets.

However, in our recent small scale simulation study (we simulated
reads from a few dozen SNPs locations), we have consistently got
very unexpected results, which makes us think maybe we did
something
wrong in the installation and configuration of AlleleSeq.

I am wondering whether you or anyone else in the lab can help us
run AlleleSeq on our small toy data, so that we can compare the
results and find out what is going on.

It will just take you a few minutes, at most 30 minutes running
time, since we use the same SNP file and diploid sequence of
NA12878
in our simulation and no peak file is involved.

We will really appreciate your help. Thank you very much.

I will send you the simulated data in a separate email without cc
to so many people. Thanks.

To everyone else in the email list: please do not hesitate if you
have any comment on this issue. Thanks.

A1:
I took a look at the gz file you sent. It appears to be just the
reads. In order to run AlleleSeq, I would need:
– the set of snps you are using
– the personal genome you are using

Also, can you confirm which version of AlleleSeq you are using, and
where you obtained it (just so I’m sure I use the same code)

Can you tell me exactly how you invoked it?

What results did you expect, and what did you see?

Q2:
I double-checked the input data we used with the one who downloaded
it and what we used are in

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

[1]

and

http://alleleseq.gersteinlab.org/NA12878_diploid_dec16.2012.alleleseq.input.zip

[2]

Please let me know if it clarifies things. Thanks.

A2:
I’ve spent some time looking this over. Can you explain exactly how
the toy reads were generated? It appears to me that the snps in the
reads are in the wrong place, i.e. off by one.
For example, if I blat read number 59 , the mismatch appears at
reference position 671855, rather than the expected snp location
671856. See the attached jpeg. All of the coordinates used in
AlleleSeq are one based, EXCEPT for the output from bowtie, which
internally convert to 1 based.

Q3:
The sequences are extracted from a BSgenome object I created from the .fa
files of paternal and maternal sequences. The .fa files are the same as that
used in buidling the Bowtie indices.

The sequences are extracted without error, and the 3rd base-pair at 5′ end
is supposed to be the snp location.

I do not fully understand what you mean. Do you mean the SNPs are mis-placed
in the reads?

I am wondering whether one based/zero based coordinate could be a problem.
As long as the reads are identical to the sequence in the .fa file, the
coordinate difference shall not be a problem, right?

As a note, previously, another member in my group found out another issue in
AlleleSeq about the ordering in Bowtie output if multi cores are used for
alignment. We have fixed it in our own AlleleSeq pipeline and suggested the
change to Gerstein privately. I guess what you found is different. Thank you
again for helping us in this issue.

A3:
Can you check something for me? See if the definition for MAPS in
PIPELINE.mk you used to run matches the actual file names of the map files.
I don’t believe you ever sent your PIPELINE.mk to me. If you wouldn’t mind,
please send me the PIPELINE.mk that you are using, and a listing of the
contents of the directory where the genomes and map files are. It turns out
that, unfortunately, the current code fails silently (and badly) if the
names don’t match. I will need to fix that, and I’m hoping that is the
problem you ran into.

Secondly, I noticed that the snp calls from
NA12878_diploid_dec16.2012.alleleseq.input (NA12878.hiseq.snp.call) contain
Y snps. NA12878 is female, so that seems odd. It also causes the code to
break, because there is no map file for Y (reasonably enough).

When I addressed both of these issues, I got sensible output from your toy
input set (which I will attach).

Q4:
Thank you very much for bringing up the issue of Y snps breaking down the
pipeline.

Another member in my group was trying to install AlleleSeq, and failed to
run the make file. Then she worked out a .sh file to run AlleleSeq without
changing any code, and that is what I am using. Please see the attached for
the script, and the following for the file names in the folder of input
data.

On the other hand, I do not understand what you mean by failing when the
names do not match. I am not sure if it is the problem we have. Please see
the attached for our output of the .sh version of AlleleSeq on the toy
example. Thanks.

Qi

ls -R
.:
fa/ index/ map/ NA12878.hiseq.snp.call* NA12878.hiseq.snp.call.RDATA
rd_4327183snps_na12878_hg19.txt*

./fa:
maternal/ paternal/

./fa/maternal:
chr10_NA12878_maternal.fa chr16_NA12878_maternal.fa
chr21_NA12878_maternal.fa chr6_NA12878_maternal.fa maternal.ref.1.ebwt
chr11_NA12878_maternal.fa chr17_NA12878_maternal.fa
chr22_NA12878_maternal.fa chr7_NA12878_maternal.fa maternal.ref.2.ebwt
chr12_NA12878_maternal.fa chr18_NA12878_maternal.fa
chr2_NA12878_maternal.fa chr8_NA12878_maternal.fa maternal.ref.3.ebwt
chr13_NA12878_maternal.fa chr19_NA12878_maternal.fa
chr3_NA12878_maternal.fa chr9_NA12878_maternal.fa maternal.ref.4.ebwt
chr14_NA12878_maternal.fa chr1_NA12878_maternal.fa
chr4_NA12878_maternal.fa chrX_NA12878_maternal.fa maternal.ref.rev.1.ebwt
chr15_NA12878_maternal.fa chr20_NA12878_maternal.fa
chr5_NA12878_maternal.fa maternal.chain maternal.ref.rev.2.ebwt

./fa/paternal:
chr10_NA12878_paternal.fa chr16_NA12878_paternal.fa
chr21_NA12878_paternal.fa chr6_NA12878_paternal.fa paternal.ref.1.ebwt
chr11_NA12878_paternal.fa chr17_NA12878_paternal.fa
chr22_NA12878_paternal.fa chr7_NA12878_paternal.fa paternal.ref.2.ebwt
chr12_NA12878_paternal.fa chr18_NA12878_paternal.fa
chr2_NA12878_paternal.fa chr8_NA12878_paternal.fa paternal.ref.3.ebwt
chr13_NA12878_paternal.fa chr19_NA12878_paternal.fa
chr3_NA12878_paternal.fa chr9_NA12878_paternal.fa paternal.ref.4.ebwt
chr14_NA12878_paternal.fa chr1_NA12878_paternal.fa
chr4_NA12878_paternal.fa chrX_NA12878_paternal.fa paternal.ref.rev.1.ebwt
chr15_NA12878_paternal.fa chr20_NA12878_paternal.fa
chr5_NA12878_paternal.fa paternal.chain paternal.ref.rev.2.ebwt

./index:
maternal.ref.1.ebwt* maternal.ref.3.ebwt* maternal.ref.rev.1.ebwt*
paternal.ref.1.ebwt* paternal.ref.3.ebwt* paternal.ref.rev.1.ebwt*
maternal.ref.2.ebwt* maternal.ref.4.ebwt* maternal.ref.rev.2.ebwt*
paternal.ref.2.ebwt* paternal.ref.4.ebwt* paternal.ref.rev.2.ebwt*

./map:
chr10_NA12878.map* chr14_NA12878.map* chr18_NA12878.map*
chr21_NA12878.map* chr4_NA12878.map* chr8_NA12878.map*
chr11_NA12878.map* chr15_NA12878.map* chr19_NA12878.map*
chr22_NA12878.map* chr5_NA12878.map* chr9_NA12878.map*
chr12_NA12878.map* chr16_NA12878.map* chr1_NA12878.map*
chr2_NA12878.map* chr6_NA12878.map* chrX_NA12878.map*
chr13_NA12878.map* chr17_NA12878.map* chr20_NA12878.map*
chr3_NA12878.map* chr7_NA12878.map*

A4:
Thanks, that was very helpful. If you look at the shell script, you will
see this line:

MAPS=$BASE/inputdata/map/%s_NA12878.map

MAPS should be a filename pattern that matches the names of the map files.
In your case those files are named chr1_NA12878.map (notice the extra
"chr").

If you change the MAPS= line to:
MAPS=$BASE/inputdata/map/chr%s_NA12878.map

that will fix that problem. I really have to apologize for this; the code
should not silently ignore this mismatch.

You’ll then run into the problem I mentioned wrt the Y snps. If you filter
out the Y snps from the snp file NA12878.hiseq.snp.call

$ grep -v Y NA12878.hiseq.snp.call NA12878.hiseq.snp.call.NO_Y

And then use that for your snp file throughout, I think you will be all set.
Please let me know if that is the case.

Leave a comment