Pseudogene ontology

Q:
I’m currently working on reasoning with ontologies, and found in big
interest in reading your paper about reasoning of the pseudogene set.
I’m particularly interested in the time you give for the reasoning to
complete, since performance is my biggest issue. So I wanted to ask
you what was the size of the data you used, since the pseudogene set
does not seem to be available online anymore.

A:
The pseudogene data set used in this paper can be found at:
http://www.pseudogene.org/sdpgenes/all_pgenes/, which is based on
Ensembl genome release 48. There are 2,294 duplicated pseudogenes and
10,187 processed pseudogenes, in chromosome 1-22, X and Y. Hope this
helps.

PEMer for commercial use

Q:
I work for Novartis Institutes For Biomedical Research Inc, in Cambridge, MA, a commercial entity.

Could you use your PEMer software ? ( I remember you allowed me to try your translocation detection software in 2011, but I did not archive that email in 2011?

A:
this is fine.

no output files from the multichain server

Q:
One of my friends recommended Molmovdb to me to calculate the morph conformation. As a test, I have successfully generated the morphing conformation from the single chain server.

Then I tried to use the "multichain server" to generate the morphing conformations of my target proteins last Friday. Everything goes well. However, I did not receive any mails reminding me the progress until today. So I am wondering if I should wait more days.

Just in case, I submitted the same job again today. The job No. is b337528-18153. Would you like to give me a favor to check the progress?

A:
Thank you very much for your query, and for your interest in our server. We have been having problems with our server not sending emails, and we have a notice on our website notifying users about these email issues.

However, the good news is that you should still be able to access morphs. The best approach would be to just append your job ID to the standard URL. Thus, in your case, this would be:

http://www.molmovdb.org/cgi-bin/morph.cgi?ID=b337528-18153

If you are still unable to see your morphs, you may email me the structures, and I will try to generate them for you through our server.

Request for data in PLoS Computational Biology paper: Construction and Analysis of an Integrated Regulatory Network Derived from High-Throughput Sequencing Data

Q:
I am currently working on
a network science project studying properties of heterogenous networks and greatly intrigued by your 2011 paper in PLoS
Computational Biology:

Construction and Analysis of an Integrated Regulatory Network Derived from High-Throughput Sequencing Data

I am planning to employ the integrated human TF-miRNA-gene regulatory network constructed in your work to verify the
utility of information flow – based techniques in understanding the mapping between network topology and function. However,
the full network does not appear to be available in the supplementary information. I am writing to kindly ask if I could obtain
a copy of the human network data (e.g. CSV format edge list) for my research. I would be more than honored to be able to use
the original dataset in my work, and my apologies if it is against your plans to disclose it or it is available somewhere else that
I am not aware of. Thank you very much!

A:
you can certainly get the network.

The data behind it and a closely related network is available from :

http://papers.gersteinlab.org/papers/mirnet

http://papers.gersteinlab.org/papers/wormawg

(see website links)

meaning of CCA result

Q:
Just now, I have send you an Email, and ask the question of how to plot
CCA structural correlations figure. Now I have gotten that figure (please
see the attachment), But I can’t see anything from this figure, I only know
blue triangle is represent environment factor, and red diamond is for
metabolic pathway. But I don’t know what does this specified red diamond
represent which pathway, and what is for this triangle?

A:
you might find my lecture on the subject useful:
http://lectures.gersteinlab.org/summary/Networks-201102240-i0at10+kitp/

http://archive.gersteinlab.org/mark/out/log/2014/02.02/class/
from
http://www.gersteinlab.org/courses/452/

help in alleleseq – with data

Q1:
I am writing to you for asking help on AlleleSeq pipeline

We are very interested in using AlleleSeq in our analysis as part
of ENCODE, and has done the analysis on many ENCODE ChIP-seq data
sets.

However, in our recent small scale simulation study (we simulated
reads from a few dozen SNPs locations), we have consistently got
very unexpected results, which makes us think maybe we did
something
wrong in the installation and configuration of AlleleSeq.

I am wondering whether you or anyone else in the lab can help us
run AlleleSeq on our small toy data, so that we can compare the
results and find out what is going on.

It will just take you a few minutes, at most 30 minutes running
time, since we use the same SNP file and diploid sequence of
NA12878
in our simulation and no peak file is involved.

We will really appreciate your help. Thank you very much.

I will send you the simulated data in a separate email without cc
to so many people. Thanks.

To everyone else in the email list: please do not hesitate if you
have any comment on this issue. Thanks.

A1:
I took a look at the gz file you sent. It appears to be just the
reads. In order to run AlleleSeq, I would need:
– the set of snps you are using
– the personal genome you are using

Also, can you confirm which version of AlleleSeq you are using, and
where you obtained it (just so I’m sure I use the same code)

Can you tell me exactly how you invoked it?

What results did you expect, and what did you see?

Q2:
I double-checked the input data we used with the one who downloaded
it and what we used are in

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

[1]

and

http://alleleseq.gersteinlab.org/NA12878_diploid_dec16.2012.alleleseq.input.zip

[2]

Please let me know if it clarifies things. Thanks.

A2:
I’ve spent some time looking this over. Can you explain exactly how
the toy reads were generated? It appears to me that the snps in the
reads are in the wrong place, i.e. off by one.
For example, if I blat read number 59 , the mismatch appears at
reference position 671855, rather than the expected snp location
671856. See the attached jpeg. All of the coordinates used in
AlleleSeq are one based, EXCEPT for the output from bowtie, which
internally convert to 1 based.

Q3:
The sequences are extracted from a BSgenome object I created from the .fa
files of paternal and maternal sequences. The .fa files are the same as that
used in buidling the Bowtie indices.

The sequences are extracted without error, and the 3rd base-pair at 5′ end
is supposed to be the snp location.

I do not fully understand what you mean. Do you mean the SNPs are mis-placed
in the reads?

I am wondering whether one based/zero based coordinate could be a problem.
As long as the reads are identical to the sequence in the .fa file, the
coordinate difference shall not be a problem, right?

As a note, previously, another member in my group found out another issue in
AlleleSeq about the ordering in Bowtie output if multi cores are used for
alignment. We have fixed it in our own AlleleSeq pipeline and suggested the
change to Gerstein privately. I guess what you found is different. Thank you
again for helping us in this issue.

A3:
Can you check something for me? See if the definition for MAPS in
PIPELINE.mk you used to run matches the actual file names of the map files.
I don’t believe you ever sent your PIPELINE.mk to me. If you wouldn’t mind,
please send me the PIPELINE.mk that you are using, and a listing of the
contents of the directory where the genomes and map files are. It turns out
that, unfortunately, the current code fails silently (and badly) if the
names don’t match. I will need to fix that, and I’m hoping that is the
problem you ran into.

Secondly, I noticed that the snp calls from
NA12878_diploid_dec16.2012.alleleseq.input (NA12878.hiseq.snp.call) contain
Y snps. NA12878 is female, so that seems odd. It also causes the code to
break, because there is no map file for Y (reasonably enough).

When I addressed both of these issues, I got sensible output from your toy
input set (which I will attach).

Q4:
Thank you very much for bringing up the issue of Y snps breaking down the
pipeline.

Another member in my group was trying to install AlleleSeq, and failed to
run the make file. Then she worked out a .sh file to run AlleleSeq without
changing any code, and that is what I am using. Please see the attached for
the script, and the following for the file names in the folder of input
data.

On the other hand, I do not understand what you mean by failing when the
names do not match. I am not sure if it is the problem we have. Please see
the attached for our output of the .sh version of AlleleSeq on the toy
example. Thanks.

Qi

ls -R
.:
fa/ index/ map/ NA12878.hiseq.snp.call* NA12878.hiseq.snp.call.RDATA
rd_4327183snps_na12878_hg19.txt*

./fa:
maternal/ paternal/

./fa/maternal:
chr10_NA12878_maternal.fa chr16_NA12878_maternal.fa
chr21_NA12878_maternal.fa chr6_NA12878_maternal.fa maternal.ref.1.ebwt
chr11_NA12878_maternal.fa chr17_NA12878_maternal.fa
chr22_NA12878_maternal.fa chr7_NA12878_maternal.fa maternal.ref.2.ebwt
chr12_NA12878_maternal.fa chr18_NA12878_maternal.fa
chr2_NA12878_maternal.fa chr8_NA12878_maternal.fa maternal.ref.3.ebwt
chr13_NA12878_maternal.fa chr19_NA12878_maternal.fa
chr3_NA12878_maternal.fa chr9_NA12878_maternal.fa maternal.ref.4.ebwt
chr14_NA12878_maternal.fa chr1_NA12878_maternal.fa
chr4_NA12878_maternal.fa chrX_NA12878_maternal.fa maternal.ref.rev.1.ebwt
chr15_NA12878_maternal.fa chr20_NA12878_maternal.fa
chr5_NA12878_maternal.fa maternal.chain maternal.ref.rev.2.ebwt

./fa/paternal:
chr10_NA12878_paternal.fa chr16_NA12878_paternal.fa
chr21_NA12878_paternal.fa chr6_NA12878_paternal.fa paternal.ref.1.ebwt
chr11_NA12878_paternal.fa chr17_NA12878_paternal.fa
chr22_NA12878_paternal.fa chr7_NA12878_paternal.fa paternal.ref.2.ebwt
chr12_NA12878_paternal.fa chr18_NA12878_paternal.fa
chr2_NA12878_paternal.fa chr8_NA12878_paternal.fa paternal.ref.3.ebwt
chr13_NA12878_paternal.fa chr19_NA12878_paternal.fa
chr3_NA12878_paternal.fa chr9_NA12878_paternal.fa paternal.ref.4.ebwt
chr14_NA12878_paternal.fa chr1_NA12878_paternal.fa
chr4_NA12878_paternal.fa chrX_NA12878_paternal.fa paternal.ref.rev.1.ebwt
chr15_NA12878_paternal.fa chr20_NA12878_paternal.fa
chr5_NA12878_paternal.fa paternal.chain paternal.ref.rev.2.ebwt

./index:
maternal.ref.1.ebwt* maternal.ref.3.ebwt* maternal.ref.rev.1.ebwt*
paternal.ref.1.ebwt* paternal.ref.3.ebwt* paternal.ref.rev.1.ebwt*
maternal.ref.2.ebwt* maternal.ref.4.ebwt* maternal.ref.rev.2.ebwt*
paternal.ref.2.ebwt* paternal.ref.4.ebwt* paternal.ref.rev.2.ebwt*

./map:
chr10_NA12878.map* chr14_NA12878.map* chr18_NA12878.map*
chr21_NA12878.map* chr4_NA12878.map* chr8_NA12878.map*
chr11_NA12878.map* chr15_NA12878.map* chr19_NA12878.map*
chr22_NA12878.map* chr5_NA12878.map* chr9_NA12878.map*
chr12_NA12878.map* chr16_NA12878.map* chr1_NA12878.map*
chr2_NA12878.map* chr6_NA12878.map* chrX_NA12878.map*
chr13_NA12878.map* chr17_NA12878.map* chr20_NA12878.map*
chr3_NA12878.map* chr7_NA12878.map*

A4:
Thanks, that was very helpful. If you look at the shell script, you will
see this line:

MAPS=$BASE/inputdata/map/%s_NA12878.map

MAPS should be a filename pattern that matches the names of the map files.
In your case those files are named chr1_NA12878.map (notice the extra
"chr").

If you change the MAPS= line to:
MAPS=$BASE/inputdata/map/chr%s_NA12878.map

that will fix that problem. I really have to apologize for this; the code
should not silently ignore this mismatch.

You’ll then run into the problem I mentioned wrt the Y snps. If you filter
out the Y snps from the snp file NA12878.hiseq.snp.call

$ grep -v Y NA12878.hiseq.snp.call NA12878.hiseq.snp.call.NO_Y

And then use that for your snp file throughout, I think you will be all set.
Please let me know if that is the case.