molmovdb job 014670-22217

Q:
I submitted a job to your morph server but after 4-5 days it is not yet completed. Could you please check if there is a problem or if I made a mistake in my submission?

I just gave two PDB codes for different conformations of maltose-binding protein (MBP). The two codes were 1OMP and 3MBP. They are both monomers and have the same number of residues, but 3MBP has a ligand bound.

A:
Indeed, it appears that the issue has to do with PDB format irregularities. We have corrected these issues, and your morph may be viewed by clicking the link below (please use Safari to view morphs, as Chrome and Firefox no longer support Java):

http://www.molmovdb.org/cgi-bin/morph.cgi?ID=057837-877

Feel free to let us know if you experience any further difficulties. Also, if you like, we’d be happy to send you all of the accessory files associated with this morph.

molmovdb.org reboot?

Q:
Still getting this message after several days.

The job 072540-24927 is not yet completed

The two files were 1ohu chain A

1ty4 chain A

A:
It appears that the issue has to do with PDB format irregularities. Specifically, the sequences within the ATOM fields do not match the residues reported in the PDB file’s SEQRES field. In any case, we have corrected these, and your morph may be viewed by clicking the link below (please use Safari to view morphs, as Chrome and Firefox no longer support Java):

http://www.molmovdb.org/cgi-bin/morph.cgi?ID=056703-32536
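
If you would like to screen your own files for this kind of SEQRES/ATOM mismatch before submitting, the following is a minimal Python sketch. It is not the check the morph server itself performs; the function names are made up for illustration, and it assumes standard fixed-column PDB records. Because ATOM records often omit disordered residues, it only tests that the ATOM residues of the first model appear, in order, within the SEQRES sequence of the same chain.

```python
from collections import defaultdict


def seqres_by_chain(lines):
    """Collect the three-letter residue names listed in SEQRES, per chain."""
    seq = defaultdict(list)
    for line in lines:
        if line.startswith("SEQRES"):
            chain = line[11]                      # column 12: chain ID
            seq[chain].extend(line[19:].split())  # columns 20+: residue names
    return dict(seq)


def atom_residues_by_chain(lines):
    """Collect residue names in the order they occur in ATOM records (first model only)."""
    seq = defaultdict(list)
    seen = set()
    for line in lines:
        if line.startswith("ENDMDL"):
            break                                 # ignore later models
        if line.startswith("ATOM"):
            chain = line[21]                      # column 22: chain ID
            key = (chain, line[22:27])            # residue number + insertion code
            if key not in seen:
                seen.add(key)
                seq[chain].append(line[17:20].strip())
    return dict(seq)


def is_subsequence(short, full):
    """True if the items of `short` occur in `full` in the same order (gaps allowed)."""
    remaining = iter(full)
    return all(item in remaining for item in short)


def chains_with_mismatch(lines):
    """Return chain IDs whose ATOM residues cannot be aligned to SEQRES in order."""
    lines = list(lines)
    seqres = seqres_by_chain(lines)
    atoms = atom_residues_by_chain(lines)
    return sorted(
        chain for chain in atoms
        if chain in seqres and not is_subsequence(atoms[chain], seqres[chain])
    )

# Example use on a local file (the path is hypothetical):
#     with open("1ohu.pdb") as handle:
#         print(chains_with_mismatch(handle))
```

An empty list means no mismatch was detected by this simple test; it does not guarantee that the file is free of other format irregularities.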

OrthoClust questions

Q:
I am contacting you regarding the OrthoClust program that your group has on GitHub, and I have a couple of questions about how to apply the program to new datasets. First, how was the co-appearance matrix calculated from the OrthoClust output? Second, is it necessary to modify the initial number of spin states (q) or the coupling constant (k) parameters that were used in the 2014 Genome Biology paper? I am not able to find these options in the current release and am wondering where these values can be changed in the code.

A:
The current implementation on GitHub is based on a heuristic, rather than the simulated annealing method used in the 2014 Genome Biology paper. The initial number of spin states q is no longer a parameter you have to supply. It is set to the total number of nodes in the system. As explained in the readme, the coupling constant k is supplied in one of the input files (the coupling information file). It should be the 3rd column of the file. In my example (the ortho_info file found in the data folder), the third column is all 1, meaning k=1.
For the co-appearance matrix, notice that the output file is a tab-delimited file which consists of three columns. The 1st and 2nd columns are the species id and the gene id given by the input files. The 3rd column is a module id. Suppose there are N1 genes in species 1 and N2 genes in species 2; the co-appearance matrix then has dimensions (N1+N2) by (N1+N2). One should build a map from the genes in the individual species to indices running from 1 to (N1+N2). Suppose there are n genes in module 1; then all pairwise combinations of these n genes should be marked as 1 in the corresponding matrix elements.
One output file can be used to make a co-appearance matrix (with only 0 and 1). If you have multiple output files from multiple runs of the algorithm, you will arrive at a final co-appearance matrix like the one shown in the Genome Biology paper by adding the results together. Of course, in order to make a plot like the heat map shown in the paper, one has to further perform clustering to arrange the rows and columns.
If you use Julia, I may be able to send you a little script.
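
In case it helps in the meantime, here is a rough Python sketch of the procedure described above. It is an illustration only, not the script used for the paper. The function names are hypothetical, indices run from 0 rather than 1, and the diagonal (each gene paired with itself) is also set to 1, which is an assumption. Each run is given as a list of (species id, gene id, module id) rows, i.e. the three tab-delimited columns of an output file.

```python
from itertools import product


def build_gene_index(rows):
    """Map each (species id, gene id) pair to a row/column index, in order of appearance."""
    index = {}
    for species, gene, _module in rows:
        index.setdefault((species, gene), len(index))
    return index


def coappearance_matrix(rows, index):
    """0/1 matrix for one run: entry [i][j] is 1 if genes i and j share a module."""
    n = len(index)
    matrix = [[0] * n for _ in range(n)]
    modules = {}
    for species, gene, module in rows:
        modules.setdefault(module, []).append(index[(species, gene)])
    for members in modules.values():
        # Mark every pair of genes in the module (including i == j).
        for i, j in product(members, repeat=2):
            matrix[i][j] = 1
    return matrix


def add_matrices(a, b):
    """Element-wise sum, used to accumulate the matrices from several runs."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def read_rows(path):
    """Read the three tab-delimited columns of one output file."""
    with open(path) as handle:
        return [tuple(line.rstrip("\n").split("\t")[:3]) for line in handle if line.strip()]
```

The same gene index must be reused for every run so that the matrices line up before they are added. The summed matrix would still need to be clustered to order its rows and columns for a heat map.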

Using LARVA and FunSeq2 for variant analysis

Q:
I have read your articles describing FunSeq2 and LARVA. I
find these two frameworks to be the most complete and well-adapted and so, I
am very interested in using them for my analysis. I have installed both
tools and started to run them following the instructions in the
documentation, but I am still encountering a few problems.

First, I have run the web-based version of FunSeq2 on several of my VCF
files and it seems to return the wanted result, with around 10,000+ entries
for each sample. However, when running the tool on the same files in command
line (with the -nc option), I obtain a different result, with no significant
entries returned.

The output returned is:

… Input format check : vcf …
… Format ok …
… Start filtering SNVs with minor allele frequency = 0 …
Warning: sample Sample1 – no SNVs left after filtering against natrual
variations …

I receive a similar result when attempting to run the program on multiple
files at once (both in command line and on the web).

I am also trying to use LARVA on these files; I have managed to install the
tool and I am currently testing it using the example-variants-1.txt file
from the regression suite as the variant file, but the program returns
“Segmentation Fault: 11” with no other error message.

Therefore, I would like to know if you have encountered these errors before
and if so, please let me know about any steps that I can try to correct
them.

A:
I’m glad to hear that you’ve decided to use LARVA for your analyses. I investigated the LARVA codebase to try to figure out what might be causing the segmentation fault. One thing I found was that one of the helper programs (bigWigAverageOverBed) is provided only in its 64-bit Linux version, so if you run LARVA on a different type of system (e.g. a Mac), it won’t work. There are versions for other operating systems here (at the end of the page), but for simplicity we only provided the 64-bit Linux version. If that doesn’t fix the issue, could you please tell me everything you can about the environment in which you’re running LARVA (CPU, RAM, operating system, etc.) and the command line parameters you used?

Also, for help with FunSeq2, I refer you to my colleague, Shake Lou (cc’ed).

One more thing I just thought of: how are all your input files formatted?

As for the issue with FunSeq2, here are some suggestions:

1. The FunSeq web server version is obsolete, and we recommend that you use the GitHub version.
2. The latest version, 2.1.6, fixes a bug that could cause some variants to be missing from the output.
3. Please use BED format as the output format. I will update the VCF format output later.
4. You can also try funseq3.gersteinlab.org, where we have pre-calculated the score of each position in the hg19 genome. If you have a large number of variants to query, we have more good news: we are also testing a rich-format whole-genome FunSeq output file that will let you retrieve the FunSeq annotation directly from the command line. If you are interested in this file, we can give you a pre-release copy for testing once it has passed our internal QC, which should be very soon.

Question re rice pseudogene

Q:
I am using your rice pseudogene dataset for some analysis. However, I found that you did not mention which rice genome version you used, so I cannot anchor the pseudogenes to the genome I am using. Could you please provide this information?

A:
As I understand it, you are using the pseudogenes described in this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2708354/ . The Data source section of the paper states that the annotations were done on the rice genome version 5 from TIGR. You can find all the information regarding rice genome version 5 at ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa .

TIP for Mouse ChIP-seq TF profiles

Q:
I am interested in using your probabilistic method (TIP) on some mouse ChIP-seq data. I downloaded the code (http://archive.gersteinlab.org/proj/tftarget/), but I see that the code is very specific to the human genome. As you did some analysis on the mouse TFs as well in the paper, I was wondering if you have a modified version of the code for the mouse genome? I would appreciate any kind of help. Thank you!

A:
Our collaborator set up a web server to run TIP at http://syslab3.nchu.edu.tw/TIP/
You can choose to upload your data and have it run on the server. If you have many ChIP-seq files, you can download the package from the website and run it on your own computer.

Loregic – further validation

Q:
I’ve been trying to apply the Loregic algorithm to other organisms in order to further validate the method. However, I’m finding some inconsistencies that could be related to data manipulation (choosing datasets, merging and mean-centering samples).
Furthermore, I’ve also found these inconsistencies when trying to reproduce the analysis of the yeast datasets provided in your publication (probably due to the same data manipulation issues described above).

Would you be able to provide a more in-depth protocol for using Loregic with multiple datasets (how you handled the data, for example) in order to improve the consistency of the method between labs?

A:
Yes, we normalized the yeast data. Here is how we preprocessed it:

1) Obtained the time-series yeast cell cycle data (alpha, cdc15, cdc28) from
http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt,
which are given as logarithm values.
2) Standardized the exponentiated data, 2^(data), such that each time point has mean = 0 and sigma = 1.
3) Binarized the standardized data using the function
binarizeTimeSeries with ‘kmeans’ clustering in the R package BoolNet.
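
For readers who want to see steps 2 and 3 spelled out, here is a minimal Python sketch. It is a stand-in, not our pipeline: we used BoolNet in R, and the plain two-cluster k-means below is not BoolNet's implementation. It assumes a genes-by-time-points matrix with no missing values, base-2 logarithms, the sample standard deviation (n-1), and per-gene binarization. All function names are hypothetical.

```python
from statistics import mean, stdev


def standardize_columns(log2_matrix):
    """Undo the log2 transform, then scale each time point (column) to mean 0, sd 1."""
    linear = [[2.0 ** value for value in row] for row in log2_matrix]
    columns = list(zip(*linear))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample sd; an assumption
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in linear
    ]


def two_means_binarize(values, iterations=100):
    """Split one gene's values into low (0) and high (1) with 1-D k-means, k = 2."""
    low_center, high_center = min(values), max(values)
    if low_center == high_center:
        return [0] * len(values)  # constant series: nothing to split
    for _ in range(iterations):
        labels = [
            1 if abs(v - high_center) < abs(v - low_center) else 0
            for v in values
        ]
        low = [v for v, label in zip(values, labels) if label == 0]
        high = [v for v, label in zip(values, labels) if label == 1]
        new_centers = (mean(low), mean(high))
        if new_centers == (low_center, high_center):
            break  # centers stopped moving
        low_center, high_center = new_centers
    return labels


def binarize_rows(matrix):
    """Binarize each gene (row) independently."""
    return [two_means_binarize(row) for row in matrix]
```

Differences in any of these choices (log base, which axis is standardized, how missing values are handled, per-gene versus pooled clustering) can change the binarized output, so they are worth checking when results differ between labs.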

Request for the pdf version of the article

Q:
Currently my research
area focuses on whole genome sequencing (WGS) of Indian samples. However,
during my PhD I worked on the study of copy number variation in the Indian
population and its implications in health.

Could you please send the following article, "The current excitement about
copy-number variation: how it relates to gene duplications and protein
families", in PDF format for my reference?

A:
Thank you for requesting copies of some of my recent
papers. Essentially all of my work is available on-line. Go to:

http://papers.gersteinlab.org

and click on the appropriate "preprint" link. You will get a
preprint or (if appropriate) journal reprint of the paper you want.
There should be NO password challenges or other barriers. Usually, the
papers are in PDF format but some are in HTML. (Other formats are
available directly from http://papers.gersteinlab.org/e-print.)

Please let me know if you have any problems with this service. If you
can’t get what you want, we can easily post you normal paper reprints.

Questions about the STRESS method

Q1:
I thoroughly enjoyed reading your recent article regarding allosteric hotspot
detection.

I am interested in using this for my proteins and have started submitting my
queries via the provided web server.

It has been running for quite a while, since Thursday/Friday, and I wondered
whether this is normal or whether something has gone wrong.

In addition to this, as I am wanting to do this for a few more PDBs, is
there an option for a batch input?

A1:
We’re happy to hear that you enjoyed reading about our work, though we apologize for the issues you’re experiencing with the server. We are investigating this, but it appears that there is a load issue (all four of our backend CPUs have been running 24/7 since the paper went online). We’ll send you further updates soon.

In the meantime, however, there are two alternative options, and both would also address the need for batch input with multiple structures (the server itself does not provide an option for batch input):

1) We would be more than happy to run as many structures as you like, and we’d start running your structures as soon as you send us the relevant PDBs. We can run over 1000 structures if necessary (we’d be running them on Yale’s HPC machines, not on Amazon).

2) All of the source code is available through GitHub (github.com/gersteinlab/STRESS). If you already have MMTK installed, then everything should be ready to run on the PDB files with which you’re working.

Q2:
Thank you for your very helpful message. I am emailing off my different account as the other one is having trouble attaching documents.

It would be great if you could help run the batch query as I haven’t yet got MMTK installed on my computer. In total I have 163 structures and would be very grateful if you could run STRESS for them.

Here is the .tsv file attached which contains the PDB_ID and CHAIN_ID for my structures.

Just to note, in the file I sent you, the first PDB ID is 1e50; for some reason Google has changed that. I hope this is OK!

A2:
Most of your runs are now finished, and you can access them in the link below. There are also a few notes I should mention:

http://homes.gersteinlab.org/people/dc547/.M_Pang/

1) I noticed that some of your structures are NMR structures (as opposed to X-ray crystal structures). I should mention that I have not tested the STRESS framework on such structures, so I can’t say for sure how well it performs. Also, the big issue is that, since NMR structures are generally given as an ensemble, the question is: when running STRESS, which structure should be used? I took the first model in each NMR structure, but this is somewhat arbitrary. Thus, I would interpret the NMR structures with caution. A list of your NMR structures is pasted here:
1f5y.pdb EXPDTA SOLUTION NMR
1gd5.pdb EXPDTA SOLUTION NMR
1k1g.pdb EXPDTA SOLUTION NMR
1o4x.pdb EXPDTA SOLUTION NMR
1rmj.pdb EXPDTA SOLUTION NMR
1urf.pdb EXPDTA SOLUTION NMR
2cr4.pdb EXPDTA SOLUTION NMR
2dn4.pdb EXPDTA SOLUTION NMR
2e7b.pdb EXPDTA SOLUTION NMR
2edk.pdb EXPDTA SOLUTION NMR
2edl.pdb EXPDTA SOLUTION NMR
2js7.pdb1 EXPDTA SOLUTION NMR
2jwa.pdb EXPDTA SOLUTION NMR
2jzx.pdb1 EXPDTA SOLUTION NMR
2kn6.pdb1 EXPDTA SOLUTION NMR
2l4c.pdb1 EXPDTA SOLUTION NMR

2) In the link above, I’ve also provided a gz file of all the original PDB files used.

3) There was an error when running the surface module on 1m5o. I think that this is a result of the fact that this structure is mostly composed of nucleic acid (HETATMs in the PDB records, which are removed prior to processing because MMTK fails on HETATMs).

4) The interior-module is not yet complete for all 147 of your structures. The following 4 structures are still running (I can send you the results once they’re complete):
2ozl
2zw3
3ezz
3hn3

That’s it for now. Please take a look at the output whenever you get a free moment, and let me know what you think.
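
If you later want to prepare files yourself in the way described in notes 1 and 3, the idea can be sketched in Python roughly as follows. This is not the STRESS code itself, and the function names are hypothetical. It assumes standard fixed-column PDB records, reads the technique from the EXPDTA record, keeps only the first model, and drops HETATM records along with all header records.

```python
def experimental_method(lines):
    """Return the technique given in the EXPDTA record (e.g. 'SOLUTION NMR'), or None."""
    for line in lines:
        if line.startswith("EXPDTA"):
            return line[10:].strip()  # technique text starts at column 11
    return None


def first_model_atoms_only(lines):
    """Keep ATOM/TER records up to the first ENDMDL; HETATM and other records are dropped."""
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # end of the first model in an NMR ensemble
        if record in ("ATOM", "TER"):
            kept.append(line)
    kept.append("END")
    return kept

# Example use on a local file (the path is hypothetical):
#     with open("1f5y.pdb") as handle:
#         lines = handle.read().splitlines()
#     if experimental_method(lines) == "SOLUTION NMR":
#         print("NMR ensemble: only the first model is kept")
#     cleaned = first_model_atoms_only(lines)
```

For a single-model crystal structure there is no ENDMDL record, so every ATOM and TER record is kept.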