Coevolution source code from download link is out of date

Q1:
As I was trying to run coevolution locally, the downloaded source code (http://coevolution.gersteinlab.org/coevolution/dist/coevolution.jar) was out of date (at least URLs of pfam and rcsb pdb). Although I corrected those URLs and recompiled the project, there were still some bugs as shown below. I am now struggling to fix them. Meanwhile, could you please help me for the problem? I will appreciate very much.

Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java]
[java]
[java] org.gersteinlab.coevolution.core.data.DataFormatException: Cannot find the separator between the ID and the positions.
[java] at org.gersteinlab.coevolution.core.data.PfamFormatUtil.parseId(PfamFormatUtil.java:34)
[java] at org.gersteinlab.coevolution.core.data.PfamFastaProteinSequence.<init>(PfamFastaProteinSequence.java:61)
[java] at org.gersteinlab.coevolution.core.io.PfamFastaProteinMsaReader.readNextSequence(PfamFastaProteinMsaReader.java:55)
[java] at org.gersteinlab.coevolution.core.io.MsaReader.readMsa(MsaReader.java:75)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.init(MsaFilter.java:128)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:387)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

Also — It seems that the treeURL is not correct.
All necessary URLs are modified as below:

URL pdbURL = new URL("https://files.rcsb.org/download/" + pdbID.toUpperCase() + ".pdb.gz");
URL msaURL = new URL("http://pfam.xfam.org/family/alignment/download/format?alnType=" + (seedOnly ?"seed" :"full") + "&format=fasta&order=t&gaps=default&download=downloadD&acc=" + pfamID.toUpperCase());
URL msaURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/alignment/" + (seedOnly ?"seed" :"full") + "/gzipped");
URL treeURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/tree/download");

Updated Exceptions:
Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java] Performing sequence filtering of the MSA of PF01036 from Pfam…
[java]
[java]
[java] java.lang.IllegalArgumentException: Node [A0A1S9DF11_ASPOZ/53-285] cannot be found in the tree.
[java] at org.gersteinlab.coevolution.core.data.NewickTree.removeNode(NewickTree.java:112)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.filterSequences(MsaFilter.java:214)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:391)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

A1:
Please check whether the tree file can be downloaded. If not, I think the problem can be easily fixed by changing the URL in the fourth line. This is likely caused by a change of the Pfam web site.

If the file can be downloaded but the error still exists, then it is likely more related to the ID of each sequence in the different file. Please check whether the ID "A0A1S9DF11_ASPOZ/53-285" can be found in the tree file.

Q2:
I have check the tree file and the ID "A0A1S9DF11_ASPOZ/53-285" dose not exist. Is the tree file right?

data/intra/PF01036.tree:
(((BACS2_HALSA/5-220:0.72860,(BACS2_HALMA/5-224:0.61512,BACS2_NATPH/5-223:0.59067)0.650:0.09815)0.820:0.08903,(C7P1Y4_HALMD/5-221:1.18066,D3SUL9_NATMM/5-219:1.46984)0.970:0.58374)0.700:0.08760,(BACH_NATPH/35-274:1.24503,(BACR_HALAR/8-238:0.28604,(BACR_HALSA/23-247:0.30384,BACR1_HALC1/22-246:0.37112)0.960:0.20210)0.830:0.21189)0.960:0.32269,(B6BSG6_9PROT/34-253:1.66008,((B5RTR5_DEBHA/38-284:0.28614,(A3LUH9_PICST/37-279:0.24739,(C4YF64_CANAW/43-284:0.53146,B9W6Y7_CANDC/40-281:0.13058)1.000:0.31032)0.820:0.12498)0.910:0.32193,(C5E3Q5_LACTC/36-281:0.56486,C5DYF7_ZYGRC/38-283:0.47362)0.810:0.26951)1.000:1.78728)0.700:0.15130);

A2:
Then the MSA file and tree file from Pfam do not match. I am not sure why it happens. Maybe one is seed alignment and the other is full alignment?

A3:
Yes, you are right. The tree file is seed alignment but the MSA file is full alignment. There is only one URL for the tree file in the pfam web site (https://pfam.xfam.org/family/PF01036#tabview=tab5).

If both files are seed alignment, then the project will give exception "Not enough sequences". The default value of minSeqCount in intra.config file is 50 but only 16 sequences are left.
I think about two ways to fix the problem. One is setting smaller value of minSeqCount. The other one is localizing the method of generating the tree file (FastTree) and then generate full alignment formatted tree file.
Actually, I don’t know the difference of full alignment and seed alignment for the project.

If you think the 16 seed sequences are enough, you can bypass the minimum threshold. On the web site, you can find that option by clicking the "Show advanced options" link at the bottom.

But if you need more sequences, then either you produce the tree by yourself, or use a method that does not require the tree.

Problem to use coevolution server

Q:
I want to use your co-evolution website to find my co-evolution amino
acid in my interested protein. I found my proteins family in different
species by blast. but when i uploaded it to the server. server sent me an
email that told me your sequences are not enough.i loaded an example in your
server to found out how it worked but it had a problem too and error
message sent to me.please help me to how to use this server.

A:
We set the minimum number of sequences as 50 to ensure statistical significance
of the results. If you really want to run with fewer sequences, you can change
the number in the advanced options.

As for the other error, please send me the error message if you still cannot run
tasks successfully.

Query on SCA coevolution

Q:
I have conducted structure based statistical coupling analyses (SCA) on each
of some mitochondrial proteins using 800 multiple sequences (including one
sequence from our organisms, one 3RKO structure sequence, and 788 protein
sequences from different genera), and we could obtain the coevolutionary
scores and spatial distances between any pair of two residues. The aim of
our study is try to analyze the coevolutionary role of some important given
residues (selected by PAML analyses) on key or important residues
responsible for proton translocation in the proton translocating channel of
respiratory Complex I. The problem is we are not sure how to do it in a more
statistical way. Such as, we could have the data of scores and distances of
a given selected residue on these residues in proton channel or other
residues of the same protein. In order to know possible different
coevolutionary role of a given residue i.e. the selective residue on proton
channel residues or other residues, t-test on scores (s), or distances (d)
or sores/distinces (s/d) were compared by us between those types of
residues, we are not sure if this kind of analyse is ok for us. Such as we
don’t know whether the score obtained by SCA analyses in the platform has
alreadly considered the potential role of distance, or it is just the score
obtained no mattter where both residues are? We know the influencing role
between any two given residues might be correlated with both their
characteristics and spatial distance between them.

Do you have any good idea on this, or do you have more reasonable
statistical way to solve our queries and prolem above?

A:
The scores were calculated based on the MSA alone without
considering the spatial distance between residues.

You may want to plot the global distribution of scores, and look
for scores that are significantly larger than the rest but cannot be
explained by the distance on the primary sequence alone. Indirect
coupling between residues though other residues is also something to be
aware of. There have been a lot of new papers about co-evolutionary
analysis lately (e.g., from Rama Ranganathan’s and Debora Marks’s labs).

Java chromod package request CoassociationAnalyzer.java and GSCCoassociationAnalyzer.java scripts that Kevin Yip wrote (April 14, 2011)

Q:
I’m writing to you to see if you could share with me your java "chromod" package – I’m wanting to use the CoassociationAnalyzer.java and GSCCoassociationAnalyzer.java scripts that Kevin Yip wrote (April 14, 2011), but they rely on the chromod package (package org.gersteinlab.chromod)

If you could share this with me if it’s not a top secret lab package, I would be hugely indebted!

A:
Please download it at http://www.cse.cuhk.edu.hk/~kevinyip/outbox/chromod.jar . Let me know if you encounter any problem when using it.

Submitting MSA to coevolution webpage

Q:
I am trying to submit an MSA to your "coevolution" webpage — and it keeps failing on me.
i keep getting an error email saying "could not be completed" Error message:Not enough sequences.

im sure that my MSA is in fasta format… so not really sure what is going wrong…
if you could help me out, it would be much appreciated.

A:
The tool performs a number of filtering steps, to ensure the
reliability of the results. The error messages states that after
filtering, the number of remaining sequences is too small for a
reliable analysis. You may change the filtering criteria using the
advanced options (hidden by default), but please notice that by doing
so you may get results that are unreliable.

Task failed coevolution

Q:
The seed alignment of PF01036 has been changed recently.
BACR_HALSA is no longer one of the seed sequences. I have just changed
the example to use BACR_HALAR instead, and the program ran fine.
Please let me know if you encounter any problem running the program on
your own data.

A:
I’ve been trying to use your server, but evidently it is not working
correctly; your example data even fails to process.

I am eager to use this server (and likewise cite you), so please let
me know at your earliest convenience if/how I can use this server.

Ask for help about your paper (PLoS ONE, 2010)

Q:
Currently, I’m working on reconstructing gene regulatory network. It’s
really an interesting topic and I would like to estimate which tools is
suitable for our experimental data. I have read your published paper
"Improved Reconstruction of In Silico Gene Regulatory Networks by
Integrating Knockout and Perturbation
Data". In this paper, I can’t understand the section of learning noise from
deletion data.
Step 1: Calculate the probability of regulation Pb->a for each pair of genes
(b,a). I want to know how to calculate this probability, and this value of
probability can decide potential regulation or not?
I want to ask you that how to work in this section, and I’m appreciated if
you can help me to figure out.

A: Basically we used the expression levels currently believed to be
unaffected by a deletion to form a Gaussian background. Then if a gene
has an expression level far away from the mean of this Gaussian
distribution (by calculating the probability that the expression is as
extreme or more extreme than the observed one based on the Gaussian), we
consider the gene to be affected by the deletion.

coevolution webserver

Q:

I have read the article “An integrated system for studying residue coevolution in proteins” and start to use the coevolution webserver http://coevolution.gersteinlab.org/coevolution/. The help page says SCA is one of the coevolution score function, which I can‘t find on the webpage. Could you please tell me what is wrong?

A:
SCA is only available in the version for download, but not at the Web site anymore since it sometimes ran for an extremely long time. If you want to use the SCA method, you can use the download version or the latest SCA method by the Ranganathan group, which I think is better than the version we implemented.

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors

Q:
in your research paper "Classification of human genomic regions based
on experimentally determined binding sites of more than 100
transcription-related factors" you identified regions with extremely
high and low degrees of co-binding, termed respectively "HOT" and "LOT"
regions. Since I’m very interested in this classification, I tried to
reproduce this analysis on transcription-related factor (TRF) data of a
particular cell type.

I first downloaded the Genome Structure Correction scripts from

http://www.encodestatistics.org/svn/genome_structural_correction/python_encode_statistics/trunk/

and ran "block_bootstrap.py" on every pair of TRF data, thus obtaining
a matrix with z-scores. I then computed a raw z-score for each TRF,
defined as the average z-score with all other TRFs in the matrix. I
finally sorted these raw z-scores numerically and normalized them
linearly, so as to assign a weight of 1 to the TRF with the lowest raw
z-score and a weight of 1/n to the TRF with the highest raw z-score.
I’m afraid it is not clear to me what I should do next for the
identification of HOT and LOT regions: I would be very grateful if you
could help me with this analysis.

A:
The z-scores that you have computed can be considered as the global weights of the TRFs, where a TRF that more frequently binds to the same locations of other TRFs receive a lower weight, in order to de-emphasize the global co-binding effects.

For each bin j, the weighted binding score (i.e., degree of region-specific co-occurrence) was computed as d_j = \sum_i w_i s_ij, where i iterates through all TRFs, w_i is the weight of TRF i as defined above, and s_ij is its discretized binding signal at bin j (1-5, 5 for the top 25 percentile and 1 for zeros). The top 1% bins with the highest d_j were defined as the HOT regions, while the 1% bins with the lowest non-zero d_j were defined as the LOT regions.

Please notice that in the original calculations, when the block sampling program was run, the human genome was segmented into three classes, namely DNase hypersensitive peaks, DNase hypersensitive non-peak hotspots, and other regions. The idea was that the prior TRF binding probabilities in these three classes of region could be quite different, and thus they should be separately considered during the sampling process.

Questions re tool for coevolution analysis

Q:
We are interested in
adding a link to your tool on the website. I have been playing around with
the tool but have been having some difficulty. I loaded the example, ran the
analysis and downloaded the results with no problem. Then, I tried to use
the data from the example to re-run the analysis by uploading the PF01036
fasta file (as an MSA) and listing BACR_HALSA as the reference sequence. I
did not load the tree data or the structure data. I submitted the request
and received the following error:

The coevolution analysis task that you submitted at 2013-09-11 11:35:07.0
could not be completed.

Error message:Not enough sequences.

In order to justify the addition of the link to the website, we need to make
sure that the web interface is simple and easy to use. Can you help me
understand what the problem might be (clearly I have enough sequences since
the example runs without any issues)? Also, can you explain to me the link
between the 1C3W reference structure and the sequence data?

Any help you can give would be greatly appreciated.

A:
The error message is due to an internal check of the system. Since
coevolution analysis requires a good number of sequences to give
reliable statistics, there is a minimum threshold of the number of
sequences after the filtering step. When the example is loaded, I think
some of the filtering settings are made so that it can pass the minimum
threshold. When one manually uploads the MSA, the default settings could
be different.

You may try changing the setting for the minimum number of
sequences in the "Advanced options" section (which is by default hidden)
and rerun. Let me know if you still get any error message.