Coevolution source code from download link is out of date

Q1:
As I was trying to run coevolution locally, the downloaded source code (http://coevolution.gersteinlab.org/coevolution/dist/coevolution.jar) was out of date (at least URLs of pfam and rcsb pdb). Although I corrected those URLs and recompiled the project, there were still some bugs as shown below. I am now struggling to fix them. Meanwhile, could you please help me for the problem? I will appreciate very much.

Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java]
[java]
[java] org.gersteinlab.coevolution.core.data.DataFormatException: Cannot find the separator between the ID and the positions.
[java] at org.gersteinlab.coevolution.core.data.PfamFormatUtil.parseId(PfamFormatUtil.java:34)
[java] at org.gersteinlab.coevolution.core.data.PfamFastaProteinSequence.<init>(PfamFastaProteinSequence.java:61)
[java] at org.gersteinlab.coevolution.core.io.PfamFastaProteinMsaReader.readNextSequence(PfamFastaProteinMsaReader.java:55)
[java] at org.gersteinlab.coevolution.core.io.MsaReader.readMsa(MsaReader.java:75)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.init(MsaFilter.java:128)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:387)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

Also — It seems that the treeURL is not correct.
All necessary URLs are modified as below:

URL pdbURL = new URL("https://files.rcsb.org/download/" + pdbID.toUpperCase() + ".pdb.gz");
URL msaURL = new URL("http://pfam.xfam.org/family/alignment/download/format?alnType=" + (seedOnly ?"seed" :"full") + "&format=fasta&order=t&gaps=default&download=downloadD&acc=" + pfamID.toUpperCase());
URL msaURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/alignment/" + (seedOnly ?"seed" :"full") + "/gzipped");
URL treeURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/tree/download");

Updated Exceptions:
Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java] Performing sequence filtering of the MSA of PF01036 from Pfam…
[java]
[java]
[java] java.lang.IllegalArgumentException: Node [A0A1S9DF11_ASPOZ/53-285] cannot be found in the tree.
[java] at org.gersteinlab.coevolution.core.data.NewickTree.removeNode(NewickTree.java:112)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.filterSequences(MsaFilter.java:214)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:391)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

A1:
Please check whether the tree file can be downloaded. If not, I think the problem can be easily fixed by changing the URL in the fourth line. This is likely caused by a change of the Pfam web site.

If the file can be downloaded but the error still exists, then it is likely more related to the ID of each sequence in the different file. Please check whether the ID "A0A1S9DF11_ASPOZ/53-285" can be found in the tree file.

Q2:
I have check the tree file and the ID "A0A1S9DF11_ASPOZ/53-285" dose not exist. Is the tree file right?

data/intra/PF01036.tree:
(((BACS2_HALSA/5-220:0.72860,(BACS2_HALMA/5-224:0.61512,BACS2_NATPH/5-223:0.59067)0.650:0.09815)0.820:0.08903,(C7P1Y4_HALMD/5-221:1.18066,D3SUL9_NATMM/5-219:1.46984)0.970:0.58374)0.700:0.08760,(BACH_NATPH/35-274:1.24503,(BACR_HALAR/8-238:0.28604,(BACR_HALSA/23-247:0.30384,BACR1_HALC1/22-246:0.37112)0.960:0.20210)0.830:0.21189)0.960:0.32269,(B6BSG6_9PROT/34-253:1.66008,((B5RTR5_DEBHA/38-284:0.28614,(A3LUH9_PICST/37-279:0.24739,(C4YF64_CANAW/43-284:0.53146,B9W6Y7_CANDC/40-281:0.13058)1.000:0.31032)0.820:0.12498)0.910:0.32193,(C5E3Q5_LACTC/36-281:0.56486,C5DYF7_ZYGRC/38-283:0.47362)0.810:0.26951)1.000:1.78728)0.700:0.15130);

A2:
Then the MSA file and tree file from Pfam do not match. I am not sure why it happens. Maybe one is seed alignment and the other is full alignment?

A3:
Yes, you are right. The tree file is seed alignment but the MSA file is full alignment. There is only one URL for the tree file in the pfam web site (https://pfam.xfam.org/family/PF01036#tabview=tab5).

If both files are seed alignment, then the project will give exception "Not enough sequences". The default value of minSeqCount in intra.config file is 50 but only 16 sequences are left.
I think about two ways to fix the problem. One is setting smaller value of minSeqCount. The other one is localizing the method of generating the tree file (FastTree) and then generate full alignment formatted tree file.
Actually, I don’t know the difference of full alignment and seed alignment for the project.

If you think the 16 seed sequences are enough, you can bypass the minimum threshold. On the web site, you can find that option by clicking the "Show advanced options" link at the bottom.

But if you need more sequences, then either you produce the tree by yourself, or use a method that does not require the tree.

Problem to use coevolution server

Q:
I want to use your co-evolution website to find my co-evolution amino
acid in my interested protein. I found my proteins family in different
species by blast. but when i uploaded it to the server. server sent me an
email that told me your sequences are not enough.i loaded an example in your
server to found out how it worked but it had a problem too and error
message sent to me.please help me to how to use this server.

A:
We set the minimum number of sequences as 50 to ensure statistical significance
of the results. If you really want to run with fewer sequences, you can change
the number in the advanced options.

As for the other error, please send me the error message if you still cannot run
tasks successfully.

Query on SCA coevolution

Q:
I have conducted structure based statistical coupling analyses (SCA) on each
of some mitochondrial proteins using 800 multiple sequences (including one
sequence from our organisms, one 3RKO structure sequence, and 788 protein
sequences from different genera), and we could obtain the coevolutionary
scores and spatial distances between any pair of two residues. The aim of
our study is try to analyze the coevolutionary role of some important given
residues (selected by PAML analyses) on key or important residues
responsible for proton translocation in the proton translocating channel of
respiratory Complex I. The problem is we are not sure how to do it in a more
statistical way. Such as, we could have the data of scores and distances of
a given selected residue on these residues in proton channel or other
residues of the same protein. In order to know possible different
coevolutionary role of a given residue i.e. the selective residue on proton
channel residues or other residues, t-test on scores (s), or distances (d)
or sores/distinces (s/d) were compared by us between those types of
residues, we are not sure if this kind of analyse is ok for us. Such as we
don’t know whether the score obtained by SCA analyses in the platform has
alreadly considered the potential role of distance, or it is just the score
obtained no mattter where both residues are? We know the influencing role
between any two given residues might be correlated with both their
characteristics and spatial distance between them.

Do you have any good idea on this, or do you have more reasonable
statistical way to solve our queries and prolem above?

A:
The scores were calculated based on the MSA alone without
considering the spatial distance between residues.

You may want to plot the global distribution of scores, and look
for scores that are significantly larger than the rest but cannot be
explained by the distance on the primary sequence alone. Indirect
coupling between residues though other residues is also something to be
aware of. There have been a lot of new papers about co-evolutionary
analysis lately (e.g., from Rama Ranganathan’s and Debora Marks’s labs).

Submitting MSA to coevolution webpage

Q:
I am trying to submit an MSA to your "coevolution" webpage — and it keeps failing on me.
i keep getting an error email saying "could not be completed" Error message:Not enough sequences.

im sure that my MSA is in fasta format… so not really sure what is going wrong…
if you could help me out, it would be much appreciated.

A:
The tool performs a number of filtering steps, to ensure the
reliability of the results. The error messages states that after
filtering, the number of remaining sequences is too small for a
reliable analysis. You may change the filtering criteria using the
advanced options (hidden by default), but please notice that by doing
so you may get results that are unreliable.

Task failed coevolution

Q:
The seed alignment of PF01036 has been changed recently.
BACR_HALSA is no longer one of the seed sequences. I have just changed
the example to use BACR_HALAR instead, and the program ran fine.
Please let me know if you encounter any problem running the program on
your own data.

A:
I’ve been trying to use your server, but evidently it is not working
correctly; your example data even fails to process.

I am eager to use this server (and likewise cite you), so please let
me know at your earliest convenience if/how I can use this server.

coevolution webserver

Q:

I have read the article “An integrated system for studying residue coevolution in proteins” and start to use the coevolution webserver http://coevolution.gersteinlab.org/coevolution/. The help page says SCA is one of the coevolution score function, which I can‘t find on the webpage. Could you please tell me what is wrong?

A:
SCA is only available in the version for download, but not at the Web site anymore since it sometimes ran for an extremely long time. If you want to use the SCA method, you can use the download version or the latest SCA method by the Ranganathan group, which I think is better than the version we implemented.

Questions re tool for coevolution analysis

Q:
We are interested in
adding a link to your tool on the website. I have been playing around with
the tool but have been having some difficulty. I loaded the example, ran the
analysis and downloaded the results with no problem. Then, I tried to use
the data from the example to re-run the analysis by uploading the PF01036
fasta file (as an MSA) and listing BACR_HALSA as the reference sequence. I
did not load the tree data or the structure data. I submitted the request
and received the following error:

The coevolution analysis task that you submitted at 2013-09-11 11:35:07.0
could not be completed.

Error message:Not enough sequences.

In order to justify the addition of the link to the website, we need to make
sure that the web interface is simple and easy to use. Can you help me
understand what the problem might be (clearly I have enough sequences since
the example runs without any issues)? Also, can you explain to me the link
between the 1C3W reference structure and the sequence data?

Any help you can give would be greatly appreciated.

A:
The error message is due to an internal check of the system. Since
coevolution analysis requires a good number of sequences to give
reliable statistics, there is a minimum threshold of the number of
sequences after the filtering step. When the example is loaded, I think
some of the filtering settings are made so that it can pass the minimum
threshold. When one manually uploads the MSA, the default settings could
be different.

You may try changing the setting for the minimum number of
sequences in the "Advanced options" section (which is by default hidden)
and rerun. Let me know if you still get any error message.

Completion of job on coevolution server

Q:

we are trying to find the co-evolving positions in a protein family of interest. I had submitted a job on the co-evolution server several days back, but I have not received a response yet. Could you please let me know the estimated time of completion of my job?

A:
There was a long queue of pending tasks, and one of had been stuck in the queue for some time. I have removed it to let the others run. Please see if you can get your results within a day. If not, please let me know and I will check the system again.