Q1:
I tried to run coevolution locally, but the downloaded source code (http://coevolution.gersteinlab.org/coevolution/dist/coevolution.jar) is out of date (at least the Pfam and RCSB PDB URLs). I corrected those URLs and recompiled the project, but it still fails with the error shown below. I am trying to fix it myself; in the meantime, could you please help me with this problem? I would appreciate it very much.
Buildfile: /home/xiety/software/coevolutiontool/build.xml
intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java]
[java]
[java] org.gersteinlab.coevolution.core.data.DataFormatException: Cannot find the separator between the ID and the positions.
[java] at org.gersteinlab.coevolution.core.data.PfamFormatUtil.parseId(PfamFormatUtil.java:34)
[java] at org.gersteinlab.coevolution.core.data.PfamFastaProteinSequence.<init>(PfamFastaProteinSequence.java:61)
[java] at org.gersteinlab.coevolution.core.io.PfamFastaProteinMsaReader.readNextSequence(PfamFastaProteinMsaReader.java:55)
[java] at org.gersteinlab.coevolution.core.io.MsaReader.readMsa(MsaReader.java:75)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.init(MsaFilter.java:128)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:387)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)
It also seems that the treeURL is not correct.
I modified all the necessary URLs as follows:
URL pdbURL = new URL("https://files.rcsb.org/download/" + pdbID.toUpperCase() + ".pdb.gz");
URL msaURL = new URL("http://pfam.xfam.org/family/alignment/download/format?alnType=" + (seedOnly ? "seed" : "full") + "&format=fasta&order=t&gaps=default&download=downloadD&acc=" + pfamID.toUpperCase());
URL msaURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/alignment/" + (seedOnly ? "seed" : "full") + "/gzipped");
URL treeURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/tree/download");
Updated exception:
Buildfile: /home/xiety/software/coevolutiontool/build.xml
intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java] Performing sequence filtering of the MSA of PF01036 from Pfam…
[java]
[java]
[java] java.lang.IllegalArgumentException: Node [A0A1S9DF11_ASPOZ/53-285] cannot be found in the tree.
[java] at org.gersteinlab.coevolution.core.data.NewickTree.removeNode(NewickTree.java:112)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.filterSequences(MsaFilter.java:214)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:391)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)
A1:
Please check whether the tree file can be downloaded. If not, the problem can probably be fixed by changing the URL in the fourth line; it is most likely caused by a change to the Pfam web site.
If the file can be downloaded but the error persists, the cause is more likely a mismatch between the sequence IDs in the different files. Please check whether the ID "A0A1S9DF11_ASPOZ/53-285" can be found in the tree file.
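Both checks can be scripted. The following is only a minimal sketch, not part of the coevolution code base: the class NewickLeafCheck and its methods are hypothetical names, and it assumes the tree file is plain Newick with unquoted leaf labels. It tests whether the downloaded text looks like a tree at all (rather than, say, an HTML error page) and whether a given sequence ID appears among the leaves.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the coevolution source code.
class NewickLeafCheck {

    // Rough sanity check: a Newick tree starts with '(' and ends with ';'.
    static boolean looksLikeNewick(String text) {
        String t = text.trim();
        return t.startsWith("(") && t.endsWith(";");
    }

    // Collects leaf labels: a leaf label directly follows '(' or ','.
    // Labels that follow ')' (support values such as 0.960) are skipped.
    // Assumes unquoted labels, as in the Pfam tree shown in this thread.
    static Set<String> leafNames(String newick) {
        Set<String> leaves = new LinkedHashSet<>();
        int n = newick.length();
        int i = 0;
        while (i < n) {
            char c = newick.charAt(i);
            if (c == '(' || c == ',') {
                int start = i + 1;
                int end = start;
                while (end < n && "(),:;".indexOf(newick.charAt(end)) < 0) {
                    end++;
                }
                String label = newick.substring(start, end).trim();
                if (!label.isEmpty()) {
                    leaves.add(label);
                }
                i = end; // continue at the delimiter that ended the label
            } else {
                i++;
            }
        }
        return leaves;
    }

    public static void main(String[] args) {
        String tree = "((BACR_HALSA/23-247:0.30384,BACR1_HALC1/22-246:0.37112)"
                + "0.960:0.20210,BACR_HALAR/8-238:0.28604);";
        System.out.println("Looks like Newick: " + looksLikeNewick(tree));
        System.out.println("Contains ID: "
                + leafNames(tree).contains("A0A1S9DF11_ASPOZ/53-285"));
    }
}
```

In practice the tree string would be read from the downloaded file (data/intra/PF01036.tree in this run) instead of being hard-coded.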
Q2:
I have checked the tree file, and the ID "A0A1S9DF11_ASPOZ/53-285" does not exist in it. Is the tree file correct?
data/intra/PF01036.tree:
(((BACS2_HALSA/5-220:0.72860,(BACS2_HALMA/5-224:0.61512,BACS2_NATPH/5-223:0.59067)0.650:0.09815)0.820:0.08903,(C7P1Y4_HALMD/5-221:1.18066,D3SUL9_NATMM/5-219:1.46984)0.970:0.58374)0.700:0.08760,(BACH_NATPH/35-274:1.24503,(BACR_HALAR/8-238:0.28604,(BACR_HALSA/23-247:0.30384,BACR1_HALC1/22-246:0.37112)0.960:0.20210)0.830:0.21189)0.960:0.32269,(B6BSG6_9PROT/34-253:1.66008,((B5RTR5_DEBHA/38-284:0.28614,(A3LUH9_PICST/37-279:0.24739,(C4YF64_CANAW/43-284:0.53146,B9W6Y7_CANDC/40-281:0.13058)1.000:0.31032)0.820:0.12498)0.910:0.32193,(C5E3Q5_LACTC/36-281:0.56486,C5DYF7_ZYGRC/38-283:0.47362)0.810:0.26951)1.000:1.78728)0.700:0.15130);
A2:
Then the MSA file and the tree file from Pfam do not match. I am not sure why this happens. Maybe one is from the seed alignment and the other from the full alignment?
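One way to confirm such a mismatch is to list every MSA sequence ID that has no leaf in the tree. The sketch below is illustrative only: MsaTreeMismatch is a hypothetical class, it assumes the MSA is in FASTA format with the sequence ID as the first token of each header line, and it takes the tree's leaf names as an already extracted set.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the coevolution source code.
class MsaTreeMismatch {

    // Extracts sequence IDs from FASTA header lines (">ID optional description").
    static Set<String> fastaIds(List<String> fastaLines) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                String header = line.substring(1).trim();
                if (!header.isEmpty()) {
                    ids.add(header.split("\\s+")[0]);
                }
            }
        }
        return ids;
    }

    // Returns the MSA IDs that do not occur among the tree's leaf names.
    static Set<String> missingFromTree(Set<String> msaIds, Set<String> treeLeaves) {
        Set<String> missing = new LinkedHashSet<>(msaIds);
        missing.removeAll(treeLeaves);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up residues; only the header lines matter for this check.
        List<String> msa = Arrays.asList(
                ">BACR_HALSA/23-247", "AAAA",
                ">A0A1S9DF11_ASPOZ/53-285", "CCCC");
        Set<String> leaves = new LinkedHashSet<>(
                Arrays.asList("BACR_HALSA/23-247", "BACR_HALAR/8-238"));
        System.out.println("Missing from tree: "
                + missingFromTree(fastaIds(msa), leaves));
    }
}
```

If most MSA IDs are reported as missing, the two files were probably built from different alignments rather than differing by a few renamed entries.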
A3:
Yes, you are right. The tree file is for the seed alignment, but the MSA file is the full alignment. The Pfam web site offers only one URL for the tree file (https://pfam.xfam.org/family/PF01036#tabview=tab5).
If both files use the seed alignment, the project throws the exception "Not enough sequences": the default value of minSeqCount in the intra.config file is 50, but only 16 sequences are left.
I can think of two ways to fix the problem. One is to set a smaller value for minSeqCount. The other is to run the tree-building method (FastTree) locally and generate a tree file for the full alignment.
Actually, I do not know what difference using the full alignment versus the seed alignment makes for the project.
If you think the 16 seed sequences are enough, you can bypass the minimum threshold. On the web site, that option appears after clicking the "Show advanced options" link at the bottom.
But if you need more sequences, you must either produce the tree yourself or use a method that does not require the tree.
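For the option of producing the tree yourself, a locally installed FastTree can be called from Java. This is a rough sketch under several assumptions: LocalTreeBuilder is a hypothetical class, the FastTree executable is on the PATH under the name passed in, the full alignment has been saved as a protein FASTA file, and the resulting Newick output is in a form the coevolution tree reader accepts (not verified here). It relies on FastTree's basic usage of reading a protein alignment file and writing the tree to standard output.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the coevolution source code.
class LocalTreeBuilder {

    // Basic FastTree usage for a protein alignment: FastTree <alignment>
    static List<String> fastTreeCommand(String executable, File alignment) {
        return Arrays.asList(executable, alignment.getPath());
    }

    // Runs FastTree and redirects its standard output (the tree) to treeOut.
    // Returns the process exit code; 0 normally means success.
    static int buildTree(String executable, File alignment, File treeOut)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(fastTreeCommand(executable, alignment));
        pb.redirectOutput(treeOut);                        // tree goes to the file
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // progress log stays visible
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: java LocalTreeBuilder <alignment.fasta> <out.tree>");
            return;
        }
        int exit = buildTree("FastTree", new File(args[0]), new File(args[1]));
        System.out.println("FastTree exit code: " + exit);
    }
}
```

Because the tree is built from the same alignment file that the project reads, every sequence ID in the MSA should then have a matching leaf.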