Coevolution source code from download link is out of date

Q1:
While trying to run coevolution locally, I found that the downloaded source code (http://coevolution.gersteinlab.org/coevolution/dist/coevolution.jar) is out of date (at least the URLs for Pfam and RCSB PDB). Although I corrected those URLs and recompiled the project, there are still some bugs, as shown below. I am now struggling to fix them. Could you please help me with this problem? I would appreciate it very much.

Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java]
[java]
[java] org.gersteinlab.coevolution.core.data.DataFormatException: Cannot find the separator between the ID and the positions.
[java] at org.gersteinlab.coevolution.core.data.PfamFormatUtil.parseId(PfamFormatUtil.java:34)
[java] at org.gersteinlab.coevolution.core.data.PfamFastaProteinSequence.<init>(PfamFastaProteinSequence.java:61)
[java] at org.gersteinlab.coevolution.core.io.PfamFastaProteinMsaReader.readNextSequence(PfamFastaProteinMsaReader.java:55)
[java] at org.gersteinlab.coevolution.core.io.MsaReader.readMsa(MsaReader.java:75)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.init(MsaFilter.java:128)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:387)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

Also, it seems that the treeURL is not correct.
I modified all the necessary URLs as shown below:

URL pdbURL = new URL("https://files.rcsb.org/download/" + pdbID.toUpperCase() + ".pdb.gz");
URL msaURL = new URL("http://pfam.xfam.org/family/alignment/download/format?alnType=" + (seedOnly ?"seed" :"full") + "&format=fasta&order=t&gaps=default&download=downloadD&acc=" + pfamID.toUpperCase());
URL msaURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/alignment/" + (seedOnly ?"seed" :"full") + "/gzipped");
URL treeURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/tree/download");

Updated Exceptions:
Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java] Performing sequence filtering of the MSA of PF01036 from Pfam…
[java]
[java]
[java] java.lang.IllegalArgumentException: Node [A0A1S9DF11_ASPOZ/53-285] cannot be found in the tree.
[java] at org.gersteinlab.coevolution.core.data.NewickTree.removeNode(NewickTree.java:112)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.filterSequences(MsaFilter.java:214)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:391)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

A1:
Please check whether the tree file can be downloaded. If not, I think the problem can be easily fixed by changing the URL in the fourth line. This is likely caused by a change of the Pfam web site.

If the file can be downloaded but the error still exists, then the problem is more likely that the sequence IDs differ between the files. Please check whether the ID "A0A1S9DF11_ASPOZ/53-285" can be found in the tree file.
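
For the ID check, a minimal Python sketch along these lines may help. It is only an illustration: the function names are made up for this example, it assumes the MSA is an uncompressed FASTA file whose header lines start with the same ID string that appears in the exception, and it uses a simple regular expression rather than a full Newick parser (quoted labels are not handled).

```python
import re


def newick_leaf_names(newick_text):
    """Return the set of leaf labels found in a Newick string.

    A leaf label directly follows '(' or ','. Support values such as
    ')0.650:' follow ')' and branch lengths follow ':', so both are skipped.
    """
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick_text))


def fasta_ids(fasta_text):
    """Return the set of sequence IDs (text after '>' up to whitespace)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def ids_missing_from_tree(fasta_text, newick_text):
    """List the MSA sequence IDs that have no matching leaf in the tree."""
    return sorted(fasta_ids(fasta_text) - newick_leaf_names(newick_text))
```

Reading the tree file (for example data/intra/PF01036.tree) and the downloaded MSA into strings and passing them to ids_missing_from_tree lists every sequence that would trigger the "cannot be found in the tree" error. A very long list suggests that the two files come from different alignments.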

Q2:
I have checked the tree file and the ID "A0A1S9DF11_ASPOZ/53-285" does not exist. Is the tree file right?

data/intra/PF01036.tree:
(((BACS2_HALSA/5-220:0.72860,(BACS2_HALMA/5-224:0.61512,BACS2_NATPH/5-223:0.59067)0.650:0.09815)0.820:0.08903,(C7P1Y4_HALMD/5-221:1.18066,D3SUL9_NATMM/5-219:1.46984)0.970:0.58374)0.700:0.08760,(BACH_NATPH/35-274:1.24503,(BACR_HALAR/8-238:0.28604,(BACR_HALSA/23-247:0.30384,BACR1_HALC1/22-246:0.37112)0.960:0.20210)0.830:0.21189)0.960:0.32269,(B6BSG6_9PROT/34-253:1.66008,((B5RTR5_DEBHA/38-284:0.28614,(A3LUH9_PICST/37-279:0.24739,(C4YF64_CANAW/43-284:0.53146,B9W6Y7_CANDC/40-281:0.13058)1.000:0.31032)0.820:0.12498)0.910:0.32193,(C5E3Q5_LACTC/36-281:0.56486,C5DYF7_ZYGRC/38-283:0.47362)0.810:0.26951)1.000:1.78728)0.700:0.15130);

A2:
Then the MSA file and tree file from Pfam do not match. I am not sure why this happens. Maybe one is from the seed alignment and the other is from the full alignment?

Q3:
Yes, you are right. The tree file is for the seed alignment, but the MSA file is the full alignment. There is only one URL for the tree file on the Pfam web site (https://pfam.xfam.org/family/PF01036#tabview=tab5).

If both files are for the seed alignment, then the program throws a "Not enough sequences" exception. The default value of minSeqCount in the intra.config file is 50, but only 16 sequences are left.
I can think of two ways to fix the problem. One is to set a smaller value for minSeqCount. The other is to run the tree-building method (FastTree) locally and generate a tree file for the full alignment.
Actually, I don’t know what difference the full alignment versus the seed alignment makes for the project.

A3:
If you think the 16 seed sequences are enough, you can bypass the minimum threshold. On the web site, you can find that option by clicking the "Show advanced options" link at the bottom.

But if you need more sequences, then either produce the tree yourself or use a method that does not require the tree.
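
If you go the route of producing the tree yourself, a rough Python sketch for calling a locally installed FastTree could look like this. It relies on FastTree's basic usage for protein alignments (alignment file as the argument, Newick tree written to standard output). The function names and the executable name "FastTree" on the PATH are assumptions for this example, and whether coevolution accepts the resulting tree without further changes has not been checked.

```python
import shutil
import subprocess


def fasttree_command(alignment_path, executable="FastTree"):
    """Build the command line: FastTree reads the alignment file and
    writes the Newick tree to standard output."""
    return [executable, alignment_path]


def build_tree(alignment_path, tree_path, executable="FastTree"):
    """Run FastTree on an uncompressed protein alignment and save the tree."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(executable + " was not found on the PATH")
    with open(tree_path, "w") as out:
        subprocess.run(fasttree_command(alignment_path, executable),
                       stdout=out, check=True)
```

The alignment passed in should be the same full-alignment file that coevolution reads, so that the leaf names in the new tree match the sequence IDs in the MSA.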

Request re pseudogene.org

Q:
We develop MH Guide, a genome-guided cancer treatment decision support software (https://www.molecularhealth.com/us/).

I was trying to get the current annotated pseudogene information via http://pseudogene.org/Human/. The link to GENCODE seems not to work and returns “file not found” (https://www.gencodegenes.org/releases/current.html).

Could you please direct me to the file with the annotated pseudogenes?

A:
If you are looking for the current GENCODE annotation, please follow this link: https://www.gencodegenes.org/human/ . If you want to use PseudoPipe to create a custom human annotation for a genome sequence of your choice, please follow the instructions here: http://pseudogene.org/pseudopipe/ . If you are interested in the functional annotation of pseudogenes, including information on pseudogene activity, please see http://pseudogene.org/psicube/ .

Trying to execute Pseudopipe but I am running into multiple errors

Q1:
I am trying to execute Pseudopipe but I am running into multiple errors. I have downloaded the latest version from the website and the command I am using to run it is

/work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/pseudopipe/bin/pseudopipe.sh /work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_output/caenorhabditis_elegans_62_220a /work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa /work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa /work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa /work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chrI_exLocs 0

I keep getting the following errors:

Making directories
Copying sequences
Fomatting the DNAs
/work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/pseudopipe/bin/pseudopipe.sh: line 84: /home/bp272/bin/blast-2.2.13/bin/formatdb: No such file or directory
Preparing the blast jobs
Skipping blast
Processing blast output
/work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/pseudopipe/bin/pseudopipe.sh: line 114: /home/bp272/bin/Python-2.6.6/python: No such file or directory
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
/work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/pseudopipe/bin/pseudopipe.sh: line 144: /home/bp272/bin/Python-2.6.6/python: No such file or directory
Finished Pseudopipe on strand M
Working on P strand
/work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/pseudopipe/bin/pseudopipe.sh: line 144: /home/bp272/bin/Python-2.6.6/python: No such file or directory
Finished Pseudopipe on strand P
Generating final results
find: ‘/work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’: No such file or directory
find: ‘/work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’: No such file or directory
gzip: /work/LAS/rpwise-lab/sagnik/finder/lib/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa: No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe

I am running this inside a conda environment. I tried executing it outside the environment, but it gave me the same errors. Could you please help?

I have posted on the website under the comments section by mistake. Please excuse my ignorance.

A1:
It seems that you did not set the environment file (env.sh) correctly. You may need to set it as follows and put it in the same directory as fetchEnsemblFiles.py and processEnsemblFiles.sh:

###
#!/bin/sh
if [ ! -z "$PSEUDOPIPE_ENV" ]; then source $PSEUDOPIPE_ENV; return; fi

# Pseudopipe configuration

export PSEUDOPIPE_HOME=`cd \`dirname $0\`/../; pwd`

export pseudopipe=$PSEUDOPIPE_HOME/core/runScripts.py

export genPgeneResult=$PSEUDOPIPE_HOME/ext/genPgeneResult.sh

export genFullAln=$PSEUDOPIPE_HOME/ext/genFullAln.sh

export fastaSplitter=$PSEUDOPIPE_HOME/ext/splitFasta.py

export sqDedicated=$PSEUDOPIPE_HOME/ext/sqDedicated.py

export sqDummy=$PSEUDOPIPE_HOME/ext/sqDummy.py

export blastHandler=$PSEUDOPIPE_HOME/core/processBlastOutput.py

export extractExLoc=$PSEUDOPIPE_HOME/core/extractKPExonLocations-Aug2016.py # extractKPExonLocations-Jan2016.py

# Python configuration

export pythonExec=/bin/python2

# Alignment tools configuration

export formatDB=/ysm-gpfs/pi/gerstein/yy532/software/blast-2.2.13/bin/formatdb

export blastExec=/ysm-gpfs/pi/gerstein/yy532/software/blast-2.2.13/bin/blastall

export fastaExec=/ysm-gpfs/pi/gerstein/yy532/software/fasta-35.1.5/tfasty35
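
###

Before rerunning the pipeline, it may be worth confirming that the tool paths in your own env.sh point to files that exist on your machine, since the "No such file or directory" messages above come from paths that are valid only on another system. The following is a minimal Python sketch for that check. The function names are invented for this example, and it only understands plain "export NAME=/some/path" lines, which is enough for the four tool variables shown above.

```python
import os
import re

# Matches simple lines such as: export formatDB=/path/to/formatdb
EXPORT_RE = re.compile(r"^\s*export\s+(\w+)=(\S+)")


def read_exports(env_text):
    """Map variable names to values for simple 'export NAME=value' lines."""
    exports = {}
    for line in env_text.splitlines():
        match = EXPORT_RE.match(line)
        if match:
            exports[match.group(1)] = match.group(2)
    return exports


def missing_tools(env_text,
                  names=("pythonExec", "formatDB", "blastExec", "fastaExec")):
    """Return (name, problem) pairs for tools that are unset or not executable."""
    exports = read_exports(env_text)
    problems = []
    for name in names:
        path = exports.get(name)
        if path is None:
            problems.append((name, "not set"))
        elif not (os.path.isfile(path) and os.access(path, os.X_OK)):
            problems.append((name, path))
    return problems
```

Passing the contents of env.sh to missing_tools returns an empty list when all four executables are in place. Any entry in the list names a path that needs to be corrected first.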

Q2:
Thank you for your reply. I am using python3 in my pipeline. Will this code work for python3?

A2:
Please use python2.

Small question of the paper “Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences”

Q:
Recently, I read a paper published in Cell, titled "Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences". Because my research topic is similar to this paper, I have one question about Figure 2B. In this heatmap, I see a total of 80 motifs along the bottom but only 70 rows above them. I am a little confused: how did you know that the ETS motif matched the marked row?

A:
The rows in the figure correspond to different cancer cohorts or meta-cohorts. We also provide this information on the cancer cohort with significant differential burdening in Supplement 1 in the paper.

PCAWG passenger mutation analysis

Q1:
I was trying to download a subset of data from your recent paper (https://www.cell.com/cell/fulltext/S0092-8674(20)30113-6). However, the website is returning a ‘not found’ error (http://pcawg.gersteinlab.org/). I am especially interested in the ‘Gene list categories’. Could you please share the relevant files listed under ‘Gene List Categories’ on the website, so that I can use them in my analysis?

A1:
The website works fine for me. Are you sure it doesn’t work? … Please let me know which specific file you are trying to download.

Q2:
Thanks a lot for the reply.

I need the gene list categories listed under PCAWG-specific annotations (http://pcawg.gersteinlab.org/#Annotations)

Essential Genes
Immune Response Genes
DNA repair Genes
Metabolic Genes
Cancer Pathway Genes
non-Essential Genes
cell Cycle Genes
For some reason, when I click on the link, it directly downloads an HTML file containing an error. It would be great if you could share these files.

A2:
You can download relevant files from the link listed below.

http://pcawg.gersteinlab.org/Datasets/Annotations/categories/

Question about the cQTL analysis in Wang et al 2018

Q:
I am writing with a question about the cQTL analysis in Wang et al 2018. Were the 292 individuals analyzed in this analysis all of European ancestry? If not, what were the sample sizes for European vs non-European ancestry, and how did you control for ancestry in your analysis?

I apologize for writing with such a detailed question, but I could not find the answer in the main text or supplement of the paper, or on the Synapse website. (Context: I am interested in cross-population genetic analyses of psychiatric disease and wondering if PsychENCODE cQTL data is relevant.)

A:
In calculating the cQTLs, we used 173 Caucasians and 119 non-Caucasians. To control for ancestry, we used the top three genotype principal components as covariates.

DTE results as described in the paper “Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder”

Q:
I was trying to reproduce the DTE results described in the paper "Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder". I am a registered user of Synapse but was unable to find the data mentioned below, and I would really appreciate your help in obtaining them.
The supplementary methods of this paper mention the different covariates used for carrying out DGE and DTE with the nlme package. Would it be possible to obtain the seqPCs and SV values, particularly the seqPCs (1-3, 5-8, 10-14, 16, 18-25, 27-29) and SVs (1-4) used in the lme model?
Additionally, could I obtain the final list of sample IDs that made it into the DGE/DTE analysis?

A:
See the seqPCs we used in our analysis (attached)

Query regarding Pseudopipe

Q:
Since I am working on pseudogene identification for my new project, I have been using your pipeline. However, I am getting a few errors, which I list below. Could you please help me resolve them? I would be very grateful.

ERRORS:
1. On terminal:
sudo bash pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs 0
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Skipping blast
Processing blast output
Skipping the processing of blast output
Running Pseudopipe on both strands
Working on M strand
Finished Pseudopipe on strand M
Working on P strand
Finished Pseudopipe on strand P
Generating final results
find: ‘/home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’: No such file or directory
find: ‘/home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’: No such file or directory
gzip: /home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa: No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe
2. In log file inside minus and plus folder:
need to document overlap parameter (30) and dependency on mask array files.
mask fields [2, 3]
Traceback (most recent call last):
File "/home/kashmir/SOFTWARE/pgenes/pseudopipe/core/filterEnsemblGene.py", line 60, in <module>
maskFile = openOrFail(ExonMaskTemplate % chr, 'r')
TypeError: not all arguments converted during string formatting
running filterEnsemblGene.py
failed during filterEnsemblGene.py stage.

A:
From the output it looks like you had a couple of issues starting with the blast job.

Could you please check the blast/output folder in your output directory and see whether it contains any split000*.Out files (where * is a number)? If you don’t see any output files, it means that your blast job did not run. In order to run the pipeline, you need to have a couple of additional software packages installed and preferably added to the path. Specifically, you will need blast-2.2.13 and fasta-35.1.5. If you do not want to add them to the path, you can add the path to their location in the env.sh file, which you can find in the bin folder of PseudoPipe.
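
If it is easier to check from a script, something like the following Python sketch lists those files. The function name is hypothetical, and the only layout it assumes is the blast/output folder and the split000*.Out naming mentioned above.

```python
import glob
import os


def blast_output_files(output_dir):
    """Return the split000*.Out files under <output_dir>/blast/output, sorted."""
    pattern = os.path.join(output_dir, "blast", "output", "split000*.Out")
    return sorted(glob.glob(pattern))
```

An empty list for your output directory (the first argument given to pseudopipe.sh) means the blast step produced nothing for the later stages to work on.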

This should allow you to run the pipeline without any issues.
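
As a side note on the TypeError in your log: Python raises that message whenever a %-style template receives more values than it has placeholders, for example when the template contains no %s at all. The short sketch below only demonstrates the generic error. The function name and the template strings are illustrative, and it is an assumption, not a diagnosis, that the exon-location argument reaching filterEnsemblGene.py is involved.

```python
def fill_template(template, chromosome):
    """Apply printf-style formatting, as 'template % chr' does."""
    return template % chromosome


# A template with a %s placeholder is filled in as expected.
print(fill_template("chr%s_exLocs", "I"))

# A template without a placeholder cannot consume the value and fails.
try:
    fill_template("chrI_exLocs", "I")
except TypeError as err:
    print("TypeError:", err)
```

If the template that the script actually receives turns out to lack the %s, checking how the last path argument is passed on the command line would be a reasonable next step.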

Question regarding list of human pseudogenes

Q:
I am … developing an application that matches cancer patients to treatment based on the person’s genetic profile. We are looking for an updated list of human pseudogenes to use in evaluating submitted DNA variants. Can you tell me if the Pseudo Fam data files at the pseudogene.org website are still being updated? If not, perhaps you could recommend an alternate source?

A:
It is best to get an updated list of pseudogenes from pseudogene.org, which is continually updated, i.e., http://pseudogene.org/Human/. Yucheng