List of 321 high confidence SCZ-associated genes from Wang et al. 2018

Posted on July 28, 2020 by gersteinfaq

Q:
I read your excellent work in Wang et al. 2018, and am wondering whether you could kindly share the list of 321 high confidence SCZ-associated genes. We are studying SCZ iPSC-derived interneurons and this information would be helpful for us to understand which DE gene may be causal in our system.

A:
It should be at: http://resource.psychencode.org

Request for data from Zhang and Gerstein NAR (2003)

Posted on July 28, 2020 by gersteinfaq

Q:
I recently came across your paper, "Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes."

I’m interested in the substitution rates in human pseudogenes. Figure 2A from your paper (pasted below) plots these rates. Would you be able to send me these rates as a table?

Additionally, has your group calculated the substitution rates for more families of pseudogenes? (The NAR 2003 paper only analyzed ribosomal protein pseudogenes sequences.) I tried poking around psiDR, but wasn’t not able to find this type of information readily available.

These substitution rate matrices would be very helpful for my research.

A:
see: http://www.pseudogene.org/indel-nar
(via http://papers.gersteinlab.org/papers/indel-nar)

Full set of tQTLs and isoQTLs from Wang et al. 2018

Posted on July 28, 2020 by gersteinfaq

Q:
As a lab, our general interests lie in the intersection between transcriptomics, neurogenetics, and genetic diagnosis. As such, we have made great use of the publicly available PEC resources on https://nam05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fresource.psychencode.org%2F&data=02%7C01%7Cshuang.liu%40yale.edu%7Caa9d9436ceb6478ec71208d8142dea62%7Cdd8cbebb21394df8b4114e3e87abeb5c%7C0%7C0%7C637281534720419443&sdata=eu4DyEY%2BNUJuueEbj4YWFeWfOYoao6j%2B%2F1rqyq1DSUc%3D&reserved=0, in particular the QTL data. However, I have not been able to locate the full set of isoQTLs and tQTLs without any p-value/FDR filtering, as is available for eQTLs. Is there somewhere I can access this easily? Or does access to the full set of tQTLs and isoQTLs require an application to Synapse?

A:
Currently we don’t provide access to the full set. The full set is very large and we need to discuss where we should share these data. I will let you know once we have any updates.

Questions regarding eqtl calls

Posted on July 28, 2020 by gersteinfaq

Q:
I am trying to reproduce the eQTL calls published here with file name: Full_hg19_cis-eQTL. I’m having some difficulty reproducing the eQTL calls and in particular the P-values, and wanted to figure out where my pipeline isn’t matching.

1) I am unsure of the earth selection process on the super covariates sets. Currently, we try to reproduce the covariates selection using one hot matrix encoded covariates superset mentioned in the supplementary material (page 7) of this publication . We are curious on what covariates are selected (e.g.: brain bank covariates include multiple institutes, are all of them selected, or just some of them?).

2) We are unsure on which GTEx pipeline for EQTL calls were employed by the publication. We are currently using the GTEx pipeline mentioned here, but am wondering if the paper uses an older version of the GTEx pipeline that was previously available?

3) Another question is which datasets are fed into the eqtl calls? We are currently working with the capstone genotype datasets and TPM expression matrix published here with file name: DER-02_PEC_Gene_expression_matrix_TPM. We are wondering if the Genotype/Expression filtering were done directly on these files?

4) The last question is when we call eqtl using FastQTL, the nominal p-values (that have passed FDR < 0.05) are much larger compared to the p values your study published here with the file name: DER-08a_hg19_eQTL.significant (so it looks like we’re incredibly underpowered). I’ve attached a figure to illustrate the nominal p values reported in your files versus computed by us. We have used the Capstone genotypes and expression files (as described above), and though we should be somewhat underpowered relative to your study (because we are missing the GTEx genotypes/expression files, which need separate agreements), I’m not sure it accounts for the difference in p value magnitudes. I was wondering if you have any thoughts on which part of the pipelines we may have implemented incorrectly that could lead to such a huge difference?

A:
Here are some responses to your questions.

I am unsure of the earth selection process on the super covariates sets. Currently, we try to reproduce the covariates selection using one hot matrix encoded covariates superset mentioned in the supplementary material (page 7) of this publication . We are curious on what covariates are selected (e.g.: brain bank covariates include multiple institutes, are all of them selected, or just some of them?).
Here are the covariates we are using, you can also find the description in supplemental materials in our paper (http://papers.gersteinlab.org/papers/capstone4/index.html):

Top 3 genotyping principal components
Probabilistic Estimation of Expression Residuals (PEER) factors
Genotyping array platform
Gender
Disease status

We are unsure on which GTEx pipeline for EQTL calls were employed by the publication. We are currently using the GTEx pipeline mentioned here, but am wondering if the paper uses an older version of the GTEx pipeline that was previously available?
The detailed description of our eQTL pipeline could be found in Fig. S31 in our paper http://papers.gersteinlab.org/papers/capstone4/index.html.

Another question is which datasets are fed into the eqtl calls? We are currently working with the capstone genotype datasets and TPM expression matrix published here with file name: DER-02_PEC_Gene_expression_matrix_TPM. We are wondering if the Genotype/Expression filtering were done directly on these files?
You can find details in Fig. S31 in our paper http://papers.gersteinlab.org/papers/capstone4/index.html.

The last question is when we call eqtl using FastQTL, the nominal p-values (that have passed FDR < 0.05) are much larger compared to the p values your study published here with the file name: DER-08a_hg19_eQTL.significant (so it looks like we’re incredibly underpowered). I’ve attached a figure to illustrate the nominal p values reported in your files versus computed by us. We have used the Capstone genotypes and expression files (as described above), and though we should be somewhat underpowered relative to your study (because we are missing the GTEx genotypes/expression files, which need separate agreements), I’m not sure it accounts for the difference in p value magnitudes. I was wondering if you have any thoughts on which part of the pipelines we may have implemented incorrectly that could lead to such a huge difference?
I am not sure which genotype file you are using. But we cannot share the merged genotype file since we integrated some GTEx samples in the file. We are also using different covariates. So your results will be different from ours if the genotype, phenotype and covariates inputs are not the same.

Yale Morph Server job not being completed

Posted on July 28, 2020 by gersteinfaq

Q:
I uploaded my PDB files on the multi-chain morph server, the job ID is b499364-832. It has been two days but the results page still says that the job is not yet complete. Is there any problem with my morph? I understand that my files are very large so it may take a long time to finish, but is it possible that it takes this long? I would really appreciate your assistance.

A:
Files generated by the multi-chain server can be found in
http://www.molmovdb.org/uploads/<job ID>/
Yours are in
http://www.molmovdb.org/uploads/b499364-832/

Looking at it, some files seem to be missing.
For e.g., a complete run would look like
http://www.molmovdb.org/uploads/b649592-16743/

There is a job running and it seems to be related to yours
"/usr/bin/perl ./multi.pl b499364-832 –chains=WCBAXHGFEDJLMNOPQRYAIS
–nframes=8 –email=seiga –engine=CNS –debug"

The short of it is that your job is probably stuck since you seem to
have submitted it 6 days ago judging from the submit time. Note that
we cannot guarantee the full functionality of this service as it is
from 2005 and has not been fully maintained since. Occasionally, we
may roll back but that would mean you will need to resubmit your job.

Coordinates of TADs and enhancer-promoters pairs from the PsychEncode dataser

Posted on July 20, 2020 by gersteinfaq

Q:
I am developing a pipeline to analyze the Hi-C data from the PsychEncode project. As a sanity check, I want to map the enhancer-Transcription start sites (TSS) pairs from the file http://resource.psychencode.org/Datasets/Integrative/INT-16_HiC_EP_linkages.csv to the TADs inferred by the Psychencode project in the file http://resource.psychencode.org/Datasets/Derived/DER-18_TAD_adultbrain.bed.

Looking at the enhancer and TSS, the TSS have very "round" coordinates (e.g. 90000, 630000, etc). Just to confirm, those are still genomic coordinates, right?

Also, are the coordinates of the TADs genomic coordinates, or Hi-C bins? I assumed that was the case, but could not find any of the enhancer-TSS pairs in the same TAD, which is what I expected.

A:
RE your questions:
Looking at the enhancer and TSS, the TSS have very "round" coordinates (e.g. 90000, 630000, etc). Just to confirm, those are still genomic coordinates, right?
-> Yes. I used the resolution for Hi-C (in 10kb resolution), not the actual TSS. So you can simply overlap the TSS coordinates with the actual promoter coordinates to link genes to enhancers.

Also, are the coordinates of the TADs genomic coordinates, or Hi-C bins? I assumed that was the case, but could not find any of the enhancer-TSS pairs in the same TAD, which is what I expected.
-> TAD coordinates should also be the genomic coordinates, not Hi-C bins. It’s odd that you didn’t find enhancer-TSS pairs in the same TAD because we found >70% of E-P links are located within TADs..

Coevolution source code from download link is out of date

Posted on July 20, 2020 by gersteinfaq

Q1:
As I was trying to run coevolution locally, the downloaded source code (http://coevolution.gersteinlab.org/coevolution/dist/coevolution.jar) was out of date (at least URLs of pfam and rcsb pdb). Although I corrected those URLs and recompiled the project, there were still some bugs as shown below. I am now struggling to fix them. Meanwhile, could you please help me for the problem? I will appreciate very much.

Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java]
[java]
[java] org.gersteinlab.coevolution.core.data.DataFormatException: Cannot find the separator between the ID and the positions.
[java] at org.gersteinlab.coevolution.core.data.PfamFormatUtil.parseId(PfamFormatUtil.java:34)
[java] at org.gersteinlab.coevolution.core.data.PfamFastaProteinSequence.<init>(PfamFastaProteinSequence.java:61)
[java] at org.gersteinlab.coevolution.core.io.PfamFastaProteinMsaReader.readNextSequence(PfamFastaProteinMsaReader.java:55)
[java] at org.gersteinlab.coevolution.core.io.MsaReader.readMsa(MsaReader.java:75)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.init(MsaFilter.java:128)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:387)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

Also — It seems that the treeURL is not correct.
All necessary URLs are modified as below:

URL pdbURL = new URL("https://files.rcsb.org/download/" + pdbID.toUpperCase() + ".pdb.gz");
URL msaURL = new URL("http://pfam.xfam.org/family/alignment/download/format?alnType=" + (seedOnly ?"seed" :"full") + "&format=fasta&order=t&gaps=default&download=downloadD&acc=" + pfamID.toUpperCase());
URL msaURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/alignment/" + (seedOnly ?"seed" :"full") + "/gzipped");
URL treeURL = new URL("https://pfam.xfam.org/family/" + pfamID.toUpperCase() + "/tree/download");

Updated Exceptions:
Buildfile: /home/xiety/software/coevolutiontool/build.xml

intra:
[java] Protein list: intraProteins.txt
[java] Data directory: data/intra
[java] Result directory: results/intra
[java] Download MSAs? true
[java] Download structures? true
[java] Compute residue distances? false
[java] Align PDB and MSA sequences? true
[java] Compute coevolution scores? true
[java] Compute shuffled coevolution scores? false
[java] Plot coevolution scores? false
[java] Analyze coevolution scores? true
[java] Terminate execution on error? false
[java] Alignment methods: [Pfam]
[java] Downloading PDB file for 1C3W… Done.
[java] Downloading Pfam MSA file for PF01036… Done.
[java] Downloading Pfam tree file for PF01036… Done.
[java] Aligning the sequences of 1C3W and BACR_HALSA in PF01036… Done.
[java] Sequence filtering options (for coevolution score computation, plotting and analysis)
[java] Maximum fraction of gaps per sequence: 1.0
[java] Maximum sequence similarity: 0.9
[java] Minimum number of sequences: 50
[java] Maximum number of sequences: 500
[java] Site filtering options (for coevolution score plotting and analysis)
[java] Maximum fraction of gaps per site: 0.1
[java] Maximum fraction of sequences having the same character: 1.0
[java] Site filtering options specific to intra-protein analysis
[java] Minimum site separation: 3
[java] Maximum fraction of sequences having connected gaps at a site pair: 0.1
[java] Performing sequence filtering of the MSA of PF01036 from Pfam…
[java]
[java]
[java] java.lang.IllegalArgumentException: Node [A0A1S9DF11_ASPOZ/53-285] cannot be found in the tree.
[java] at org.gersteinlab.coevolution.core.data.NewickTree.removeNode(NewickTree.java:112)
[java] at org.gersteinlab.coevolution.core.tasks.MsaFilter.filterSequences(MsaFilter.java:214)
[java] at org.gersteinlab.coevolution.intra.Main.start(Main.java:391)
[java] at org.gersteinlab.coevolution.intra.Main.main(Main.java:754)

A1:
Please check whether the tree file can be downloaded. If not, I think the problem can be easily fixed by changing the URL in the fourth line. This is likely caused by a change of the Pfam web site.

If the file can be downloaded but the error still exists, then it is likely more related to the ID of each sequence in the different file. Please check whether the ID "A0A1S9DF11_ASPOZ/53-285" can be found in the tree file.

Q2:
I have check the tree file and the ID "A0A1S9DF11_ASPOZ/53-285" dose not exist. Is the tree file right?

data/intra/PF01036.tree:
(((BACS2_HALSA/5-220:0.72860,(BACS2_HALMA/5-224:0.61512,BACS2_NATPH/5-223:0.59067)0.650:0.09815)0.820:0.08903,(C7P1Y4_HALMD/5-221:1.18066,D3SUL9_NATMM/5-219:1.46984)0.970:0.58374)0.700:0.08760,(BACH_NATPH/35-274:1.24503,(BACR_HALAR/8-238:0.28604,(BACR_HALSA/23-247:0.30384,BACR1_HALC1/22-246:0.37112)0.960:0.20210)0.830:0.21189)0.960:0.32269,(B6BSG6_9PROT/34-253:1.66008,((B5RTR5_DEBHA/38-284:0.28614,(A3LUH9_PICST/37-279:0.24739,(C4YF64_CANAW/43-284:0.53146,B9W6Y7_CANDC/40-281:0.13058)1.000:0.31032)0.820:0.12498)0.910:0.32193,(C5E3Q5_LACTC/36-281:0.56486,C5DYF7_ZYGRC/38-283:0.47362)0.810:0.26951)1.000:1.78728)0.700:0.15130);

A2:
Then the MSA file and tree file from Pfam do not match. I am not sure why it happens. Maybe one is seed alignment and the other is full alignment?

A3:
Yes, you are right. The tree file is seed alignment but the MSA file is full alignment. There is only one URL for the tree file in the pfam web site (https://pfam.xfam.org/family/PF01036#tabview=tab5).

If both files are seed alignment, then the project will give exception "Not enough sequences". The default value of minSeqCount in intra.config file is 50 but only 16 sequences are left.
I think about two ways to fix the problem. One is setting smaller value of minSeqCount. The other one is localizing the method of generating the tree file (FastTree) and then generate full alignment formatted tree file.
Actually, I don’t know the difference of full alignment and seed alignment for the project.

If you think the 16 seed sequences are enough, you can bypass the minimum threshold. On the web site, you can find that option by clicking the "Show advanced options" link at the bottom.

But if you need more sequences, then either you produce the tree by yourself, or use a method that does not require the tree.

Gerstein Lab FAQs

Frequently Asked Questions

Monthly Archives: July 2020

List of 321 high confidence SCZ-associated genes from Wang et al. 2018

Request for data from Zhang and Gerstein NAR (2003)

Full set of tQTLs and isoQTLs from Wang et al. 2018

Questions regarding eqtl calls

Yale Morph Server job not being completed

Coordinates of TADs and enhancer-promoters pairs from the PsychEncode dataser

Coevolution source code from download link is out of date