Hi-C contact matrix from PsychEncode

Q:
I am trying to explore Hi-C contact matrices at 10kb and 40kb resolution from here http://resource.psychencode.org/. However, I could not find any document showing the genomic coordinates of bins for both columns and rows (two matrices with the size of 303642 x 13554, 75919 x 2259).

A:
This issue is resolved now. The decompression didn’t work properly, so re-downloading the data solved the problem.
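
For readers who hit a similar problem, a minimal sketch of a check that a downloaded archive decompresses completely is given here. It assumes the file is gzip-compressed, which the thread does not state; the function name and the example path are hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses without error.

    Assumption: the download is a gzip file. A truncated or corrupted
    download typically fails partway through with EOFError, OSError
    (including gzip.BadGzipFile), or zlib.error. A missing file also
    returns False, because FileNotFoundError is a subclass of OSError.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so large matrices do not need to fit in memory.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True


# Hypothetical usage (the file name is a placeholder, not a real resource name):
# if not gzip_is_intact("contact_matrix_10kb.txt.gz"):
#     print("Archive looks incomplete; try downloading it again.")
```

If the check fails, re-downloading the file, as described in the answer, is the first thing to try.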

Full set of tQTLs and isoQTLs from Wang et al. 2018

Q:
We have made great use of the publicly available PEC resources on http://resource.psychencode.org/, in particular the QTL data. However, I have not been able to locate the full set of isoQTLs and tQTLs without any p-value/FDR filtering, as is available for eQTLs. Is there somewhere I can access this easily? Or does access to the full set of tQTLs and isoQTLs require an application to Synapse?

A:
Currently we don’t provide access to the full set. The full set is very large and we need to discuss where we should share these data. I will let you know once we have any updates.

Data access in PsychENCODE repository

Q:
Would you be able to point me to the repository where all these data are stored? I am looking into the PsychENCODE repository in Synapse (https://www.synapse.org/#!Synapse:syn5553626), but it’s not clear whether all the data presented in the publications are included there and, if so, whether they are grouped into one folder. We are particularly interested in the bulk and single-cell RNA-seq data for now.

A:
We have recently created a portal for easier access to the data generated through the PEC. Please see the [SingleCellRNAseq study](https://psychencode.synapse.org/Explore/Studies/DetailsPage?study=syn7067037), which this data came from. Note the link under the study description for the single-cell data used in Wang et al.

Data access approvals are handled by the NIMH through the NIMH Repository and Genomics Resources. Instructions are on the study page. If you do not have access and have questions about the process, let me know.

List of 321 high confidence SCZ-associated genes from Wang et al. 2018

Q:
I read your excellent work in Wang et al. 2018, and am wondering whether you could kindly share the list of 321 high confidence SCZ-associated genes. We are studying SCZ iPSC-derived interneurons and this information would be helpful for us to understand which DE gene may be causal in our system.

A:
It should be at: http://resource.psychencode.org

Questions regarding eQTL calls

Q:
I am trying to reproduce the eQTL calls published here with file name: Full_hg19_cis-eQTL. I’m having some difficulty reproducing the eQTL calls, in particular the p-values, and want to figure out where my pipeline isn’t matching.

1) I am unsure about the earth selection process applied to the covariate superset. Currently, we are trying to reproduce the covariate selection using the one-hot encoded covariate superset mentioned in the supplementary material (page 7) of this publication. We are curious which covariates were selected (e.g., brain bank covariates include multiple institutes; were all of them selected, or just some of them?).

2) We are unsure which GTEx pipeline for eQTL calls was employed by the publication. We are currently using the GTEx pipeline mentioned here, but are wondering whether the paper used an older version of the GTEx pipeline that was previously available.

3) Another question is which datasets were fed into the eQTL calls. We are currently working with the Capstone genotype datasets and the TPM expression matrix published here with file name: DER-02_PEC_Gene_expression_matrix_TPM. We are wondering whether the genotype/expression filtering was done directly on these files.

4) The last question: when we call eQTLs using FastQTL, the nominal p-values (that have passed FDR < 0.05) are much larger than the p-values your study published here with the file name: DER-08a_hg19_eQTL.significant (so it looks like we’re incredibly underpowered). I’ve attached a figure to illustrate the nominal p-values reported in your files versus those computed by us. We have used the Capstone genotypes and expression files (as described above), and though we should be somewhat underpowered relative to your study (because we are missing the GTEx genotypes/expression files, which need separate agreements), I’m not sure that accounts for the difference in p-value magnitudes. Do you have any thoughts on which part of the pipeline we may have implemented incorrectly that could lead to such a large difference?

A:
Here are some responses to your questions.

I am unsure about the earth selection process applied to the covariate superset. Currently, we are trying to reproduce the covariate selection using the one-hot encoded covariate superset mentioned in the supplementary material (page 7) of this publication. We are curious which covariates were selected (e.g., brain bank covariates include multiple institutes; were all of them selected, or just some of them?).
-> Here are the covariates we used. You can also find the description in the supplemental materials of our paper (http://papers.gersteinlab.org/papers/capstone4/index.html):

Top 3 genotyping principal components
Probabilistic Estimation of Expression Residuals (PEER) factors
Genotyping array platform
Gender
Disease status
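
As an illustration only, the sketch below shows how covariates of the five kinds listed above might be assembled into one numeric matrix, with the categorical ones one-hot encoded against a reference level. It is not our pipeline: the sample records, platform names, number of PEER factors, and reference levels are all hypothetical.

```python
# Hypothetical sample records; every name and value here is made up.
samples = [
    {"id": "S1", "pcs": [0.01, -0.02, 0.03], "peer": [0.5, -0.1],
     "platform": "arrayA", "sex": "F", "status": "control"},
    {"id": "S2", "pcs": [-0.04, 0.00, 0.01], "peer": [-0.3, 0.2],
     "platform": "arrayB", "sex": "M", "status": "SCZ"},
    {"id": "S3", "pcs": [0.02, 0.05, -0.01], "peer": [0.1, 0.4],
     "platform": "arrayA", "sex": "M", "status": "control"},
]

# Assumed reference level for each categorical covariate (dropped from the
# encoding so the indicator columns are not collinear with an intercept).
REFERENCE = {"platform": "arrayA", "sex": "F", "status": "control"}


def one_hot(values, prefix, reference):
    """Return (column names, rows) of 0/1 indicators for non-reference levels."""
    levels = sorted(set(values) - {reference})
    names = [f"{prefix}_{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return names, rows


def build_covariate_matrix(samples):
    """Concatenate genotype PCs, PEER factors, and one-hot categorical columns."""
    names = [f"PC{i + 1}" for i in range(len(samples[0]["pcs"]))]
    names += [f"PEER{i + 1}" for i in range(len(samples[0]["peer"]))]
    rows = [list(s["pcs"]) + list(s["peer"]) for s in samples]
    for key, reference in REFERENCE.items():
        col_names, col_rows = one_hot([s[key] for s in samples], key, reference)
        names += col_names
        for row, extra in zip(rows, col_rows):
            row.extend(extra)
    return names, rows


names, matrix = build_covariate_matrix(samples)
print(names)
```

Each row of `matrix` corresponds to one sample, in the same order as `samples`.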

We are unsure which GTEx pipeline for eQTL calls was employed by the publication. We are currently using the GTEx pipeline mentioned here, but are wondering whether the paper used an older version of the GTEx pipeline that was previously available.
-> A detailed description of our eQTL pipeline can be found in Fig. S31 of our paper (http://papers.gersteinlab.org/papers/capstone4/index.html).

Another question is which datasets were fed into the eQTL calls. We are currently working with the Capstone genotype datasets and the TPM expression matrix published here with file name: DER-02_PEC_Gene_expression_matrix_TPM. We are wondering whether the genotype/expression filtering was done directly on these files.
-> You can find details in Fig. S31 of our paper (http://papers.gersteinlab.org/papers/capstone4/index.html).

The last question: when we call eQTLs using FastQTL, the nominal p-values (that have passed FDR < 0.05) are much larger than the p-values your study published here with the file name: DER-08a_hg19_eQTL.significant (so it looks like we’re incredibly underpowered). I’ve attached a figure to illustrate the nominal p-values reported in your files versus those computed by us. We have used the Capstone genotypes and expression files (as described above), and though we should be somewhat underpowered relative to your study (because we are missing the GTEx genotypes/expression files, which need separate agreements), I’m not sure that accounts for the difference in p-value magnitudes. Do you have any thoughts on which part of the pipeline we may have implemented incorrectly that could lead to such a large difference?
-> I am not sure which genotype file you are using, but we cannot share the merged genotype file because we integrated some GTEx samples into it. We are also using different covariates. Your results will differ from ours if the genotype, phenotype, and covariate inputs are not the same.

Coordinates of TADs and enhancer-promoter pairs from the PsychENCODE dataset

Q:
I am developing a pipeline to analyze the Hi-C data from the PsychENCODE project. As a sanity check, I want to map the enhancer-transcription start site (TSS) pairs from the file http://resource.psychencode.org/Datasets/Integrative/INT-16_HiC_EP_linkages.csv to the TADs inferred by the PsychENCODE project in the file http://resource.psychencode.org/Datasets/Derived/DER-18_TAD_adultbrain.bed.

Looking at the enhancer and TSS, the TSS have very "round" coordinates (e.g. 90000, 630000, etc). Just to confirm, those are still genomic coordinates, right?

Also, are the coordinates of the TADs genomic coordinates, or Hi-C bins? I assumed that was the case, but could not find any of the enhancer-TSS pairs in the same TAD, which is what I expected.

A:
RE your questions:
Looking at the enhancer and TSS, the TSS have very "round" coordinates (e.g. 90000, 630000, etc). Just to confirm, those are still genomic coordinates, right?
-> Yes. The TSS coordinates are given at the Hi-C resolution (10 kb bins), not as the actual TSS positions. You can therefore overlap these coordinates with the actual promoter coordinates to link genes to enhancers.

Also, are the coordinates of the TADs genomic coordinates, or Hi-C bins? I assumed that was the case, but could not find any of the enhancer-TSS pairs in the same TAD, which is what I expected.
-> TAD coordinates should also be genomic coordinates, not Hi-C bins. It’s odd that you didn’t find enhancer-TSS pairs in the same TAD, because we found that >70% of E-P links are located within TADs.
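
A minimal sketch of the sanity check discussed above is given here, not as our own code but as one way it could be done. It assumes that the "round" TSS coordinate is the start of a 10 kb bin, that TAD intervals are half-open (start inclusive, end exclusive) and do not overlap on a chromosome, and that a pair counts as "within a TAD" when the enhancer midpoint and the bin midpoint fall in the same interval. All function names and coordinates are hypothetical, and no column layout of the two files is assumed.

```python
from bisect import bisect_right

BIN_SIZE = 10_000  # assumed Hi-C resolution of the E-P linkage file (10 kb)


def tss_to_bin_start(tss, bin_size=BIN_SIZE):
    """Round an actual TSS position down to the start of its Hi-C bin."""
    return (tss // bin_size) * bin_size


def build_tad_index(tads):
    """Map chromosome -> (sorted starts, sorted (start, end) intervals)."""
    by_chrom = {}
    for chrom, start, end in tads:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([start for start, _ in intervals], intervals)
    return index


def containing_tad(index, chrom, pos):
    """Return the TAD (start, end) with start <= pos < end, or None."""
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    i = bisect_right(starts, pos) - 1  # last TAD starting at or before pos
    if i >= 0 and pos < intervals[i][1]:
        return intervals[i]
    return None


def same_tad(index, chrom, enh_start, enh_end, tss_bin_start, bin_size=BIN_SIZE):
    """True if the enhancer midpoint and the TSS bin midpoint share a TAD."""
    enh_tad = containing_tad(index, chrom, (enh_start + enh_end) // 2)
    tss_tad = containing_tad(index, chrom, tss_bin_start + bin_size // 2)
    return enh_tad is not None and enh_tad == tss_tad


# Hypothetical coordinates, for illustration only.
tads = [("chr1", 0, 500_000), ("chr1", 500_000, 1_200_000)]
index = build_tad_index(tads)
print(tss_to_bin_start(94_321))                            # 90000
print(same_tad(index, "chr1", 120_300, 121_100, 90_000))   # True
print(same_tad(index, "chr1", 480_000, 481_000, 630_000))  # False
```

If no pair at all lands in the same TAD, one thing worth checking (a guess, not a diagnosis) is whether chromosome names and genome builds match between the two files, since a lookup on "1" will never hit an index keyed by "chr1".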

Question about the cQTL analysis in Wang et al 2018

Q:
I am writing with a question about the cQTL analysis in Wang et al 2018. Were the 292 individuals analyzed in this analysis all of European ancestry? If not, what were the sample sizes for European vs non-European ancestry, and how did you control for ancestry in your analysis?

I apologize for writing with such a detailed question, but I could not find the answer in the main text or supplement of the paper, or on the Synapse website. (Context: I am interested in cross-population genetic analyses of psychiatric disease and wondering if PsychENCODE cQTL data is relevant.)

A:
In calculating the cQTLs, we used 173 Caucasians and 119 non-Caucasians. To control for ancestry, we used the top three genotype principal components as covariates.

DTE results as described in the paper “Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder”

Q:
I was trying to reproduce the DTE results as described in the paper "Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder". I am a registered user of Synapse but was unable to find the data mentioned below and would really appreciate your help in obtaining them.
The supplementary methods of this paper mention the different covariates used for carrying out DGE and DTE using the nlme package. Would it be possible to obtain the seqPC and SV values, particularly the seqPCs (1-3, 5-8, 10-14, 16, 18-25, 27-29) and SVs (1-4) used in the lme model?
Additionally, could I obtain the final list of sample IDs that made it into the DGE/DTE analysis?

A:
See the seqPCs we used in our analysis (attached).

Inquiry regarding PsychENCODE Datasets

Q:
We are trying to replicate some results using the bulk RNA-seq datasets available from the PsychENCODE consortium. We currently have access to the transcript RSEM count data from reads aligned to hg19. We were wondering if the same data was available for reads aligned to hg38 and if so, how we could access that data?

A:
Sorry, we currently don’t have the transcript RSEM count data from reads aligned to hg38.