Regulatory Genetic Network and DSPN

Q:
I am studying your publication in Science (Comprehensive functional genomic resource and integrative model for the human brain, Science 362, 1266 (2018)) with great interest. As a quantitative geneticist, I found it very relevant to the study of complex genetic traits. I am therefore writing to request your software/algorithm for Regulatory Genetic Network modeling and the integrative deep learning model (DSPN), so that we can implement them on the NIH supercomputer system and conduct integrative genomic modeling work in the area of brain/neuropsychiatry.

A:
It is best to see resource.psychencode.org. Specifically, you can find the MATLAB code under "7. Matlab code and formatted data for the DSPN" at http://resource.psychencode.org/

Question regarding RNA-seq data uploaded to “Synapse”

Q:
I was referred to you by Michael Gandal for a question I have regarding your RNA-seq data from the fascinating shared article "Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder".

I know you’ve uploaded the TPM data to the PsychENCODE website. Could you tell me whether the data in this file are normalized: DER-02_PEC_Gene_expression_matrix_TPM?

A:
We didn’t run any quantile normalization on this file.

MS data in the PsychENCODE datasets

Q1:
I recently met you at LMB where you gave a wonderful talk on PsychENCODE data analysis.

You mentioned that there were MS datasets in PsychENCODE, but I am unable to find them. Is it possible for you to point me to the MS data in the PsychENCODE datasets, or to someone who may know about this?

A1:
Could you please explain a little more about what dataset you need?

Q2:
I am looking for Mass Spec datasets in PsychENCODE. Mark mentioned that MS analyses were done for some samples. I wonder whether you could help me identify them?

A2:
I just checked with our DCC team and currently we don’t have any Mass Spec data available for public sharing.

Q3:
What is the DCC team? I was given to believe from the publications that these data were available along with the others for analysis; I would not have asked otherwise. Is there a way I can reach out to any group within your DCC team that has these data, to see whether I can formally collaborate with them? Can you kindly let me know who may be the best person to ask for details of the group that may have the MS datasets? I am looking for MS data (even if it is published) from any of the samples that were used in the PsychENCODE project.
I am willing to collaborate and share authorship with the scientists who generated these datasets.
Would it be possible for you to point me to anyone you know who may have this dataset (published or unpublished)?

A3:
I have contacted the group that is generating the Mass Spec data. Are you specifically interested in proteomics related to donors with neuropsychiatric disorders? We (Sage Bionetworks) also function as the data coordination center for the NIA-funded Accelerating Medicines Partnership – Alzheimer’s Disease (AMP-AD). There are a variety of studies in AMP-AD with Mass Spec proteomics on post-mortem brain tissue that also have other genomic data such as WGS and RNAseq. Included in that is the Religious Orders Study and Memory and Aging Project (ROS/MAP) from the Rush Alzheimer’s Disease Center. See here for information on the cohorts. There will be TMT-labeled MS on ~400 ROS/MAP donors released this fall.

Q4:
Thank you for getting in touch with me, and thank you for your pointer. Indeed, we will be interested in the Alzheimer’s samples (all three data types: WGS, RNAseq, and proteomics).
I will write a separate note to you on this.
At the moment, we are looking for MS samples from donors with neuropsychiatric disorders.

A4:
Actually, my lab is doing something very similar as well: validating novel ORFs identified from our third-generation sequencing and riboseq data.
If you use approaches that we have not yet used, or have goals beyond just validating ORFs in brain, I will be happy to collaborate.
I have two students/collaborators on this.

Q5:
Is it possible for me to make a quick call?

A5:
…(resolved via phone call on Jul 9, 2019)…

PsychENCODE GRN questions

Q1:
I had a few questions about the Gene Regulatory Networks published as part of the Comprehensive functional genomic resource and integrative model for the human brain at http://resource.psychencode.org/. Could you pass these along to whomever is best suited to address them?

First question: Which reference genome is used?

The GRN has the following format:

Transcription_Factor,Target_Gene,Enhancer_Region,Edge_Weight

And most rows look like:

BARHL2,SHC1,chr1:154869072-154870071,0.284806116416629

however, some rows just have "Promoter" in the Enhancer_Region column, like this one:

NR2F2,SHC1,Promoter,0.120934147846037

But since NR2F2 (and most other genes) has a couple of different reference haplotypes in both RefSeq and GENCODE (e.g. see NR2F2 in the UCSC Genome Browser), it’s ambiguous to me which location "Promoter" designates.

Does there exist a version of the GRN with chromosomal coordinates substituted for "Promoter", or would you mind sending a reference to the haplotype you used as reference when building this GRN?

To summarize above: what reference genome did you use in constructing the GRN? What region does "Promoter" evaluate to?

A1:
We defined the promoter regions as a window of ±1.25 kb (2.5 kb in total) around the transcription start site (TSS) on hg19.

Q2:
Could you send the hg19 reference genome you’re referring to?

If I go to the UCSC browser and look at refseq hg19, for some arbitrary gene: [[see image]]

The gene has multiple reference isoforms. Where does your GRN situate the promoter for this gene? i.e. which chromosomal location does the ChIP track you integrated in your GRN identify the TF at? Chromosomal coordinates would be less ambiguous than stating the TF binds the promoter. Would the production of such a network be possible, or would you be able to send us a reference genome you used with a single location for each promoter (i.e. a single tss)? How did you choose the ‘canonical’ isoform for each gene? What about the promoters upstream of the other tss’s — is there evidence of regulation at those alternate promoters?

Any chance you might be able to resolve this for us? It seems to limit the utility of this network to have this ambiguity about the chromosomal location of these transcriptional regulatory events. It would be a shame not to resolve it, I think.

A2:
I have added the promoter TSS file to our website at: http://resource.psychencode.org/Datasets/Integrative/tss.sites.codingOnly.gencode.v19.annotation.bed

It can be found at resource.psychencode.org by navigating to the section on "Integrative Analysis", and scrolling to item 3.
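
As a rough illustration of how a "Promoter" entry might be mapped to coordinates under the ±1.25 kb definition given in A1, the sketch below parses GRN rows and substitutes a window around a TSS. The TSS position, the symmetric strand-agnostic window, the 1-based inclusive coordinates, and the function names are all assumptions made for illustration. Real positions would have to come from the TSS file linked above, whose column layout is not described here.

```python
import csv
import io

PROMOTER_HALF_WIDTH = 1250  # +/-1.25 kb around the TSS, per answer A1


def promoter_window(chrom, tss, half_width=PROMOTER_HALF_WIDTH):
    """Return a 'chrom:start-end' string spanning tss +/- half_width.

    Assumption: the window is symmetric and clipped at position 1.
    """
    start = max(1, tss - half_width)
    end = tss + half_width
    return f"{chrom}:{start}-{end}"


def resolve_grn_rows(grn_text, tss_lookup):
    """Replace 'Promoter' in the Enhancer_Region column with coordinates.

    tss_lookup maps a target gene symbol to (chrom, tss_position).
    Rows whose target gene is missing from the lookup are left unchanged.
    """
    resolved = []
    for tf, target, region, weight in csv.reader(io.StringIO(grn_text)):
        if region == "Promoter" and target in tss_lookup:
            chrom, tss = tss_lookup[target]
            region = promoter_window(chrom, tss)
        resolved.append((tf, target, region, float(weight)))
    return resolved


# The two example rows quoted in Q1.
grn_text = (
    "BARHL2,SHC1,chr1:154869072-154870071,0.284806116416629\n"
    "NR2F2,SHC1,Promoter,0.120934147846037\n"
)

# Hypothetical TSS position, NOT taken from the real annotation file.
tss_lookup = {"SHC1": ("chr1", 1_000_000)}

rows = resolve_grn_rows(grn_text, tss_lookup)
for row in rows:
    print(row)
```

With a lookup built from the actual TSS file, the same function would leave enhancer rows untouched and rewrite only the "Promoter" rows.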

Bulk Tissue Deconv. Cell Fractions

Q:
I would like to apply the bulk-tissue deconvolution algorithm in your recent paper (Wang et al., 2018) using our own single-cell RNA-Seq data and Gandal et al., 2018’s bulk tissue RNA-Seq. I couldn’t find code related to the deconvolution steps on the Gerstein Lab GitHub page (https://github.com/gersteinlab/PsychENCODE-DSPN) or on the PsychENCODE resources page. I only found the results of the cell fraction calculations. Would you be able to point me towards how I can apply this algorithm?

A:
We used the non-negative least squares method for deconvolution and implemented it using the R function nnls (https://www.rdocumentation.org/packages/lsei/versions/1.2-0/topics/nnls). For example, nnls(C, bi) estimates the cell fractions for the ith tissue sample, where C is the cell type gene expression matrix (rows: genes, columns: cell types) and bi is the gene expression vector for the ith tissue sample.
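
For readers who do not work in R, here is a minimal Python sketch of the same idea: find non-negative coefficients x that minimize ||C x - b||^2. It uses plain cyclic coordinate descent, which is a swapped-in technique chosen for brevity and not the algorithm behind the R nnls function. The function name, the iteration count, and the toy matrix are assumptions for illustration only.

```python
def nnls_coordinate_descent(C, b, n_iter=500):
    """Approximately solve min ||C x - b||^2 subject to x >= 0.

    C is a list of rows (genes x cell types); b has one value per gene.
    Uses cyclic coordinate descent with each coordinate clipped at zero.
    """
    n_genes, n_types = len(C), len(C[0])
    x = [0.0] * n_types
    residual = list(b)  # residual = b - C x; x starts at zero
    col_norms = [
        sum(C[g][j] ** 2 for g in range(n_genes)) for j in range(n_types)
    ]
    for _ in range(n_iter):
        for j in range(n_types):
            if col_norms[j] == 0.0:
                continue  # an all-zero column carries no information
            grad = sum(C[g][j] * residual[g] for g in range(n_genes))
            new_xj = max(0.0, x[j] + grad / col_norms[j])
            delta = new_xj - x[j]
            if delta != 0.0:
                for g in range(n_genes):
                    residual[g] -= delta * C[g][j]
                x[j] = new_xj
    return x


# Toy signature matrix (4 genes x 2 cell types), invented for illustration.
C = [[10.0, 0.0], [8.0, 1.0], [0.0, 12.0], [1.0, 9.0]]
true_fractions = [0.3, 0.7]
# Simulated bulk expression for one sample: b = C * true_fractions.
b = [sum(c * w for c, w in zip(row, true_fractions)) for row in C]

estimated = nnls_coordinate_descent(C, b)
print(estimated)
```

Running this once per tissue sample, with that sample's expression vector as b, mirrors the per-sample nnls(C, bi) call described above.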

Question about deconvolution analysis in PsychENCODE paper

Q:
I have a question about the deconvolution method used in the flagship PsychENCODE paper Comprehensive functional genomic resource and integrative model for the human brain. I would like to perform a similar analysis on my own bulk samples using the single cell expression profiles used in the paper, however it is unclear how these profiles are formed.

Specifically, supplementary file DER-23 lists the cell type fractions for 24 cell types. These coefficients presumably came from solving the following:

B = C * W

Where B is the marker gene by samples matrix, C is the marker gene by cell type matrix, and W is the appropriate weights matrix. How do I go about obtaining or reproducing the 24 cell type profiles? From what I can tell, these profiles were not released along with the other supplemental data sets.

If you could please answer my question or forward this email on to the appropriate author(s), I would appreciate it.

A:
Sorry for the late reply. I think the profiles you want are on resource.psychencode.org.

Requesting information about cQTL and fQTL data from PsychENCODE

Q:
I am writing regarding the datasets posted on the PsychENCODE website. I noticed that full summary statistics for QTL maps are posted for eQTLs and isoQTLs, but cQTLs and fQTLs only have top SNP information. Is there a chance you could upload full summary stats for cQTLs and fQTLs as well?

A:
We calculated cQTLs and fQTLs differently from eQTLs and isoQTLs. So we only have top SNP information for cQTLs and fQTLs.

Information about datasets from PsychENCODE

Q1:
I am writing to ask where I might be able to find a list of all of the datasets generated for the PsychENCODE project. Specifically, we would like to know how many single-cell and bulk RNA-seq datasets were generated, and what the sex and age is of the samples used to generate these datasets. I was not able to find this information in the supplementary materials from your 2018 Science paper or on the PsychENCODE website, but perhaps I am missing something. Before we start the application process to access the raw data, it would be very helpful to have this information.

A1:
You should be able to find a list of all datasets used for Wang et al. (’18) at resource.psychencode.org. Please contact Prashant (copied).

Please refer to http://resource.psychencode.org/Datasets/RawData/RAW-01_PEC_Table_of_Datasets.xlsx

This contains the set of datasets associated with the analysis in the Wang et al paper, focusing on the adult samples. Of course, this is a subset of the total PsychENCODE datasets. I can see if there is a simple resource for you to access to get the information from the superset. I will let you know soon.

I am going to answer your question in two parts. Here is Part 1:

The metadata for the prenatal and adult single-nuclei datasets is available at http://development.psychencode.org/ under the "Processed Data" heading, "Single cell/nucleus RNA-seq". The ages and sexes of the sampled individuals can be found in the .xlsx files with the labels "QC" appended.

Q2:
Can you tell me if additional single-cell RNA-seq datasets will be generated in the next phase of this project and what the timeframe might be?

A2:
There will definitely be a significant expansion of the single-cell/nucleus datasets in the next phase, though it is not yet certain how long that will take. I am hesitant to guess right now, but please check back in a couple of months and we may have a better answer.

PsychENCODE derived data types

Q1:
I am looking at
http://resource.psychencode.org/#Derived
under
Derived Data Types
there are a couple of gene expression matrices. What are the columns (samples), ie which ones are which cases and which controls in the header file:
http://resource.psychencode.org/Datasets/Derived/Header_DER-01_PEC_Gene_expression_matrix_normalized.txt

What is the difference between
DER-01_PEC_Gene_expression_matrix_normalized
and
DER-02_PEC_Gene_expression_matrix_TPM
besides the fact that one has 43,886 lines and the other has 57,821 lines and that one has 1,932 columns and the other 1,867 columns.

A1:
1) "What are the columns (samples), ie which ones are which cases and which controls in the header file"

I am unable to pass on this information, as our DCC mentioned that diagnosis information would only be available upon application to Synapse for approval by the NIMH and investigators. Please contact them for access approval.

2)
"What is the difference between
DER-01_PEC_Gene_expression_matrix_normalized
and
DER-02_PEC_Gene_expression_matrix_TPM"

The values differ simply because one matrix is expressed in FPKM units and the other in TPM.

3) "besides the fact that one has 43,886 lines and the other has 57,821 lines"

There is very likely a difference in the thresholding of gene expression applied to these two datasets. I have reached out to my colleague who processed these matrices and will get back to you with a more definite response soon.

4) "that one has 1,932 columns and the other 1,867 columns."

The column number differences arise from the following: 1931 is the original number of DFC samples considered, which includes both adult and non-adult individuals. Once we filtered out the 65 non-adult samples, we obtained the 1866 individuals in the second matrix. Unfortunately, this was not made clear on the website. I will be updating this soon.

Q2:
But wait a minute, these files are useless without at least knowing who the cases and who the controls are?

A2:
Here is the rationale:
PsychENCODE placed restrictions on the dissemination of metadata. While adhering to those restrictions, we endeavored to put out as many of the processed datasets from our analyses as possible to allow for reproduction or downstream usage. This includes several intermediate files. Some may require protected data obtained with the permission of the consortium to perform downstream analyses, but even then the files on our website are in a format that would aid such analyses, and that are not available elsewhere.

For completeness, here are the answers to your original questions:
1) Method for generating DER-01:

a) Using the original FPKM file, we kept genes with >0.1 FPKM in at least 10 individuals. (GTEx also applied a filter requiring raw read counts greater than 6; we did not have the raw data from GTEx, so we did not apply a filter on raw read counts.)

b) Quantile normalization was performed to bring the expression profile of each sample onto the same scale.

c) To protect against outliers, inverse quantile normalization was performed for each gene, mapping each set of expression values to a standard normal.

2) Method for generating DER-02: The TPM file was converted directly from the original FPKM file.
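
The processing steps listed above can be sketched in Python as follows. This is an illustrative reading, not the code used to produce DER-01 or DER-02. The tie handling, the (rank - 0.5) / n quantile convention, the function names, and the toy values are all assumptions. The FPKM-to-TPM step uses the standard rescaling of each sample so that its values sum to one million.

```python
from statistics import NormalDist


def filter_genes(fpkm, min_fpkm=0.1, min_individuals=10):
    """Keep genes with FPKM > min_fpkm in at least min_individuals samples."""
    return {
        gene: values
        for gene, values in fpkm.items()
        if sum(v > min_fpkm for v in values) >= min_individuals
    }


def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values.

    matrix is a list of gene rows, each holding one value per sample.
    Ties are broken by row order, which is a simplification.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    sorted_cols = [
        sorted(matrix[g][s] for g in range(n_genes)) for s in range(n_samples)
    ]
    # Reference distribution: mean across samples of the r-th smallest value.
    rank_means = [
        sum(col[r] for col in sorted_cols) / n_samples for r in range(n_genes)
    ]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        order = sorted(range(n_genes), key=lambda g: matrix[g][s])
        for rank, g in enumerate(order):
            out[g][s] = rank_means[rank]
    return out


def inverse_normal_transform(values):
    """Map one gene's values to standard-normal quantiles by rank.

    Assumption: the quantile for rank r (1-based) out of n is (r - 0.5) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return out


def fpkm_to_tpm(sample_fpkm):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(sample_fpkm)
    return [v / total * 1e6 for v in sample_fpkm]


# Toy data, invented for illustration only.
toy = {"geneA": [0.2] * 10, "geneB": [0.2] * 9 + [0.0]}
kept = filter_genes(toy)
normalized = quantile_normalize(
    [[5.0, 4.0, 3.0], [2.0, 1.0, 4.0], [3.0, 2.0, 6.0]]
)
gene_scores = inverse_normal_transform([3.0, 1.0, 2.0])
tpm = fpkm_to_tpm([1.0, 3.0])
print(sorted(kept), normalized, gene_scores, tpm)
```

In a full pipeline the inverse normal transform would be applied to each gene's row of the quantile-normalized matrix; the pieces are shown separately here to keep each step visible.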

Query about QTL calling from Wang et al PsychEncode paper

Q:

I have a quick query about the Wang et al paper from the PsychENCODE study.
Were the QTLs identified from all the samples or the control samples only?
I’ve checked the paper, online resources and the supplementary methods but can’t seem to work this out.

A:
The QTLs were identified from both control and disease samples. You can find the sample information in Table S11 (Summary of dataset).