Bulk Tissue Deconv. Cell Fractions

Posted on May 31, 2019 by gersteinfaq

Q:
I would like to apply the bulk-tissue deconvolution algorithm from your recent paper (Wang et al., 2018) using our own single-cell RNA-seq data and the bulk-tissue RNA-seq data from Gandal et al., 2018. I couldn’t find code related to the deconvolution steps on the Gerstein Lab GitHub page (https://github.com/gersteinlab/PsychENCODE-DSPN) or on the PsychENCODE resources page. I only found the results of the cell fraction calculations. Would you be able to point me towards how I can apply this algorithm?

A:
We used the non-negative least squares (NNLS) method for deconvolution and implemented it using the R function nnls (https://www.rdocumentation.org/packages/lsei/versions/1.2-0/topics/nnls). For example, nnls(C, bi) estimates the cell fractions for the ith tissue sample, where C is the cell type gene expression matrix (rows: genes, columns: cell types) and bi is the gene expression vector for the ith tissue sample.
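For readers who do not work in R, the same idea can be sketched in Python. This is only an illustration, not the code used for the paper: it swaps in SciPy's `scipy.optimize.nnls` for the R function, the matrix `C` and the vector `b_i` are made-up toy values, and the final rescaling of the weights so that they sum to 1 is an assumption that the answer above does not mention.

```python
import numpy as np
from scipy.optimize import nnls


def estimate_cell_fractions(C, b):
    """Solve min ||C w - b||_2 subject to w >= 0 for one bulk sample.

    C: genes x cell types signature matrix.
    b: expression vector of the bulk sample over the same genes.
    """
    weights, residual_norm = nnls(C, b)
    return weights, residual_norm


# Toy signature matrix (hypothetical values): 4 genes x 2 cell types.
C = np.array([[10.0, 0.0],
              [0.0, 8.0],
              [5.0, 5.0],
              [2.0, 1.0]])

# Simulate a bulk sample as a known mixture so the answer can be checked.
true_weights = np.array([0.3, 0.7])
b_i = C @ true_weights

weights, residual_norm = estimate_cell_fractions(C, b_i)

# Optional rescaling to proportions; an assumption, not stated in the source.
fractions = weights / weights.sum()
print(fractions)
```

To process many samples, the same call would be repeated once per column of the bulk expression matrix, with the genes in `C` and in each sample vector kept in the same order.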

Mouse Transcribed Pseudogene Data

Posted on May 23, 2019 by gersteinfaq

Q:
I’m currently working on how pseudogenes can act as competitive endogenous RNAs in humans, and would like to expand my study to include mice. I recently read a paper from your lab, Comparative analysis of pseudogenes across three phyla, and in the supplementary information you mention that you identified 878 transcribed pseudogenes in the mouse genome. Is there a list of these pseudogenes as well as their associated parent genes available on either the pseudogene.org website or on a different website?

A:
I think this draft list should be on the psicube site.

Questions about using PseudoPipe

Posted on May 19, 2019 by gersteinfaq

Q1:
First of all, I must express great respect for your brilliant work developing the PseudoPipe software.
I am now working on my graduate paper and need to use this software, but I have run into some problems, so any guidance or assistance from you would be appreciated.
I downloaded the software package from your website and unpacked it in my home directory (that is, ~/), but when I tested it according to your manual, it reported the errors below:
I have tried several ways to fix it, even modifying the source code, but failed. It has been driving me somewhat crazy, haha.
Can you please provide some suggestions? Thanks in advance!

A1:
It looks like your installation is not referencing Python properly. Please edit the env.sh file so that it points to the appropriate source/path for Python on your system.

Q2:
Following your suggestion, I have finished setting all the environment variables in env.sh, but I still got an error while running the software (see Fig. 1 below).
So I tried to fix the code of pseudopipe.sh, and I finally got it to run just by changing "source setenvPipelineVars" to "source ./setenvPipelineVars" at line 141. I then got the final result file (see Fig. 2) by running your sample data. Is the result correct?
I don’t know whether anybody has reported a similar error before. If not, I hope this will contribute to improving your powerful software. It would also be great if your manual or README showed what the standard output and final result file look like when testing the sample data.

A2:
The results look right. Thank you for your suggestions; we will take them into account in a future update of the pipeline.

Good luck with your analysis.

Question about deconvolution analysis in PsychENCODE paper

Posted on May 19, 2019 by gersteinfaq

Q:
I have a question about the deconvolution method used in the flagship PsychENCODE paper, "Comprehensive functional genomic resource and integrative model for the human brain". I would like to perform a similar analysis on my own bulk samples using the single-cell expression profiles used in the paper; however, it is unclear how these profiles were formed.

Specifically, supplementary file DER-23 lists the cell type fractions for 24 cell types. These coefficients presumably came from solving the following:

B = C * W

Where B is the marker gene by samples matrix, C is the marker gene by cell type matrix, and W is the appropriate weights matrix. How do I go about obtaining or reproducing the 24 cell type profiles? From what I can tell, these profiles were not released along with the other supplemental data sets.

If you could please answer my question or forward this email on to the appropriate author(s), I would appreciate it.

A:
Sorry for the late reply. I think the profiles you want are on resource.psychencode.org.

Requesting information about cQTL and fQTL data from PsychENCODE

Posted on May 19, 2019 by gersteinfaq

Q:
I am writing in regard to the datasets posted on the PsychENCODE website. I noticed that full summary statistics for QTL maps are posted for eQTLs and isoQTLs, but cQTLs and fQTLs only have top SNP information. Is there a chance you could upload full summary stats for cQTLs and fQTLs as well?

A:
We calculated cQTLs and fQTLs differently from eQTLs and isoQTLs, so we only have top SNP information for cQTLs and fQTLs.

Information about datasets from PsychENCODE

Posted on May 17, 2019 by gersteinfaq

Q1:
I am writing to ask where I might be able to find a list of all of the datasets generated for the PsychENCODE project. Specifically, we would like to know how many single-cell and bulk RNA-seq datasets were generated, and the sex and age of the samples used to generate these datasets. I was not able to find this information in the supplementary materials from your 2018 Science paper or on the PsychENCODE website, but perhaps I am missing something. Before we start the application process to access the raw data, it would be very helpful to have this information.

A1:
You should be able to find a list of all datasets used for Wang et al. (’18) at resource.psychencode.org. Please contact Prashant (copied).

Please refer to http://resource.psychencode.org/Datasets/RawData/RAW-01_PEC_Table_of_Datasets.xlsx

This contains the set of datasets associated with the analysis in the Wang et al. paper, focusing on the adult samples. Of course, this is a subset of the total PsychENCODE datasets. I will see whether there is a simple resource you can use to get this information for the superset, and I will let you know soon.

I am going to answer your question in two parts. Here is Part 1:

The metadata for the prenatal and adult single-nuclei datasets is available at http://development.psychencode.org/ under the "Processed Data" heading, "Single cell/nucleus RNA-seq". The ages and sexes of the sampled individuals can be found in the .xlsx files with the labels "QC" appended.

Q2:
Can you tell me if additional single-cell RNA-seq datasets will be generated in the next phase of this project and what the timeframe might be?

A2:
There will definitely be a significant expansion of the single-cell/nucleus datasets in the next phase, though it is not yet clear how long that will take. I am hesitant to take a guess right now, but please check back in a couple of months and we may have a better answer.

psych encode derived data types

Posted on May 17, 2019 by gersteinfaq

Q1:
I am looking at http://resource.psychencode.org/#Derived where, under "Derived Data Types", there are a couple of gene expression matrices. What are the columns (samples), i.e., which ones are cases and which are controls, in this header file:
http://resource.psychencode.org/Datasets/Derived/Header_DER-01_PEC_Gene_expression_matrix_normalized.txt

What is the difference between DER-01_PEC_Gene_expression_matrix_normalized and DER-02_PEC_Gene_expression_matrix_TPM, besides the fact that one has 43,886 lines and the other has 57,821 lines, and that one has 1,932 columns and the other 1,867 columns?

A1:
1) "What are the columns (samples), ie which ones are which cases and which controls in the header file"

I am unable to pass on this information, as our DCC mentioned that diagnosis information would only be available upon application to Synapse for approval by the NIMH and investigators. Please contact them for access approval.

2) "What is the difference between DER-01_PEC_Gene_expression_matrix_normalized and DER-02_PEC_Gene_expression_matrix_TPM"

The difference in the values comes simply from the expression units: FPKM versus TPM.

3) "besides the fact that one has 43,886 lines and the other has 57,821 lines"

There is very likely a difference in the thresholding of gene expression applied to these two datasets. I have reached out to my colleague who processed these matrices and will get back to you with a more definite response soon.

4) "that one has 1,932 columns and the other 1,867 columns."

The difference in column counts arises as follows: 1,931 is the original number of DFC samples considered, which includes both adult and non-adult individuals. After filtering out the 65 non-adult samples, we obtained the 1,866 individuals in the second matrix. Unfortunately, this was not made clear on the website; I will be updating it soon.
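As a side note on the FPKM versus TPM point in 2), the standard conversion between the two units rescales each sample so that its values sum to one million. The sketch below shows that textbook conversion on made-up numbers; it is assumed, not confirmed, that the released TPM matrix was produced exactly this way, and the function name `fpkm_to_tpm` is hypothetical.

```python
import numpy as np


def fpkm_to_tpm(fpkm):
    """Convert a genes x samples FPKM matrix to TPM.

    Each column (sample) is divided by its own total and multiplied by 1e6,
    so every sample sums to one million after conversion.
    """
    column_sums = fpkm.sum(axis=0, keepdims=True)
    return fpkm / column_sums * 1e6


# Toy FPKM values (hypothetical): 3 genes x 2 samples.
fpkm = np.array([[10.0, 0.0],
                 [30.0, 5.0],
                 [60.0, 15.0]])

tpm = fpkm_to_tpm(fpkm)
print(tpm)
```

Because the rescaling factor differs per sample, TPM values are comparable as proportions within a sample, which is why the two matrices contain different numbers for the same gene and individual.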

Q2:
But wait a minute, these files are useless without at least knowing who the cases and who the controls are?

A2:
Here is the rationale:
PsychENCODE placed restrictions on the dissemination of metadata. While adhering to those restrictions, we endeavored to put out as many of the processed datasets from our analyses as possible to allow for reproduction or downstream usage. This includes several intermediate files. Some may require protected data obtained with the permission of the consortium to perform downstream analyses, but even then the files on our website are in a format that would aid such analyses, and that are not available elsewhere.

For completeness, here are the answers to your original questions:
1) Method for generating DER-01:

a) Using the original FPKM file, we kept genes with >0.1 FPKM in >=10 individuals (GTEx also applied a filter requiring raw read counts greater than 6, but we did not have the raw data from GTEx, so we didn’t apply a filter on raw read counts).

b) Quantile normalization was performed to bring the expression profile of each sample onto the same scale.

c) To protect against outliers, inverse quantile normalization was performed for each gene, mapping each set of expression values to a standard normal.

2) Method for generating DER-02: The TPM file was converted directly from the original FPKM file.
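The DER-01 steps described above (expression filter, quantile normalization across samples, inverse quantile normalization per gene) can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline that produced the released file: the function names are hypothetical, the input is random toy data, ties are broken by position, and the rank offset of 0.5 used when mapping ranks to normal quantiles is one common choice that the answer does not specify.

```python
import numpy as np
from statistics import NormalDist


def filter_genes(fpkm, min_fpkm=0.1, min_samples=10):
    """Keep genes (rows) with FPKM > min_fpkm in at least min_samples samples."""
    keep = (fpkm > min_fpkm).sum(axis=1) >= min_samples
    return fpkm[keep], keep


def quantile_normalize(x):
    """Give every sample (column) the same distribution of values.

    The shared distribution is the mean of the sorted columns; each value is
    replaced by the shared value at its within-sample rank.
    """
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    mean_sorted = np.sort(x, axis=0).mean(axis=1)
    return mean_sorted[ranks]


def inverse_normal_transform(x):
    """Map each gene (row) onto standard-normal quantiles by rank."""
    n_samples = x.shape[1]
    ranks = np.argsort(np.argsort(x, axis=1), axis=1) + 1
    quantiles = (ranks - 0.5) / n_samples  # offset of 0.5 is an assumption
    to_z = np.vectorize(NormalDist().inv_cdf)
    return to_z(quantiles)


# Toy input (hypothetical): 50 genes x 12 samples of positive values.
rng = np.random.default_rng(0)
fpkm = rng.gamma(shape=2.0, scale=5.0, size=(50, 12)) + 1.0
fpkm[:5] = 0.0  # five unexpressed genes that the filter should drop

filtered, keep = filter_genes(fpkm)
normalized = quantile_normalize(filtered)
z_scores = inverse_normal_transform(normalized)
print(filtered.shape, z_scores.shape)
```

After the last step every gene has the same set of values, only in a different order across individuals, so a single extreme sample can no longer dominate a gene's variance.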

Question about publication data “Comparative analysis of pseudogenes across three phyla”

Posted on May 11, 2019 by gersteinfaq

Q:
I’m looking at some of the data connected with your recent publication and was wondering if I could get clarification on the BioType attribute in the following file:

http://www.pseudogene.org/psicube/data/Worm-Annotation.bed

In here there appear to be 3 biotypes

processed_pseudogene
pseudogene
unprocessed_pseudogene

Looking through the paper and the supplementary material, I can find references to processed_pseudogene and unprocessed_pseudogene, but not to the generic pseudogene. Reading the S1 material, I would not expect to see this third biotype.

–snip–
(a) Classification
Pseudogenes were classified as “processed” if they have lost their parental gene structures.
Conversely, we classified pseudogenes as “unprocessed”/ “duplicated” if they retained the
same exon-intron structure as their parent loci. In ambiguous cases we used other features to
resolve the provenance of the pseudogene. Where the pseudogene represented a fragment of
the parent, and the homology ended precisely at a splice junction the pseudogene was called
“unprocessed” (“duplicated”). Conversely, where the fragment contained the fusion of two or
more exons the pseudogene was called “processed”. If the parent had a single exon CDS, the
presence of parent gene structure in the 5′ UTR region (identified by alignment of mRNA and
EST evidence) allowed the pseudogene to be called “unprocessed”/“duplicated”. Meanwhile,
the presence of a pseudopoly(A) signal (the position of the parent poly(A) signal at the
pseudogene locus) followed by a tract of A-rich sequence in the genome (indicating the
insertion site of the polyadenylated parental mRNA) indicated a “processed” pseudogene. If
there was no other evidence available to resolve the route by which the pseudogene was
created, we used the position of the pseudogene relative to its parent. As such “processed”
pseudogenes are reinserted into the genome with an approximately random distribution while
“unprocessed”/“duplicated” pseudogenes tend to be more closely associated with the parent
locus. Parsimony therefore suggests that pseudogenes that lie near to the parent locus are
more likely to have arisen via a gene-duplication event than retrotransposition, and this was
used as a tie-breaker in defining the pseudogene biotype.
–snip–

I hope I haven’t missed anything obvious, but any clarification would help greatly.

A:
When we classify pseudogenes by biotype, we have processed pseudogenes and duplicated pseudogenes. The biotype depends on the pseudogene formation process (retrotransposition vs. duplication), and this is the description that you see in the supplementary material. The third biotype that you find in some of the files on the psicube website is not a biotype per se: these pseudogenes are usually highly degraded or short fragments, and we could not confidently assign them a definite biotype. In other words, pseudogenes labeled “pseudogene” have an undetermined biotype. Instead of labeling them “NA” (not available or unknown), we opted to simply call them “pseudogene”.

Spectral biclustering

Posted on May 11, 2019 by gersteinfaq

Q:
I recently read your 2003 paper titled "Spectral biclustering of microarray data: Coclustering genes and conditions".

I would like to investigate implementing your approach on a GPU. Is there any code (Matlab? Python?) you would be willing to share as a result of the paper?

A:
Sorry, we’re just using simple SVD routines in MATLAB. No meaningful code available. -Mark
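Since no code is available, here is a rough Python sketch of what an SVD-based biclustering step can look like. It is not the lab's code and covers only the simplest case: it assumes a positive matrix with a two-by-two checkerboard structure, rescales rows and columns by the square roots of their sums (one normalization discussed for spectral biclustering), and splits rows and columns by the sign of the second singular vectors. The function name and the toy matrix are made up.

```python
import numpy as np


def spectral_bipartition(a):
    """Split the rows and columns of a positive matrix into two groups each."""
    row_sums = a.sum(axis=1)
    col_sums = a.sum(axis=0)
    # Rescale rows and columns by the inverse square roots of their sums.
    a_norm = a / np.sqrt(np.outer(row_sums, col_sums))
    u, s, vt = np.linalg.svd(a_norm, full_matrices=False)
    # The leading singular pair mostly reflects overall scale; in this toy
    # setting the second pair carries the two-block checkerboard signal.
    row_labels = (u[:, 1] > 0).astype(int)
    col_labels = (vt[1, :] > 0).astype(int)
    return row_labels, col_labels


# Toy checkerboard (hypothetical): 10 where row and column groups match, else 1.
true_rows = np.array([0, 1, 0, 1, 1, 0])
true_cols = np.array([1, 0, 0, 1, 0, 1, 1, 0])
a = np.where(true_rows[:, None] == true_cols[None, :], 10.0, 1.0)

row_labels, col_labels = spectral_bipartition(a)
print(row_labels, col_labels)
```

The labels are only defined up to swapping 0 and 1, because singular vectors have an arbitrary sign. For a GPU port, the SVD call is the expensive step and is the natural piece to hand to a GPU linear algebra routine; real data would also need more than two groups and a clustering step on the singular vectors instead of a sign split.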

OrthoClust – for more than two species

Posted on May 11, 2019 by gersteinfaq

Q:
I just read your recently published paper on the OrthoClust approach. It is well-grounded work from both practical and mathematical points of view.

I ran your R scripts on my own data and they worked perfectly; however, I am wondering how I can use the script for more than two species.

I would appreciate your help in finding a solution.

A:
Thanks for your interest in OrthoClust. OrthoClust definitely works on more than two species. The R script is a primitive version for illustrating the concept outlined in the paper. We understand the importance of the N-species generalization, so we have released new MATLAB code for N species. It makes use of efficient code written by Mucha and Porter that implements the Louvain algorithm for modularity optimization. The third-party code as well as our wrapper is now on the gersteinlab GitHub.
Apart from MATLAB, we are planning to provide a wrapper for Python or R later.
The N-species code is not exactly what we used for the paper, so if you find any bugs or have any questions, please let me know. We are trying to make a more user-friendly package anyway.

Gerstein Lab FAQs

Frequently Asked Questions

Monthly Archives: May 2019
