I am looking at
Under "Derived Data Types" there are a couple of gene expression matrices. What are the columns (samples), i.e. which ones are cases and which are controls in the header file?
What is the difference between
besides the fact that one has 43,886 lines and the other has 57,821 lines, and that one has 1,932 columns and the other 1,867 columns?
1) "What are the columns (samples), i.e. which ones are cases and which are controls in the header file"
I am unable to pass on this information, as our DCC mentioned that diagnosis information would only be available upon application to Synapse for approval by the NIMH and investigators. Please contact them for access approval.
2) "What is the difference between"
The values differ simply because one matrix reports expression in FPKM and the other in TPM.
3) "besides the fact that one has 43,886 lines and the other has 57,821 lines"
There is very likely a difference in the thresholding of gene expression applied to these two datasets. I have reached out to my colleague who processed these matrices and will get back to you with a more definite response soon.
4) "that one has 1,932 columns and the other 1,867 columns."
The column number differences arise from the following: 1931 is the original number of DFC samples considered, which includes both adult and non-adult individuals. Once we filtered out the 65 non-adult samples, we obtained the 1866 individuals in the second matrix. Unfortunately, this was not made clear on the website. I will be updating this soon.
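Until the website is updated, the samples removed between the two matrices can be listed by comparing their header lines. The following is a minimal sketch, assuming both headers are tab-delimited and use the same sample identifiers; the function name `dropped_samples` and the example identifiers are hypothetical.

```python
def dropped_samples(header_a, header_b, sep="\t"):
    """Return column names present in header_a but missing from header_b.

    Both arguments are header lines given as strings (assumed tab-delimited).
    The result keeps the column order of header_a.
    """
    cols_b = set(header_b.rstrip("\n").split(sep))
    return [c for c in header_a.rstrip("\n").split(sep) if c not in cols_b]


# Hypothetical headers: a gene identifier column followed by sample IDs.
full_header = "gene\tS1\tS2\tS3\n"
adult_header = "gene\tS1\tS3\n"
print(dropped_samples(full_header, adult_header))
```

Applied to the real header lines, the length of the returned list should match the 65 filtered samples if the assumptions above hold.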
But wait a minute, aren't these files useless without at least knowing which samples are cases and which are controls?
Here is the rationale:
PsychENCODE placed restrictions on the dissemination of metadata. While adhering to those restrictions, we endeavored to put out as many of the processed datasets from our analyses as possible to allow for reproduction or downstream usage. This includes several intermediate files. Some may require protected data, obtained with the permission of the consortium, to perform downstream analyses, but even then the files on our website are in a format that would aid such analyses and are not available elsewhere.
For completeness, here are the answers to your original questions:
1) Method for generating DER-01:
   a) Using the original FPKM file, we kept genes with >0.1 FPKM in at least 10 individuals (though GTEx also applied a filter requiring raw read counts greater than 6; we did not have the raw data from GTEx, so we didn't apply a filter on raw read counts).
   b) Quantile normalization was performed to bring the expression profile of each sample onto the same scale.
   c) To protect against outliers, inverse quantile normalization was performed for each gene, mapping each set of expression values to a standard normal.
2) Method for generating DER-02: The TPM file was converted directly from the original FPKM file.
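The processing described for DER-01 and DER-02 can be sketched in Python with NumPy and SciPy. This is a minimal sketch under stated assumptions, not the code used to produce the files: the function names, the genes-by-samples orientation, the handling of ties, and the `rank / (n + 1)` offset in the inverse normal step are all assumptions. The FPKM-to-TPM step uses the standard identity TPM_i = FPKM_i / sum_j(FPKM_j) x 10^6, applied per sample.

```python
import numpy as np
from scipy.stats import norm, rankdata


def filter_genes(fpkm, min_fpkm=0.1, min_samples=10):
    """Keep genes (rows) with FPKM > min_fpkm in at least min_samples samples (columns)."""
    keep = (fpkm > min_fpkm).sum(axis=1) >= min_samples
    return fpkm[keep], keep


def quantile_normalize(x):
    """Give every sample (column) the same distribution: the mean of the sorted columns."""
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)            # 0-based rank of each value within its column
    reference = np.sort(x, axis=0).mean(axis=1)  # shared reference distribution
    return reference[ranks]                      # ties are broken arbitrarily (assumption)


def inverse_normal_transform(x):
    """Per gene (row), replace values with standard-normal quantiles of their ranks."""
    n = x.shape[1]
    ranks = np.apply_along_axis(rankdata, 1, x)  # average ranks, 1..n
    return norm.ppf(ranks / (n + 1))             # assumed offset; keeps quantiles inside (0, 1)


def fpkm_to_tpm(fpkm):
    """Convert a genes-by-samples FPKM matrix to TPM, one sample (column) at a time."""
    fpkm = np.asarray(fpkm, dtype=float)
    totals = fpkm.sum(axis=0, keepdims=True)     # a column summing to zero would yield NaN
    return fpkm / totals * 1e6


# DER-01-style processing on a random stand-in matrix (not real data).
rng = np.random.default_rng(0)
fpkm = rng.gamma(2.0, 2.0, size=(50, 20))
filtered, kept = filter_genes(fpkm)
normalized = inverse_normal_transform(quantile_normalize(filtered))

# DER-02-style conversion from the unfiltered matrix.
tpm = fpkm_to_tpm(fpkm)
```

Because the TPM denominator sums over every gene in a sample, the converted values depend on which genes are present in the matrix at conversion time.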