Query Regarding GRAM eQTL & MTC

Q:
I am contacting you as the corresponding author for the paper: "GRAM: A generalized model to predict the molecular effect of a non-coding variant in a cell-type specific manner." PLoS genetics 15.8 (2019): e1007860.

I would like to express my thanks to you and your group for developing & publishing GRAM. I have recently tested it out and the results have been most interesting.

I have begun to work with eQTL analysis only recently and as a result, I was wondering what you would recommend as a multiple testing correction method for GRAM score based eQTL analysis?

From the literature I have seen that standard multiple testing correction methods such as Bonferroni & Benjamini-Hochberg have been considered too conservative for regular eQTL analysis as they do not take linkage disequilibrium into account, and several permutation testing based approaches have been published specifically for eQTL as a result (e.g. eigenMT). However, as you have demonstrated GRAM score based eQTL to be able to differentiate the regulatory effects of variants in linkage disequilibrium, I am unsure whether such methods would be appropriate here.

A:
One application scenario for GRAM is fine-mapping, which supposes that you already have a list of eQTLs and the variants in LD with them. If you don’t have eQTLs and want to try GRAM for eQTL identification, one option is to compare the GRAM score with a normally distributed background (built from tens of thousands of randomly selected background variants), infer a p-value for each variant’s GRAM score relative to that background, and then use BH or another FDR method for the multiple-testing correction.

Frankly speaking, this is a great direction for extending GRAM, and we may consider testing it soon. The most computation-intensive part is calculating DeepBind scores for the background variants, which will take a long time if we want to test millions of them. If you have any feedback, further questions or preliminary findings regarding this, please feel free to let us know.
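
The background comparison suggested in this answer can be sketched roughly as follows. This is only an illustration, not part of GRAM: the function names are hypothetical, the scores are assumed to be precomputed and passed in as plain arrays, and whether a one-sided or two-sided test is appropriate for GRAM scores is an assumption left to the user.

```python
import math
import numpy as np

def normal_background_pvalues(scores, background, two_sided=True):
    """P-value of each score under a normal distribution fitted to the
    background scores (hypothetical helper, not part of GRAM)."""
    background = np.asarray(background, dtype=float)
    mu = background.mean()
    sigma = background.std(ddof=1)  # sample standard deviation
    z = (np.asarray(scores, dtype=float) - mu) / sigma
    if two_sided:
        # P(|Z| >= |z|) = erfc(|z| / sqrt(2))
        return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    # One-sided upper tail: P(Z >= z) = 0.5 * erfc(z / sqrt(2))
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted
```

Variants whose adjusted p-value falls below a chosen FDR threshold (0.05, for instance) would then be the candidates. The normal fit should be checked against the actual background distribution before the p-values are trusted.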

Small question of the paper “Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences”

Q:
Recently, I read a paper published in Cell titled "Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences". Because my research topic is similar to this paper's, I have one question about Figure 2B. In this heatmap, I see 80 motifs in total along the bottom but only 70 rows above them, so I am a little confused: how did you know that the ETS motif matched the marked row?

A:
The rows in the figure correspond to different cancer cohorts or meta-cohorts. Information on the cancer cohorts with significant differential burdening is also provided in Supplement 1 of the paper.

PCAWG passenger mutation analysis

Q1:
I was trying to download a subset of data from your recent paper (https://www.cell.com/cell/fulltext/S0092-8674(20)30113-6). However, the website is returning a ‘not found’ error (http://pcawg.gersteinlab.org/). I am especially interested in ‘Gene list categories’. Therefore, I kindly request that you share the relevant files listed under ‘Gene List Categories’ on the website, so that I can use them in my analysis.

A1:
The website works fine for me. Are you sure it doesn’t work? … Please let me know which specific file you are trying to download.

Q2:
Thanks a lot for the reply.

I need the gene list categories listed under PCAWG-specific annotations (http://pcawg.gersteinlab.org/#Annotations)

Essential Genes
Immune Response Genes
DNA Repair Genes
Metabolic Genes
Cancer Pathway Genes
Non-Essential Genes
Cell Cycle Genes

For some reason, when I click on a link, it directly downloads an HTML file containing an error. It would be great if you could share these files.

A2:
You can download relevant files from the link listed below.

http://pcawg.gersteinlab.org/Datasets/Annotations/categories/

Referring to your paper: Structuring supplemental materials in support of reproducibility

Q:
I just read your paper mentioned above. I work in the area of
computational reproducibility so the paper was pretty interesting to
read. However, I stumbled a bit over one of your concluding remarks. You
are saying

"One useful tactic may be detailed sampling: perhaps it is best for the
editor to organize a system wherein, randomly, referees are asked to
review samples in greater detail to ensure the overall quality of the
supplements without quickly overwhelming the peer review system."

I am not sure whether I understood correctly how this could be
implemented. Does it mean that the editor randomly asks one of the
reviewers to look at the supplements, or do all reviewers look at
subsets of supplements? I find this idea pretty interesting and was
wondering whether you have published further articles on this topic?

A:
With respect to: "Does it mean that the editor randomly asks one of the reviewers to look at the supplements, or do all reviewers look at subsets of supplements?"
-> The former

With respect to: "I find this idea pretty interesting and was wondering whether you have published further articles on this topic?"
-> Not exactly, but you might find the following related work useful:
http://papers.gersteinlab.org/papers/structbl
http://papers.gersteinlab.org/papers/SDA

A forum for conversations about published paper

Q:
I saw your paper "Structuring supplemental materials in support of reproducibility" and appreciate your points. I would love to see a forum (like GATK’s forum or StackOverflow) where each topic for a conversation thread is a single published paper. Then everyone who is trying to replicate results could post their questions and authors their answers for all to see. I think this would be much better than the current closed system of emailing the authors. I would love to see a day when a link to a forum is provided on papers, rather than the authors’ email addresses. Who would have the ability to make something like this get started and catch on? Do you know if they are thinking about funding a platform for something like this at the NIH?

A:
With respect to "Who would have the ability to make something like this get started and catch on?"
Maybe PLOS

With respect to "Do you know if they are thinking about funding a platform for something like this at the NIH?"
Don’t know

Permission to use images

Q:
I have been using the Genboree exceRpt workflow, and loving it! It has saved me so much time! Your paper got me on to it, and I would like to use one of the figures (1) of the exceRpt pipeline in my PhD thesis. Am I right to contact you to request permission? Or should I be heading to Cell for this?

A:
fine w/ me – just acknowledge us (see
https://sites.gersteinlab.org/permissions/)

Retrotransposon Quantitation

Q:
I read about the recently published software for deconvoluting pervasive and autonomous retrotransposons. Could another calculation be added to the software’s output which estimates the abundance of ORF1 and ORF2, the parts of the retrotransposon which are translated into protein? I’m not experienced in this research area, so I am unsure of how feasible that is. I would like to make an approximation to the ORF1 and ORF2 protein abundances using RNA-seq.

A:
Thanks for reaching out here and on GitHub. This is an interesting question and suggestion. Unfortunately, estimating the protein abundance of ORF1 and ORF2 from RNA-seq is extremely hard. There are essentially two factors that make it difficult to estimate protein abundance from transcriptome data. The first is technical. RNA-seq has a strong bias toward overrepresenting the 3′ end of transcripts; therefore, ORF2 would most likely be overestimated. This issue is easily addressable.

The second one is more biological: LINE-1 is tightly regulated at many different levels. Not only is LINE-1 transcription regulated, but there are also many post-transcriptional mechanisms that either boost or stop LINE-1 translation. This is not only true for LINE-1; in general, estimating protein abundance from RNA is a hard problem (https://www.nature.com/articles/nrg3185).

That said, I’m really interested in this question. In theory, we could use machine learning algorithms to predict ORF1 and ORF2 protein levels based on RNA-seq if we had enough data. This could be interesting follow-up work after TeXP.

HiC-spector space complexity

Q:
I am currently running HiC-spector on mouse genome datasets with bin size 5kb. I noticed that it requires quite a lot of memory, so I was wondering if there were tests done on HiC-spector’s space complexity, as I couldn’t find such studies in the Supplementary Data.

A:
We didn’t do this analysis explicitly. Because the contact maps are stored as sparse matrices, the memory won’t grow quadratically with the number of bins. In general, if the calculation is done chromosome by chromosome, a 5kb bin size should be fine.
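
To give a feel for why sparse storage matters at this resolution, here is a back-of-the-envelope sketch. It is not a measurement of HiC-spector itself: the helper names are hypothetical, the chromosome length and the number of non-zero contacts are illustrative inputs, and the sparse estimate assumes a simple coordinate layout (one 8-byte value plus two 4-byte indices per stored contact).

```python
import math

def n_bins(chrom_length_bp, bin_size_bp):
    """Number of bins needed to cover one chromosome."""
    return math.ceil(chrom_length_bp / bin_size_bp)

def dense_bytes(n, value_bytes=8):
    """Memory for a dense n x n contact matrix: quadratic in n."""
    return n * n * value_bytes

def sparse_coo_bytes(nnz, value_bytes=8, index_bytes=4):
    """Memory for a coordinate-format sparse matrix: linear in the
    number of stored (non-zero) contacts, independent of n squared."""
    return nnz * (value_bytes + 2 * index_bytes)

# Illustrative inputs only (assumptions, not taken from any dataset):
# a chromosome of about 195 Mb at 5kb bins, with 50 million non-zero contacts.
bins = n_bins(195_000_000, 5_000)
print(f"bins: {bins}")
print(f"dense:  {dense_bytes(bins) / 1e9:.1f} GB")
print(f"sparse: {sparse_coo_bytes(50_000_000) / 1e9:.1f} GB")
```

Under these assumptions the dense matrix for a single chromosome would need about 12 GB, while the sparse form scales with the number of observed contacts, which is why processing one chromosome at a time keeps the footprint manageable.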