A forum for conversations about published papers

I saw your paper "Structuring supplemental materials in support of reproducibility" and appreciate your points. I would love to see a forum (like GATK’s forum or StackOverflow) where each topic for a conversation thread is a single published paper. Then everyone who is trying to replicate results could post their questions, and authors could post their answers, for all to see. I think this would be much better than the current closed system of emailing the authors. I would love to see a day when a link to a forum is provided on papers, rather than the authors’ email addresses. Who would have the ability to make something like this get started and catch on? Do you know if they are thinking about funding a platform for something like this at the NIH?

with respect to "Who would have the ability to make something like this get started and catch on?"
maybe PLOS

with respect to "Do you know if they are thinking about funding a platform for something like this at the NIH?"
don’t know

Permission to use images

I have been using the Genboree exceRpt workflow, and loving it! It has saved me so much time! Your paper got me on to it, and I would like to use one of the figures (1) of the exceRpt pipeline in my PhD thesis. Am I right to contact you to request permission? Or should I be heading to Cell for this?

fine w/ me – just acknowledge us (see

Retrotransposon Quantitation

I read about the recently published software for deconvoluting pervasive and autonomous retrotransposons. Could another calculation be added to the software’s output which estimates the abundance of ORF1 and ORF2, the parts of the retrotransposon which are translated into protein? I’m not experienced in this research area, so I am unsure of how feasible that is. I would like to make an approximation to the ORF1 and ORF2 protein abundances using RNA-seq.

Thanks for reaching out here and on GitHub. This is an interesting question and suggestion. Unfortunately, estimating the protein abundance of ORF1 and ORF2 from RNA-seq is extremely hard. There are essentially two factors that make it difficult to estimate protein abundance from transcriptome data. The first is technical: RNA-seq has a strong bias toward overrepresenting the 3′ end of transcripts, so ORF2 would most likely be overestimated. This issue is easily addressable.

The second one is more biological: LINE-1 is tightly regulated at many different levels. Not only is LINE-1 transcription regulated, but there are also many post-transcriptional mechanisms that either boost or stop LINE-1 translation. This is not unique to LINE-1; in general, estimating protein abundance from RNA is a hard problem (https://www.nature.com/articles/nrg3185).
That said, I’m really interested in this question. In theory, we could use machine learning algorithms to predict ORF1 and ORF2 protein levels based on RNA-seq if we had enough data. This could be interesting follow-up work after TeXP.

HiC-spector space complexity

I am currently running HiC-spector on mouse genome datasets with bin size 5kb. I noticed that it requires quite a lot of memory, so I was wondering if there were tests done on HiC-spector’s space complexity, as I couldn’t find such studies in the Supplementary Data.

We didn’t analyze the space complexity explicitly. Because the contact maps are stored as sparse matrices, the memory won’t grow quadratically with the number of bins. In general, if the calculation is done chromosome by chromosome, 5kb should be fine.
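
To make the sparse-versus-dense point concrete, here is a rough back-of-the-envelope sketch. It is not HiC-spector code and does not model the working memory of the eigendecomposition. The chromosome length (about 195 Mb for mouse chr1), the 8-byte value and index sizes, the coordinate-style storage layout, and the count of 50 million nonzero contacts are all illustrative assumptions.

```python
import math


def n_bins(chrom_length_bp, bin_size_bp):
    """Number of bins needed to tile one chromosome."""
    return math.ceil(chrom_length_bp / bin_size_bp)


def dense_bytes(bins, bytes_per_value=8):
    """Memory for a full bins x bins matrix: grows quadratically."""
    return bins * bins * bytes_per_value


def sparse_coo_bytes(nonzeros, bytes_per_value=8, bytes_per_index=8):
    """Memory for coordinate-style sparse storage:
    one row index, one column index and one value per stored contact."""
    return nonzeros * (bytes_per_value + 2 * bytes_per_index)


# Assumed inputs, for illustration only.
chr1_bp = 195_000_000        # approximate length of mouse chr1
bin_size = 5_000             # 5kb bins, as in the question
observed_contacts = 50_000_000  # hypothetical number of nonzero entries

bins = n_bins(chr1_bp, bin_size)
print("bins:", bins)
print("dense  GB:", dense_bytes(bins) / 1e9)
print("sparse GB:", sparse_coo_bytes(observed_contacts) / 1e9)
```

Under these assumptions the sparse footprint depends on how many bin pairs actually have contacts rather than on the square of the number of bins, which is why processing one chromosome at a time keeps memory manageable.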

EN-TEX data

Your postdoc gave a great talk about the EN-TEX work at the ASHG meeting. The data
generated from this project will benefit the community greatly. Could you
please tell when and how the data will be made available for external users?

Thank you for your suggestion. In the meantime, you can find the correct versions of fasta and blast freely available online. To ease the user experience, we provide a link to the two packages on the website http://pseudogene.org/pseudopipe/ .

Inquiry about STRESS

I am writing this e-mail to inquire about STRESS software.

We have learned from your paper (Structure 2016,24:826-837)
that STRESS software can be used for identifying allosteric pockets.
We are interested in using the software for our drug discovery research.
We will perform evaluation of the software for a start.

Will you allow us to use STRESS software for the purpose of our
commercial drug discovery project free of charge?

As this is an urgent project, we would highly appreciate if you could
reply soon.

see license at https://sites.gersteinlab.org/permissions/

HiC-Spector data

We have read with much interest your article about the HiC-Spector method.
We are currently working on a method that we hope will help identify
conserved features across different HiC-maps. As the problem we are studying
and the one tackled in your article are closely related, we think it would
be useful for us to test our method using your data set as the ground truth.
We kindly ask whether you would be able to provide us with the HiC maps used
in the article for this purpose.


Indel counts for RCC WGS paper

We’ve just been reading your excellent papillary RCC WGS paper- there is a
real paucity of data on papillary cases, so many thanks for this.

Sorry if I missed it, but do you happen to know the SNV and (small scale)
indel counts across the cohort? We’re especially interested in indel
mutations in RCC, and wondered what proportion of your variants were of this

For tumor SNV counts, you can find them in the supplemental table (https://doi.org/10.1371/journal.pgen.1006685.s009). We also include SVs in the supplements. Unfortunately, we do not have indels for those tumors.

Loregic – further validation

I’ve been trying to apply the Loregic algorithm in other organisms in order to further validate the method; however, I’m finding some inconsistencies that could be related to data manipulation (choosing datasets, merging and mean-centering samples).
Furthermore, I’ve also found those inconsistencies when trying to reproduce the analysis from yeast datasets provided in your publication (probably due to the same data manipulation issues described before).

Would you be able to provide a more in-depth protocol for using Loregic with multiple datasets (how you handled the data, for example) in order to improve the consistency of the method between labs?

Yes, we normalized the yeast data. Here is how we preprocessed it:

1) got time-series yeast cell cycle data (alpha, cdc15, cdc28) from
which were logarithm values.
2) standardized 2^(data) such that each time point has mean=0 and sigma=1
3) binarized the standardized data using the function,
binarizeTimeSeries with ‘kmeans’ clustering in R package BoolNet.
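
For readers who want to check their own preprocessing against these steps, here is a minimal Python sketch of steps 2 and 3. It is not the code we ran: the rows-are-genes, columns-are-time-points layout, the use of the sample standard deviation (n-1), and the function names are assumptions, and the simple one-dimensional 2-means routine is only a stand-in for BoolNet’s binarizeTimeSeries, whose results may differ in detail.

```python
import numpy as np


def standardize_time_points(log2_expr):
    """Undo the log2 transform, then z-score each time point (column).

    Assumes rows are genes and columns are time points.
    """
    linear = np.power(2.0, np.asarray(log2_expr, dtype=float))
    mean = linear.mean(axis=0, keepdims=True)
    sigma = linear.std(axis=0, ddof=1, keepdims=True)  # sample sd (assumption)
    return (linear - mean) / sigma


def binarize_two_means(values, n_iter=100):
    """1-D 2-means clustering: 1 for the high cluster, 0 for the low one.

    A stand-in for k-means binarization, not a port of BoolNet.
    """
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    if low == high:
        # A constant series cannot be split into two clusters.
        return np.zeros(values.shape, dtype=int)
    for _ in range(n_iter):
        # Assign each value to the nearer of the two centers.
        labels = np.abs(values - high) < np.abs(values - low)
        new_low, new_high = values[~labels].mean(), values[labels].mean()
        if new_low == low and new_high == high:
            break
        low, high = new_low, new_high
    return labels.astype(int)


def binarize_genes(standardized):
    """Binarize each gene (row) separately across its time points."""
    return np.vstack([binarize_two_means(row) for row in standardized])
```

A typical call would be `binarize_genes(standardize_time_points(log2_expr))`, applied to each time series (alpha, cdc15, cdc28) separately or after merging, depending on how you combine datasets.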