I read about the recently published software for deconvoluting pervasive and autonomous retrotransposons. Could another calculation be added to the software’s output which estimates the abundance of ORF1 and ORF2, the parts of the retrotransposon which are translated into protein? I’m not experienced in this research area, so I am unsure of how feasible that is. I would like to make an approximation to the ORF1 and ORF2 protein abundances using RNA-seq.
Thanks for reaching out here and on GitHub. This is an interesting question and suggestion. Unfortunately, estimating the protein abundance of ORF1 and ORF2 from RNA-seq is extremely hard. There are essentially two factors that make it difficult to estimate protein abundance from transcriptome data. The first is technical: RNA-seq has a strong bias toward overrepresenting the 3′ end of transcripts, so ORF2 would most likely be overestimated. This issue is easily addressable.
The second is biological: LINE-1 is tightly regulated at many different levels. Not only is LINE-1 transcription regulated, but there are also many post-transcriptional mechanisms that either boost or stop LINE-1 translation. This is not unique to LINE-1; in general, estimating protein abundance from RNA is a hard problem (https://www.nature.com/articles/nrg3185).
That said, I’m really interested in this question. In theory, we could use machine learning algorithms to predict ORF1 and ORF2 protein levels from RNA-seq if we had enough data. This could be an interesting follow-up project after TeXP.
I would like to run your new SVFX method on some structural variants. For full disclosure, I’m working on a method to assess the pathogenicity of germline SVs, and would like to compare it with yours. Based on reading your preprint, I believe our methods are quite distinct in terms of training data. I think it’s great you’ve already put code on GitHub, but I’m not sure what data files are needed to run the code. Could you put me in touch with one of your students to help me run SVFX locally?
Thanks for your interest in SVFX. We have reported our feature list in Supplementary Table 1.
Overall, our feature list is extracted from a bunch of genomic annotations and various functional genomics/epigenomics signal files.
You can download the signal files from the IHEC or Roadmap Epigenomics data portals. As you might have noticed, we created multiple tissue-specific models for our analysis.
For the germline model, we also built a feature matrix based on the H1-hESC cell line, which performed quite well. On the SVFX GitHub page, we have uploaded the BED files for the different annotations used in our study (under the data folder).
I am currently running HiC-spector on mouse genome datasets with bin size 5kb. I noticed that it requires quite a lot of memory, so I was wondering if there were tests done on HiC-spector’s space complexity, as I couldn’t find such studies in the Supplementary Data.
We did not analyze memory usage explicitly. Because the contact maps are stored as sparse matrices, memory does not grow quadratically with the number of bins. In general, if the calculation is done chromosome by chromosome, 5kb should be fine.
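To give a rough feel for why sparse storage matters at 5kb resolution, here is a back-of-the-envelope sketch. It is only an illustration, not a measurement of HiC-spector itself: the chromosome length (about 195 Mb, roughly the size of mouse chr1), the number of nonzero contacts, and the coordinate-style layout (one value plus two indices per nonzero entry) are all assumptions, and the tool's internal format and working memory may differ.

```python
import math

def n_bins(chrom_length_bp, bin_size_bp):
    """Number of fixed-size bins needed to cover one chromosome."""
    return math.ceil(chrom_length_bp / bin_size_bp)

def dense_bytes(n, value_bytes=8):
    """Memory for a dense n x n contact matrix of 8-byte values."""
    return n * n * value_bytes

def sparse_coo_bytes(nnz, value_bytes=8, index_bytes=4):
    """Memory for a coordinate-style sparse matrix:
    one value plus a row index and a column index per nonzero entry."""
    return nnz * (value_bytes + 2 * index_bytes)

# Assumed inputs, for illustration only.
CHROM_LENGTH_BP = 195_000_000   # roughly the size of mouse chr1 (assumption)
BIN_SIZE_BP = 5_000
ASSUMED_NONZEROS = 50_000_000   # hypothetical count of nonzero contacts

bins = n_bins(CHROM_LENGTH_BP, BIN_SIZE_BP)
print(f"bins: {bins}")
print(f"dense:  {dense_bytes(bins) / 1e9:.2f} GB")
print(f"sparse: {sparse_coo_bytes(ASSUMED_NONZEROS) / 1e9:.2f} GB")
```

The dense estimate scales with the square of the number of bins, while the sparse estimate scales with the number of observed contacts, which is why processing one chromosome at a time keeps the footprint manageable.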
I am reading with interest your recent paper (Kumar, Clarke, and Gerstein, PNAS), but I suspect that supplement 1 and 2 are the same, and neither has a list of 434 genes. Could you please supply the list?
Thank you very much for your interest in the paper. Supplement 1 includes hotspot communities based on the pan-cancer analysis (i.e., when we compute statistics over multiple cancer cohorts in TCGA). In contrast, supplement 2 lists putative driver genes with hotspot communities for specific cancer types. In supplement 2, column F lists the name of the particular cancer cohort.
Regarding the number of genes, the 434 genes come from the pan-cancer analysis.
For each gene, there are multiple PDB entries. For the analysis in our paper, we selected a representative structure with the highest residue coverage. However, to be exhaustive and to allow researchers to analyze proteins of interest to them, our supplement includes all PDB entries for a given gene. We have tried to explain this in our Methods section.
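For readers who want to collapse the exhaustive supplement down to one structure per gene, the selection rule described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the tuple layout, the gene names, the structure identifiers, and the coverage values are all hypothetical placeholders, and ties are broken by keeping the first entry seen.

```python
def pick_representatives(entries):
    """Return {gene: structure_id}, keeping the entry with the highest
    residue coverage for each gene. On ties, the first entry wins."""
    best = {}
    for gene, structure_id, coverage in entries:
        if gene not in best or coverage > best[gene][1]:
            best[gene] = (structure_id, coverage)
    return {gene: structure_id for gene, (structure_id, _) in best.items()}

# Hypothetical rows: (gene, structure identifier, fraction of residues covered).
example_entries = [
    ("GENE_A", "struct_1", 0.42),
    ("GENE_A", "struct_2", 0.87),
    ("GENE_A", "struct_3", 0.60),
    ("GENE_B", "struct_4", 0.95),
    ("GENE_B", "struct_5", 0.95),
]

representatives = pick_representatives(example_entries)
print(representatives)
```

Applying a rule like this to the full supplement yields one row per gene, which is why the representative set is smaller than the number of rows in the table.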
Thanks for your quick reply, but no, this does not clear up my confusion. Please take a moment to check the link from your paper at PNAS. When I download pnas.1901156116.sd01.xlsx, the file has 217 lines (not 434) and includes the column F that breaks down by cancer type.
I am attaching our original tables to this email. It appears that the table has somehow been duplicated on the PNAS website. We will work with the PNAS team to get it fixed.
I recently read the ENCODE paper "Architecture of the human regulatory network derived from ENCODE data", and I realized that the supplementary data would greatly help me refine my project's results, in particular the files related to K562. Unfortunately, I found that none of the supplementary data files are available to download, since both of the following sites can’t be reached.
In particular, the second link is active, but if I try to download one of the files, it points to the first link and the download is interrupted. I am writing to ask if there are any other ways to access the files.