I am examining your uORF annotations with great interest, but I am unsure how to interpret a few of the entries in the file below on the GitHub site.
Complete list of predictions (complete_uORF_predictions_hg19.zip · 35.29 MB)
If you look at these two uORF_IDs:
They are annotated with the same start and end coordinates, but different start codons (ATC / ATA).
Also, looking at that region, I cannot find either start codon in the hg19 reference.
Any idea what is going on here?
Basically, the start codon here appears to overlie a splice site. Alternative splicing means you could end up with either an ATC or an ATA at that location, depending on which processed transcript you are looking at (see image below). That’s why these uORFs have the same start and end coordinates but different start codons.
We had wrestled a bit with the question of whether or not to call these two separate uORFs. However, they do have different mRNA/protein sequences, so that’s why they received separate entries in our catalog.
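The splice-site situation can be illustrated with a toy example (the sequence and coordinates below are entirely made up, not the actual locus): the first two bases of the start codon sit at the end of one exon, and alternative splicing supplies a different third base from the start of the next exon.

```python
# Toy illustration of how one genomic start coordinate can yield two start
# codons: the codon spans an exon-exon junction, and alternative splicing
# picks a different downstream exon. Sequence and coordinates are made up.

genome = "TTATGGGGGCAAAAAATAAAAA"  # hypothetical genomic sequence (0-based)

def splice(genome, exons):
    """Concatenate exon slices given as half-open [start, end) intervals."""
    return "".join(genome[s:e] for s, e in exons)

# Both isoforms begin their uORF at genomic position 2 ("AT" in exon 1)...
isoform_1 = splice(genome, [(2, 4), (9, 12)])   # next exon begins with "C"
isoform_2 = splice(genome, [(2, 4), (10, 13)])  # next exon begins with "A"

print(isoform_1[:3], isoform_2[:3])  # ATC ATA -- same start, different codon
```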
I am reading your paper “Supervised enhancer prediction with epigenetic pattern recognition and targeted validation”, and I would greatly appreciate it if you could provide some results that appear to be missing from Figure 2.
I am interested in the AUPR comparison of the matched-filter results with the peak-calling results, but I could not find the "gray" numbers.
Fig. 2 a, ….the gray numbers in the parentheses refer to the performance of the peak-based models.
Thank you for bringing this to our attention, and apologies for any confusion. We lost the numbers during one of the revisions. I am attaching an SI figure from an older version of the manuscript that answers your question.
In the table, I have compared the AUROC and AUPR of the different matched-filter models (outside parentheses) with the corresponding peak-based accuracy measures (within parentheses) for the same histone marks. In this particular case, the comparison is based on overlap with a single STARR-seq experiment, but the trends remain the same even after combining information from multiple STARR-seq experiments within the same cell line.
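For readers unfamiliar with the metric, AUPR is the area under the precision-recall curve; a minimal pure-Python sketch of average precision (one common estimator of AUPR) is below. The labels, scores, and function name are illustrative, not taken from the paper; it simply assumes binary enhancer labels and continuous model scores.

```python
def average_precision(labels, scores):
    """AUPR via average precision: mean of precision at each true-positive rank.
    labels: 1 = validated enhancer, 0 = not; scores: model outputs."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            ap += tp / rank  # precision at this recall point
    return ap / sum(labels)

# Toy data (made up): two of four candidate regions are true enhancers.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.8333...
```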
We are interested in using the HotCommics pipeline to identify hotspot communities from our own cancer mutation data. However, we are having difficulty running the pipeline because we could not find a description of the input files for the snpMapping and hotSpotCalculation steps. Could you kindly provide us with some example input files so that we can format our input appropriately?
Thank you for your interest in our work. The input file for the SNP mapping step is the input file for the VAT tool, which can be the VCF file that you are working with. Alternatively, you can use a tab-separated file with the header described below.
#CHROM hg19_pos ID Ref Alt Tumor_Sample_Barcode
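If it helps, here is a minimal sketch of writing such a tab-separated file with Python’s csv module. The header fields are the ones listed above; the single variant row is entirely hypothetical, purely to illustrate the layout.

```python
import csv
import io

# Header fields from the SNP-mapping input description above.
HEADER = ["#CHROM", "hg19_pos", "ID", "Ref", "Alt", "Tumor_Sample_Barcode"]

# One hypothetical variant row, only to show the expected column order.
rows = [["chr1", "1234567", "var1", "G", "A", "SAMPLE-01"]]

buf = io.StringIO()  # in practice, open("input.tsv", "w", newline="")
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(HEADER)
writer.writerows(rows)
print(buf.getvalue())
```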
I am just starting my first project, using the extended gene definition provided in the ENCODE cancer genomics dataset to predict gene expression. I would be incredibly grateful for an explanation of the layout of the text files. I have been trying, without success, to understand how the extended gene definition was used to interpret the mutations and expression changes in the published article.
Thanks for your interest in the research and the extended gene annotation. We are preparing a BED-formatted version of the extended gene annotation, and it will be available soon on our project website (http://encodec.encodeproject.org/). We will keep you informed.
Secondly, we are working with .vcf files in GRCh38 build. Is there a way to run ALoFT using this build, or will we need to do a liftover back down to hg19?
Currently, ALoFT cannot be used with GRCh38, and we do not have plans to upgrade it. For SNPs, we already provide exome-wide scores based on a liftover to GRCh38. However, if you want the other annotated features/scores for indels, you will need to do a liftover back down to hg19. While it is not ideal, that will work.
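For the liftover back down to hg19, one common option is CrossMap, which handles VCFs directly. The command below is an illustrative sketch, not part of the ALoFT documentation; the file names are placeholders, and it assumes you have downloaded the UCSC hg38ToHg19 chain file and a local hg19 reference FASTA.

```
# CrossMap liftover sketch: GRCh38 VCF -> hg19 VCF (file names are placeholders)
CrossMap.py vcf hg38ToHg19.over.chain.gz input_GRCh38.vcf hg19.fa output_hg19.vcf
```

Variants that do not map cleanly between builds are written to a separate unmapped file, which is worth inspecting before running ALoFT.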
Q: Would you be able to point me to the repository where all these data are stored? I am looking at the PsychENCODE repository in Synapse, but it’s not clear whether all the data presented in the publications are included there and, if so, whether they are grouped into one folder. We are particularly interested in the bulk and scRNA-seq data for now. https://www.synapse.org/#!Synapse:syn5553626
Data access approvals are handled by the NIMH through the NIMH Repository and Genomics Resources. Instructions are on the study page. If you do not have access and have questions about the process, let me know.
I was trying a couple of tools: Morph Server and RigidFinder and in both cases I get a server error indicating that files could not be written. Specifically, RigidFinder complains: "Cann’t write to file ‘/tmp/rid74285/upfile1.pdb’." and Morph Server says "Can’t create morph directory!". If this site is still being maintained, please consider addressing these issues.
Thank you for your interest in our servers and for letting us know of
problems you’ve encountered.
RigidFinder’s disk filled up. I cleared some space and it should be working now.
MolMovDB, however, is a more complicated issue. It needs an upgrade, since it is more than 15 years old. Occasionally we simply roll it back to a previous version, but then any recent submissions are lost. We also cannot guarantee when the next rollback will be.
I am contacting you as the corresponding author for the paper: "GRAM: A generalized model to predict the molecular effect of a non-coding variant in a cell-type specific manner." PLoS genetics 15.8 (2019): e1007860.
I would like to express my thanks to you and your group for developing & publishing GRAM. I have recently tested it out and the results have been most interesting.
I have begun working with eQTL analysis only recently, so I was wondering: what would you recommend as a multiple-testing correction method for GRAM score based eQTL analysis?
From the literature I have seen that standard multiple-testing correction methods such as Bonferroni and Benjamini-Hochberg have been considered too conservative for regular eQTL analysis, because they do not take linkage disequilibrium into account, and several permutation-based approaches have been published specifically for eQTL analysis as a result (e.g. eigenMT). However, as you have demonstrated that GRAM score based eQTL analysis can differentiate the regulatory effects of variants in linkage disequilibrium, I am unsure whether such methods would be appropriate here.
One of the application scenarios for GRAM is fine-mapping, which supposes that you already have a list of eQTLs and their LD-associated mutations. If you don’t have eQTLs and want to try GRAM for eQTL identification, one approach is to compare the GRAM score with a normally distributed background (using tens of thousands of randomly selected background mutations), infer a p-value for the GRAM score of a variant relative to that background, and then use the BH/FDR method for multiple-testing correction.
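The suggested procedure can be sketched in a few lines. This is an illustrative implementation, not part of the GRAM release: the function names are made up, and the normal fit to the background scores is the assumption stated above. It converts each variant’s score to a two-sided p-value against the background, then applies Benjamini-Hochberg.

```python
import math

def normal_background_pvalue(score, background):
    """Two-sided p-value of `score` under a normal fit to background scores."""
    n = len(background)
    mu = sum(background) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in background) / (n - 1))
    z = (score - mu) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (BH step-up)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank  # largest rank passing the BH threshold
    return sorted(order[:k])

# Toy example: only the first variant survives FDR control at alpha = 0.05.
print(benjamini_hochberg([0.001, 0.04, 0.3, 0.5]))  # [0]
```

In practice the background would be the tens of thousands of randomly selected mutations mentioned above, scored with GRAM.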
Frankly speaking, this is a great point for extending GRAM, and it is something we may test ourselves. The most computation-intensive part is calculating the DeepBind scores for the background variants, which will take a long time if we want to test millions of them. If you have any feedback, further questions, or preliminary findings, please feel free to let us know.