I am examining your uORF annotations with great interest, but I am unsure how to interpret a few of the entries in the file below on the GitHub site.
Complete list of predictions (complete_uORF_predictions_hg19.zip · 35.29 MB)
If you look at these two uORF_IDs:
They are annotated with the same start and end coordinates, but different start codons (ATC / ATA).
Also, looking at the region, I cannot find either start codon in the hg19 reference.
Any idea what is going on here?
Basically, the start codon here appears to overlie a splice site. Alternative splicing means you could either end up with an ATC or an ATA at that location depending on which processed transcript you are looking at (see image below). That’s why these uORFs have the same start and end coordinate, but different start codons.
We wrestled a bit with the question of whether to call these two separate uORFs. However, they do have different mRNA/protein sequences, which is why they received separate entries in our catalog.
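The splice-site explanation above can be illustrated with a small toy sketch. The exon sequences and the helper name `spliced_start_codon` are hypothetical and are not taken from the real hg19 locus; the sketch only assumes that the first two codon bases ("AT") sit at the end of the upstream exon and that the third base comes from whichever downstream exon is spliced in.

```python
def spliced_start_codon(upstream_exon, downstream_exon, codon_offset):
    """Return the 3-base codon starting at codon_offset in the spliced transcript."""
    transcript = upstream_exon + downstream_exon
    return transcript[codon_offset:codon_offset + 3]


# Hypothetical toy sequences, NOT the real hg19 locus.
upstream_exon = "GGCTAT"        # ends with the first two codon bases, "AT"
downstream_exon_a = "CGGAGC"    # one splice acceptor: next base is C
downstream_exon_b = "AGGAGC"    # alternative acceptor: next base is A

# The codon begins two bases before the exon-exon junction, so its
# genomic start coordinate is identical for both transcripts.
offset = len(upstream_exon) - 2

codon_a = spliced_start_codon(upstream_exon, downstream_exon_a, offset)
codon_b = spliced_start_codon(upstream_exon, downstream_exon_b, offset)
print(codon_a, codon_b)  # ATC ATA
```

Because the three codon bases are not adjacent in the genome, neither codon appears as a contiguous triplet at that position in the reference sequence.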
I am reading your paper “Supervised enhancer prediction with epigenetic pattern recognition and targeted validation”, and I would greatly appreciate it if you could provide some results that appear to be missing from Figure 2.
I am interested in the AUPR comparison of the matched-filter results with the peak-calling results, but I could not find the "gray" numbers.
Fig. 2 a, ….the gray numbers in the parentheses refer to the performance of the peak-based models.
Thank you for bringing this to our attention and apologies for any confusion. We lost the numbers during one of the revisions. I am attaching a SI figure from an older version of the manuscript that answers your question.
In the table, I have compared the AUROC and AUPR of different matched-filter models (outside parentheses) with the corresponding peak-based accuracy measures (within parentheses) for the same histone marks. In this particular case, the comparison is based on overlap with a single STARR-seq experiment, but the trends remain the same even after combining information from multiple STARR-seq experiments within the same cell line.
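For readers who want to run the same kind of comparison on their own predictions, here is a minimal sketch of the two metrics in plain Python. It is not the code used for the paper: the labels and scores are made-up toy values, the function names are hypothetical, and AUPR is approximated here by average precision without tie handling, which may differ from the interpolation used in the manuscript.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, a step-wise estimate of the area under the PR curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / hits


# Toy data (assumption): 1 = region overlaps a STARR-seq peak, 0 = it does not.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(round(roc, 3), round(ap, 3))
```

Running both functions once on matched-filter scores and once on peak-based scores for the same labels gives a pair of numbers comparable to one cell of the table.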
We are interested in using the HotCommics pipeline to identify hotspot
communities from our own cancer mutation data. However, we are having
difficulty running the pipeline because we could not find a
description of the input files for the snpMapping and
hotSpotCalculation steps. Could you kindly provide us with some
example input files so that we can format our input appropriately?
Thank you for your interest in our work. The input file for the SNP
mapping step is the input file for the VAT tool, which can be the VCF file
that you are working with. Alternatively, you can use a
tab-separated file with the header line shown below.
#CHROM hg19_pos ID Ref Alt Tumor_Sample_Barcode
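As a rough illustration, a file with this header could be produced as follows. Only the column names come from the reply above. The example rows, the `chr` prefix on chromosome names, the coordinate convention, and the function name `format_mutations` are assumptions made for illustration and should be checked against the pipeline.

```python
import csv
import io

# Column names as given in the reply above.
HEADER = ["#CHROM", "hg19_pos", "ID", "Ref", "Alt", "Tumor_Sample_Barcode"]


def format_mutations(rows):
    """Return tab-separated text with the header line followed by one row per mutation."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example rows; values are placeholders, not real mutations.
rows = [
    ("chr1", 123456, "mut_1", "A", "G", "SAMPLE_01"),
    ("chr2", 234567, "mut_2", "C", "T", "SAMPLE_02"),
]

text = format_mutations(rows)
print(text, end="")
```

Writing `text` to a file then gives a candidate input for the SNP mapping step.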
I am just beginning my first ever project, in which I use the extended gene definition provided in the ENCODE dataset for cancer genomics to predict gene expression. I would be incredibly grateful for an explanation of the layout of the text files. I have been trying, unsuccessfully, to understand how the extended gene was used to interpret the mutations and expression changes in the published article.
Thanks for your interest in the research and the extended gene annotation. We are preparing a BED-formatted version of the extended gene annotation, and it will be available soon on our project website (http://encodec.encodeproject.org/). We will keep you informed.
Secondly, we are working with .vcf files in the GRCh38 build. Is there a way to run ALoFT using this build, or will we need to do a liftover back down to hg19?
Currently, ALoFT cannot be used with GRCh38, and we do not have a plan to upgrade it. For SNPs, we already provide exome-wide scores based on a liftover to GRCh38. However, if you want other annotated features or scores for indels, you will need to lift your variants back over to hg19. While that is not ideal, it will work.
Q: Would you be able to point me to the repository where all these data are stored? I am looking into the PsychENCODE repository in Synapse, but it is not clear whether all the data presented in the publications are included there and, if so, whether they are grouped into one folder. We are particularly interested in the bulk RNA-seq and scRNA-seq data for now. https://www.synapse.org/#!Synapse:syn5553626
Data access approvals are handled by the NIMH through the NIMH Repository and Genomics Resources. Instructions are on the study page. If you do not have access and have questions about the process, let me know.