Question about uORF annotation in NAR paper

I am examining your uORF annotations with great interest, but I am unsure how to interpret a few of the entries in the file below from the GitHub site.

Complete list of predictions (35.29 MB)

If you look at these two uORF_IDs:



They are annotated with the same start and end coordinates, but different start codons (ATC / ATA).

Also, looking at the region, I cannot find either start codon in the hg19 reference.

Any idea what is going on here?

The start codon here appears to overlie a splice site. Because of alternative splicing, you can end up with either an ATC or an ATA at that location, depending on which processed transcript you are looking at (see image below). That is why these uORFs have the same start and end coordinates but different start codons.

We wrestled a bit with the question of whether to call these two separate uORFs. However, they do have different mRNA/protein sequences, which is why they received separate entries in our catalog.
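The splice-site explanation above can be illustrated with a small sketch. All sequences here are hypothetical and are not taken from the catalog or from hg19; the function name `codon_at_junction` is also just for illustration. The sketch shows how a codon whose first base lies at the end of one exon reads differently depending on which downstream exon is joined, and why the same three bases need not appear contiguously in the genomic sequence.

```python
def codon_at_junction(upstream_exon: str, downstream_exon: str, bases_in_upstream: int) -> str:
    """Return the 3-nt codon that straddles the exon-exon junction.

    `bases_in_upstream` is how many of the codon's bases come from the
    end of the upstream exon (1 or 2 for a codon that spans the junction).
    """
    transcript = upstream_exon + downstream_exon
    start = len(upstream_exon) - bases_in_upstream
    return transcript[start:start + 3]


# Hypothetical sequences, chosen only to reproduce the ATC / ATA pattern.
upstream_exon = "GGCCGA"        # last base "A" is the first base of the codon
downstream_exon_a = "TCGGTT"    # isoform A continues with "TC"
downstream_exon_b = "TAGGTT"    # isoform B continues with "TA"
intron = "GTAAGTCCTTTCAG"       # hypothetical intron between the exons

codon_a = codon_at_junction(upstream_exon, downstream_exon_a, 1)
codon_b = codon_at_junction(upstream_exon, downstream_exon_b, 1)

# In the unspliced genomic sequence the codon is interrupted by the intron,
# so reading three bases from the same start position gives something else.
genomic_a = upstream_exon + intron + downstream_exon_a
genomic_start = len(upstream_exon) - 1
genomic_triplet = genomic_a[genomic_start:genomic_start + 3]

print(codon_a, codon_b, genomic_triplet)
```

Both isoforms share the same first base position, so a catalog keyed on start and end coordinates would list them with identical coordinates while the spliced start codons differ.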

Supervised enhancer prediction with epigenetic pattern recognition and targeted validation

I am reading your paper “Supervised enhancer prediction with epigenetic pattern recognition and targeted validation”, and I would greatly appreciate it if you could provide some results that appear to be missing from Figure 2.

I am interested in the AUPR comparison of the matched-filter results with the peak-calling results, but I could not find the "gray" numbers.

Fig. 2a: “… the gray numbers in the parentheses refer to the performance of the peak-based models.”

Thank you for bringing this to our attention, and apologies for any confusion. We lost the numbers during one of the revisions. I am attaching an SI figure from an older version of the manuscript that answers your question.

In the table, I have compared the AUROC and AUPR of the different matched-filter models (outside parentheses) with the corresponding peak-based measures (within parentheses) for the same histone marks. In this particular case, the comparison is based on overlap with a single STARR-seq experiment, but the trends remain the same after combining information from multiple STARR-seq experiments within the same cell line.
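For readers who want to reproduce this kind of comparison on their own predictions, here is a minimal sketch of the two metrics. It is not the code used in the paper: the labels and scores are made up, AUROC is computed from pairwise rankings, and AUPR is approximated by average precision, which is one common estimator among several.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical data: 1 = region overlaps a validated enhancer, 0 = it does not.
labels = [1, 0, 1, 1, 0, 0]
matched_filter_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

mf_auroc = auroc(labels, matched_filter_scores)
mf_aupr = average_precision(labels, matched_filter_scores)
print(f"AUROC={mf_auroc:.3f} AUPR={mf_aupr:.3f}")
```

Running the same two functions on peak-based scores for the same regions would give the pair of numbers that the table reports in parentheses.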

Full set of tQTLs and isoQTLs from Wang et al. 2018

We have made great use of the publicly available PEC resources, in particular the QTL data. However, I have not been able to locate the full set of isoQTLs and tQTLs without any p-value/FDR filtering, as is available for eQTLs. Is there somewhere I can access this easily? Or does access to the full set of tQTLs and isoQTLs require an application to Synapse?

Currently, we don’t provide access to the full set. It is very large, and we need to discuss where we should share these data. I will let you know once we have any updates.

Request for example input and output files of Hotspot Community pipeline

We are interested in using the HotCommics pipeline to identify hotspot
communities from our own cancer mutation data. However, we are having
difficulty running the pipeline because we could not find a
description of the input files for the snpMapping and
hotSpotCalculation steps. Could you kindly provide some
example input files so that we can format our input appropriately?

Thank you for your interest in our work. The input file for the SNP
mapping step is the input file for the VAT tool, which can be the VCF file
that you are working with. Alternatively, you can use a
tab-separated file with the header described below.

#CHROM hg19_pos ID Ref Alt Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Info
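As a minimal sketch of producing a file with this header, the snippet below writes one tab-separated row using Python's standard `csv` module. Only the column names come from the answer above; the row values, the `chr` prefix, the `.` placeholders and the helper name `write_mutation_table` are assumptions for illustration and should be checked against what the pipeline actually expects.

```python
import csv
import io

# Column names as given in the header line above.
COLUMNS = [
    "#CHROM", "hg19_pos", "ID", "Ref", "Alt",
    "Tumor_Sample_Barcode", "Matched_Norm_Sample_Barcode", "Info",
]


def write_mutation_table(rows, handle):
    """Write the header plus one tab-separated line per mutation."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row[name] for name in COLUMNS])


# Hypothetical mutation record; every value here is a placeholder.
example_rows = [
    {
        "#CHROM": "chr1",
        "hg19_pos": 123456,
        "ID": ".",
        "Ref": "A",
        "Alt": "G",
        "Tumor_Sample_Barcode": "TUMOR_01",
        "Matched_Norm_Sample_Barcode": "NORMAL_01",
        "Info": ".",
    }
]

buffer = io.StringIO()
write_mutation_table(example_rows, buffer)
table_text = buffer.getvalue()
print(table_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.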

For the hotspot community identification, you will have to run the
community identification module for each PDB structure onto which your
mutations have mapped.

Once you have generated these communities and have a list of the PDB
structures onto which mutations have mapped, you will need to provide
that list for the hotspot calculation.

Encode for cancer genomics to predict gene expression

I am just beginning my first-ever project, which uses the extended gene definition provided in the Encode for cancer genomics dataset to predict gene expression. I would be incredibly grateful for an explanation of the layout of the text files. I have been trying, unsuccessfully, to understand how the extended gene was used to interpret the mutations and expression changes in the published article.

Thanks for your interest in the research and the extended gene annotation. We are preparing a BED-formatted extended gene annotation, which will be available soon on our project website. We will keep you informed.

ALoFT in hg38?

Secondly, we are working with .vcf files in the GRCh38 build. Is there a way to run ALoFT using this build, or will we need to do a liftover back down to hg19?

Currently, ALoFT cannot be used with build 38, and we do not have a plan to upgrade it to hg38. For SNPs, we already provide exome-wide scores based on a liftover to hg38. However, if you want other annotated features or scores for indels, that cannot be done without a liftover back down to hg19. While it is not ideal, that will work.

Data access in Psychencode repository

Q: Would you be able to point me to the repository where all these data are stored? I am looking into the PsychENCODE repository in Synapse (syn5553626), but it’s not clear whether all the data presented in the publications are included there and, if so, whether they are grouped into one folder. We are particularly interested in the bulk and single-cell RNA-seq data for now.

We have recently created a portal for easier access to the data generated through the PEC. Please see the SingleCellRNAseq study, which these data came from. Note the link under the study description for the single-cell data used in Wang et al.

Data access approvals are handled by the NIMH through the NIMH Repository and Genomics Resources. Instructions are on the study page. If you do not have access and have questions about the process, let me know.