Question about uORF annotation in NAR paper

Q:
I am examining your uORF annotations with great interest, but I am unsure how to interpret a few of the entries in the file below from the GitHub site.

Complete list of predictions (complete_uORF_predictions_hg19.zip · 35.29 MB)

If you look at these two uORF_IDs:

ENST00000307677.4.uORF_ATC.5

ENST00000422920.1.uORF_ATA.4

They are annotated with the same start and end coordinates, but different start codons (ATC / ATA).

Also, looking at the region I cannot find either start codon in the hg19 reference.

Any idea what is going on here?

A:
Basically, the start codon here appears to overlie a splice site. Alternative splicing means you could either end up with an ATC or an ATA at that location depending on which processed transcript you are looking at (see image below). That’s why these uORFs have the same start and end coordinate, but different start codons.

We had wrestled a bit with the question of whether or not to call these two separate uORFs. However, they do have different mRNA/protein sequences, so that’s why they received separate entries in our catalog.
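The splice-site explanation above can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the exon and intron sequences are made up (they are not the real hg19 sequences of these transcripts), the choice of one codon base coming from the upstream exon is arbitrary, and `spliced_codon` is a hypothetical helper, not part of the uORF pipeline. It shows how a codon that spans an exon-exon junction can read ATC in one transcript and ATA in another, while the contiguous genomic bases at the same start coordinate read as neither.

```python
def spliced_codon(upstream_exon: str, downstream_exon: str, bases_from_upstream: int) -> str:
    """Return the codon that spans an exon-exon junction in a processed transcript.

    bases_from_upstream is how many of the three codon bases (1 or 2) come
    from the end of the upstream exon; the rest come from the downstream exon.
    """
    if bases_from_upstream not in (1, 2):
        raise ValueError("a junction-spanning codon takes 1 or 2 bases from the upstream exon")
    head = upstream_exon[-bases_from_upstream:]
    tail = downstream_exon[: 3 - bases_from_upstream]
    return (head + tail).upper()


# Made-up sequences, for illustration only.
SHARED_EXON = "GGCTGCA"        # its last base is the annotated uORF start position
INTRON_START = "GTAAGT"        # genomic bases immediately after the shared exon
ACCEPTOR_EXON_A = "TCTTGGAC"   # downstream exon used by hypothetical transcript A
ACCEPTOR_EXON_B = "TATTGGAC"   # downstream exon used by hypothetical transcript B

codon_a = spliced_codon(SHARED_EXON, ACCEPTOR_EXON_A, 1)
codon_b = spliced_codon(SHARED_EXON, ACCEPTOR_EXON_B, 1)

# Reading three contiguous bases from the genome at the same start coordinate
# runs into the intron, so it matches neither transcript-level codon.
genomic_triplet = (SHARED_EXON[-1] + INTRON_START[:2]).upper()

print(codon_a, codon_b, genomic_triplet)  # ATC ATA AGT
```

Both codons share the same genomic start coordinate (the last base of the shared exon), which is consistent with two catalog entries that have identical coordinates but different start codons.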

MOAT output

Q:
I have a problem with the analysis, and I’m not sure if I am using your software properly. I am trying to calculate the mutation burden of some of my samples (similar to the measurements performed here: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0424-2#Sec2). I ended up trying to use MOAT, following the second comment of this post on Biostars (https://www.biostars.org/p/299549/). However, I cannot obtain the burden as (number of mutations)/Mb. I am running MOAT-a with the argument "--wg_signal_mode=n". Am I doing something wrong?

A:
MOAT-a wasn’t meant to be used that way. The simulated variants in MOAT-a are internal data used to calculate p-value significance for elevated mutation burden on the input annotations. You can use MOAT-s to create a simulated variant set, and then calculate (number_of_mutations)/Mb from that.
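As a rough illustration of the (number_of_mutations)/Mb calculation, here is a minimal Python sketch. It assumes the variants are available as BED-like lines with one variant per line, and that you supply the size of the territory (in bp) over which the variants were called; `count_variants`, `mutations_per_mb`, and the toy numbers are hypothetical and are not part of MOAT.

```python
from typing import Iterable


def count_variants(bed_lines: Iterable[str]) -> int:
    """Count variant records, skipping blank lines, comments, and BED headers."""
    n = 0
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        n += 1
    return n


def mutations_per_mb(n_mutations: int, territory_bp: int) -> float:
    """Mutation burden as mutations per megabase of covered territory."""
    if territory_bp <= 0:
        raise ValueError("territory_bp must be a positive number of base pairs")
    return n_mutations / (territory_bp / 1_000_000)


# Toy input: three variants over an assumed 2,000,000 bp territory.
example_bed = [
    "# toy variant set",
    "chr1\t1111476\t1111477",
    "chr1\t1632977\t1632978",
    "chr1\t1657153\t1657154",
]
burden = mutations_per_mb(count_variants(example_bed), 2_000_000)
print(burden)  # 1.5
```

The denominator matters as much as the count: use the number of bases actually covered (for example, the callable genome or the targeted regions), not a fixed genome size, if the two differ.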

Inquiry about your publication (on papillary kidney cancer)

Q:
We are working on lncRNA. We have learned a lot from your publication
“Whole-genome analysis of papillary kidney cancer finds significant
noncoding alterations”, published in PLoS Genet, and we would like to get
more detailed information about this study from you. Could you tell us the
exact mutated nucleotides and the mutation frequency in NEAT1 and
MALAT1?

If you could kindly share this information, it would be very helpful
to us.

A:
Below is the detailed mutation information in NEAT1 (BED format, hg19). The second and fifth mutations are in the same sample. The cohort size is 35.

Article Problem LARVA

Q:
I am reading your article “LARVA: an integrative framework for large-scale analysis of recurrent variants in noncoding annotations”, and I am really interested in it. But when I run the source code by following the instructions, I run into some problems.

I put all the files in the right places, and the "make" command ran successfully, as shown in the picture below.

A:
When you compile LARVA, the "larva" executable is created in the top level of the LARVA distribution, but it is NOT added to the PATH environment variable. Invoking the LARVA executable as you did would work if the "larva" executable was installed in a standard location like "/usr/bin" or "/usr/local/bin", but since the Makefile creates the executable in the same directory as the .cpp files, you need to invoke it with "./larva", so the Terminal knows to look for the executable in the current directory. Alternatively, you can add the LARVA code directory to your PATH variable like so:

export PATH=~/larva2/code:$PATH

Error reports of larva software

Q1:
I am using the LARVA software to investigate noncoding mutation hotspots, but it reported the following error message:

Error: Mutation counts file example.snv.bed has too few columns on line 1. Expected at least 5, but found 4. Exiting.

The command I used: ./larva -vf example.snv.bed -af example.anno.bed -o larva.out -b

I am quite confused, because the “example.snv.bed” file really does have 5 tab-separated columns, yet the error says only 4 were found. I have tried a lot of things but still cannot figure it out. Could you please help?

#####
The example.snv.bed file looks like this:

chrM 5650 5651 BLCA_GD blca01
chrM 8863 8864 BLCA_GD blca01
chr1 1111476 1111477 BLCA_GD blca01
chr1 1632977 1632978 BLCA_GD blca01
chr1 1657153 1657154 BLCA_GD blca01
chr1 2584370 2584371 BLCA_GD blca01
####

The example.anno.bed file looks like this:

The fourth column is the annotation info (only a subset is shown).

I would really appreciate your help.

A1:
It looks like the variant file and annotation file excerpts you attached to your email contain the same data (based on columns 1-3). I suspect that wasn’t your intended use of LARVA. Could you please send me the actual set of annotations you’re using? It would be a huge help in uncovering the root cause of the error.

Q2:
As you suggested, I think the input annotation file may be what causes the error. Actually, I do not fully understand what the annotation file should be.

In your paper published in 2015, the abstract says: “We make LARVA available as a software tool and release our highly mutated annotations as an online resource (larva.gersteinlab.org).”

So, using the highly mutated annotations you provided may be appropriate. However, this website (“larva.gersteinlab.org”) can no longer be reached. I hope you can provide some help.

Sorry to bother you with these little things. I used the RegulomeDB annotation file as LARVA’s input annotation file, and the first error I sent you last time disappeared, but there was a new error like this:

$ processing chromosomes………………….

Error: Invalid length of 0 in annotation file, line 2

Length must be greater than zero

RegulomeDB annotation file (only the first 4 columns were used): [[see image]]

A2:
I apologize for the accessibility issues with the LARVA website. There was a recent change on the backend that messed up the IP address routing to the website. I’ve contacted our IT people about the issue, but until they fix things on their end, the LARVA website can be accessed with its raw IP address: http://54.164.95.124/

Also, concerning your RegulomeDB issue, you get an "Invalid length of 0" error because the annotation on the second line uses the same coordinate for start and end. The program considers the annotation length to be (end-start), so the second annotation appears to have zero length, which doesn’t really make sense. In fact, it looks like the entire file is made of single nucleotides. This would make sense for the variant file, but for the annotation file, the intention is that the annotations represent intervals on the genome that perform some function. These are typically regions like exons, promoters, enhancers, etc. The idea is to see if these annotations are being hit with a large number of mutations. Single nucleotides don’t really match that annotation definition.
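A quick pre-flight check along these lines can catch such entries before running the program. The Python sketch below is only an illustration under stated assumptions: it does not reproduce LARVA’s own parsing, `check_bed_intervals` is a hypothetical helper, and splitting strictly on tabs and requiring at least 4 columns are assumptions about what an annotation file should look like. It applies the same (end-start) length rule described above and also reports lines with too few tab-separated columns.

```python
from typing import Iterable, List


def check_bed_intervals(lines: Iterable[str], min_columns: int = 4) -> List[str]:
    """Return a list of human-readable problems found in BED-like interval lines."""
    problems = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")  # assumption: columns must be tab-separated
        if len(fields) < min_columns:
            problems.append(
                f"line {line_no}: expected at least {min_columns} "
                f"tab-separated columns, found {len(fields)}"
            )
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append(f"line {line_no}: start/end are not integers")
            continue
        length = end - start  # same length definition as in the answer above
        if length <= 0:
            problems.append(
                f"line {line_no}: invalid length of {length} "
                "(end must be greater than start)"
            )
    return problems


# Toy annotation lines: one valid interval, one single-coordinate entry,
# and one line that uses spaces instead of tabs.
example = [
    "chr1\t1000\t2000\tpromoter_A",
    "chr1\t5000\t5000\tsnp_like_entry",
    "chr1 7000 8000 spaces_not_tabs",
]
problems = check_bed_intervals(example)
for problem in problems:
    print(problem)
```

On the toy input, the second line is flagged for zero length and the third for having only one tab-separated column, which is the kind of mismatch worth ruling out when a file looks correct in a text editor but is rejected for its column count.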

I hope this helps.

7K ncRNA gene set

Q:
We currently have in WormBase the ‘7K’ set of ncRNA genes as described in
the 2011 Integrative analysis modENCODE paper.

We have been looking at the new ENCODE/modENCODE Comparative analysis paper
in Nature.
This paper describes the supervised prediction of a set of ncRNA genes that
do not overlap existing genes.
It is not obvious where to get details of these predicted genes.

Is there a file of chromosomal locations of these genes that we can have?

Are these predicted ncRNA genes suitable for replacing the old ‘7K’ set of
ncRNA genes?

A:
Hi, yes, you can get these from encodeproject.org/comparative. I do
think these can supplement the 7K.

I’d use the new set at encodeproject.org for a smaller, higher-quality, and more conservative set than the one in the ’10 paper. -marK