Inforequest: "Annotation Transfer Between Genomes: Protein–Protein Interologs and Protein–DNA Regulogs"

Q:

I recently read one of your articles: "Annotation Transfer Between Genomes: Protein–Protein Interologs and Protein–DNA Regulogs".

If possible, could you send me the academic version of this program for Linux? Or is there a location where I can download the implementation?

A:
See the web link at http://papers.gersteinlab.org/papers/interolog

ncRNA position doesn’t match

Q:
I downloaded the psiDR to compare its results with the lincRNAs previously posted
at the UCSC web site.

I found that the following entry doesn’t match with the current hg19
positions assigned at the UCSC genome browser:

gene_id "ENSG00000224184.1"; transcript_id "ENSG00000224184.1"; gene_type
"lincRNA"; gene_status "NOVEL"; gene_name "AC096559.1"; transcript_type
"lincRNA"; transcript_status "NOVEL"; transcript_name "AC096559.1"; level 2;
tag "ncRNA_host"; havana_gene "OTTHUMG00000151709.2";

Coordinates at psiDR are: chr2:11,988,748-12,718,474

Coordinates at UCSC are: chr2:12,716,164-12,783,038

I don’t know whether this also happens with the coordinates of other
elements.

I can’t find a way to explain this difference other than a mistake in the
annotation process, but maybe I’m wrong and there is a better explanation.

A:
We use the GENCODE gene annotation model. If you check Ensembl for "ENSG00000224184.1", you will see that it matches the coordinates at psiDR.
I think the UCSC track includes the actual clone boundaries. You can e-mail the UCSC help desk; they are generally very responsive. Please bear in mind that coordinates also change a bit with updated genome assemblies as well as refined gene annotation models.

ACT software question

Q:
I tried to use your ACT software to make an aggregation plot and am slightly confused.
If possible, could you please look at my input (at the end of this email) and
tell me what I have misunderstood?

1) Why does position -2 (-15bp) have signal 6.7 instead of (10+5+7)/3=7.3?
2) Why do positions <-2 (25bp away) have any values, and how were these values obtained (as my signal is only up to 17bp)?
3) How were the values for -1 and 0 obtained?

Thank you in advance,
Hennady.

bed.txt

chr1 20 25 +
chr1 280 288 +

signal.txt

chr1 1 10
chr1 2 5
chr1 3 7
chr1 300 6
chr1 301 8
chr1 302 9

execution
-bash-3.2$ python ACT.py --nbins=10 --mbins=0 --radius=100 bed.txt signal.txt
# ACT.py --nbins=10 --mbins=0 --radius=100 bed.txt signal.txt
# annotationCount: 2
Bin Center mean stdev
-10 -95 3.5 0.0
-9 -85 3.5 0.0
-8 -75 3.5 0.0
-7 -65 3.5 0.0
-6 -55 3.5 0.0
-5 -45 3.5 0.0
-4 -35 3.5 0.0
-3 -25 3.5 0.0
-2 -15 6.7 3.12889756943
-1 -5 7.0 0.0
0 4 7.0 0.0
1 14 7.0 0.0
2 24 7.8 3.69684550214
3 34 8.0 1.41421356237
4 44 8.0 1.41421356237
5 54 8.0 1.41421356237
6 64 8.0 1.41421356237
7 74 8.0 1.41421356237
8 84 8.0 1.41421356237
9 94 8.0 1.41421356237

A:
By default, ACT assumes that the signal file is a step-wise signal. In your signal.txt file, for example, all positions on chr1 between nucleotides 3 and 300 are assigned the value 7. Each bin is therefore averaged over every position it covers, not only over the three positions listed in the file, which is why the result differs from your calculation.

In addition, your bed file has two annotations (20 to 25 and 280 to 288, both +). For the bins below -2, only the values upstream of the 280 to 288 annotation are used, because the corresponding windows upstream of the 20 to 25 annotation fall before the start of the signal.
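The step-wise reading can be checked by hand with a short script. The following is only a minimal sketch, not the ACT source code: the function names are hypothetical, and it assumes that the signal is 0 before its first entry, that bins are 10bp wide and anchored at the annotation start (with --mbins=0), and that the mean is taken over both annotations. Under these assumptions it gives the same means as the output above (3.5, 6.7, 7.0, 7.8, 8.0); it does not attempt to reproduce the stdev column.

```python
from bisect import bisect_right

def step_value(positions, values, pos):
    """Step-wise signal: the value of the most recent entry at or before pos.
    Assumption: the signal is 0.0 before the first entry."""
    i = bisect_right(positions, pos)
    return values[i - 1] if i > 0 else 0.0

def aggregate(starts, positions, values, nbins=10, radius=100):
    """Mean signal per bin, averaged over all annotation starts (+ strand only)."""
    width = radius // nbins  # 10bp bins for radius=100, nbins=10
    means = {}
    for b in range(-nbins, nbins):
        per_annotation = []
        for s in starts:
            lo = s + b * width  # first position covered by bin b
            window = [step_value(positions, values, p) for p in range(lo, lo + width)]
            per_annotation.append(sum(window) / width)
        means[b] = sum(per_annotation) / len(per_annotation)
    return means

# signal.txt from the question (chr1 only)
positions = [1, 2, 3, 300, 301, 302]
values = [10, 5, 7, 6, 8, 9]

# bed.txt from the question: annotation starts at 20 and 280
means = aggregate([20, 280], positions, values)
for b in sorted(means):
    print(b, round(means[b], 1))
```

For bin -2, for example, the window upstream of the first annotation covers positions 0 to 9 (values 0, 10, 5 and then seven 7s, mean 6.4), while the window upstream of the second annotation is all 7s, giving (6.4 + 7)/2 = 6.7.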

Yip et al 2012 Genome Biology

Q:
I really enjoyed your paper and am looking forward to using
some of the genomic regions you published at http://metatracks.encodenets.gersteinlab.org/
in my research.

I had a couple of questions about them.

BARs: are those the regions predicted by the random forest, or are they
the training set (bins overlapped by a TF ChIP-seq peak)?

PRMs: I may have missed it, but what is the definition of a "promoter"?
I’m guessing it was -1000 to +200bp around a TSS.
(This is to clarify the sentence "bins at the TSSs of expressed genes"
at the bottom of page 17.)

Since the PRMs don’t all span the same genomic distance, I presume
that only bins predicted by the random forest classifier are included
in the files?

Finally, do you have plans to make (or have already made) available
the software for creating region files of BARs, DRMs and DRM-targets
in other tissues?

A:
The BARs are the output regions of Random Forest. They do overlap substantially with the input training sets, though.

The positive examples for learning PRMs are the 100bp bins at exactly the TSSs of expressed genes. Random Forest then learned the feature patterns of these bins, and searched for similar bins in the whole genome.

After the predictions, adjacent bins all predicted as PRMs were merged to form regions. The files available on the supplementary web site contain these regions.
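The merging step can be illustrated with a few lines of Python. This is only a sketch of the general idea, not the program used for the paper: the function name is hypothetical, and it assumes half-open (chrom, start, end) bins, so that two bins are adjacent when one ends where the next begins.

```python
def merge_adjacent_bins(bins):
    """Merge bins on the same chromosome that touch or overlap.

    bins: iterable of (chrom, start, end) tuples with half-open coordinates
    (an assumption of this sketch). Returns a sorted list of merged regions.
    """
    merged = []
    for chrom, start, end in sorted(bins):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged

# Hypothetical 100bp bins predicted as PRMs
predicted = [
    ("chr1", 100, 200),
    ("chr1", 200, 300),
    ("chr1", 500, 600),
    ("chr2", 0, 100),
]
regions = merge_adjacent_bins(predicted)
print(regions)
```

Because runs of adjacent predicted bins can have any length, the merged regions do not all span the same genomic distance.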

Since the computer programs were written based on the available data from ENCODE, they were not written in a way that can be easily adapted to other situations. We do not currently have a plan to make them available.