I really enjoyed your paper and am looking forward to using
some of the genomic regions you published at http://metatracks.encodenets.gersteinlab.org/
in my research.
I had a couple of questions about them.
BARs–are those the regions predicted by the random forest, or are they
the training set (bins overlapped by a TF ChIP-seq peak)?
PRMs–I may have missed it, but what is the definition of a "promoter"?
I’m guessing it was -1000 to +200bp around a TSS.
(This is to clarify the sentence "bins at the TSSs of expressed genes"
at the bottom of page 17.)
Since the PRMs don’t all span the same genomic distance, I presume
that only bins predicted by the random forest classifier are included
in the files?
Finally, do you have plans to make (or have already made) available
the software for creating region files of BARs,DRMs and DRM-targets
in other tissues?
The BARs are the output regions of Random Forest. They do greatly overlap with the input training sets though.
The positive examples for learning PRMs are the 100bp bins at exactly the TSSs of expressed genes. Random Forest then learned the feature patterns of these bins, and searched for similar bins in the whole genome.
After the predictions, adjacent bins all predicted as PRMs were merged to form regions. The files available on the supplementary web site contain these regions.
Since the computer programs were written based on the available data from ENCODE, they were not written in a way that can be easily adopted to other situations. We do not currently have a plan to make them available.