Question re ENCODE data on website

Q:

I’ve been incorporating the ENCODE data from your webpage
(http://encodenets.gersteinlab.org/) in my analyses. The data is fantastic,
but I have some questions regarding the
enets*.GM_proximal_*filtered_network.txt data sets.

The filtered dataset actually contains more regulators than the unfiltered
data set, making me speculate that the unfiltered data file is not complete:
[bb447@compute-8-2 TF]$ cut -f1 enets6.GM_proximal_unfiltered_network.txt | sort -u | wc -l
50
[bb447@compute-8-2 TF]$ cut -f1 enets8.GM_proximal_filtered_network.txt | sort -u | wc -l
67

Could it be possible that the file is incomplete?

A:
The updated files have been uploaded to the site. Thanks again for pointing this out.
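
For readers who want to repeat the check on the updated files, here is a minimal Python sketch of the same comparison. It assumes, as the `cut -f1` commands in the question do, that the first tab-separated column holds the regulator. The function names are illustrative and not part of any encodenets tooling. Unlike `sort -u | wc -l`, it also lists which regulators appear only in the filtered file.

```python
from typing import Dict, Iterable, Set


def regulators(lines: Iterable[str]) -> Set[str]:
    """Unique values of the first tab-separated column (blank lines are skipped)."""
    return {line.rstrip("\n").split("\t")[0] for line in lines if line.strip()}


def compare_regulators(unfiltered: Iterable[str], filtered: Iterable[str]) -> Dict[str, object]:
    """Count regulators in each file and list those seen only in the filtered one."""
    unfiltered_set = regulators(unfiltered)
    filtered_set = regulators(filtered)
    return {
        "unfiltered": len(unfiltered_set),
        "filtered": len(filtered_set),
        "only_in_filtered": sorted(filtered_set - unfiltered_set),
    }


# Usage with the files named in the question (paths are relative to your download):
# with open("enets6.GM_proximal_unfiltered_network.txt") as u, \
#      open("enets8.GM_proximal_filtered_network.txt") as f:
#     print(compare_regulators(u, f))
```

If the unfiltered file is complete, `only_in_filtered` should come back empty, since filtering can only remove regulators.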

Architecture of the human regulatory network derived from ENCODE data

Q:

Re: Architecture of the human regulatory network derived from ENCODE data
10.1038/nature11245

Hi Dr. Gerstein: This is a very nice paper and is very important in my
current study. Do you have tools/software for the TF co-association
analysis (figure 1 and supplemental sections B and C) mentioned in this
paper? If so, can I get it?

A:
Anshul did the co-association analysis for this Networks paper. I
think he knows that part the best.

As for the co-association analysis in the ENCODE main paper, it can
be repeated using the GSC package available at the ENCODE statistics web
site (http://www.encodestatistics.org/). The first thing you need to do
is to determine (manually or by other means) a segmentation of the
genome in which TF binding is assumed to be stationary within each
segment. If you have no specific preference on how the segmentation
should be done, you can use the GSC Python segmentation tool, which will
try to perform an automatic segmentation (the results improve if you
have more data). Then you can run the GSC Python program to perform
segmented block sampling and compute pairwise p-values for your
binding data.
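
To make the idea of segmented block sampling concrete, here is a simplified Python sketch. It is not the GSC implementation and does not reproduce its statistics or command-line interface. It assumes two binary binding tracks of equal length, a list of segment boundaries, and a hypothetical block length, and it resamples one track from randomly placed circular blocks drawn within each segment. All function names and parameters are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def overlap(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where both binary tracks are 1."""
    return sum(x & y for x, y in zip(a, b))


def resample_segment(track: Sequence[int], start: int, end: int,
                     block_len: int, rng: random.Random) -> List[int]:
    """Rebuild track[start:end] from randomly placed circular blocks of that segment."""
    seg = track[start:end]
    n = len(seg)
    out: List[int] = []
    while len(out) < n:
        offset = rng.randrange(n)
        out.extend(seg[(offset + i) % n] for i in range(block_len))
    return out[:n]


def block_sampling_pvalue(a: Sequence[int], b: Sequence[int],
                          boundaries: Sequence[int], block_len: int = 5,
                          n_samples: int = 1000, seed: int = 0) -> Tuple[int, float]:
    """Empirical p-value for the overlap of a and b under segment-wise block resampling of a."""
    rng = random.Random(seed)
    observed = overlap(a, b)
    segments = list(zip(boundaries[:-1], boundaries[1:]))
    hits = 0
    for _ in range(n_samples):
        resampled: List[int] = []
        for start, end in segments:
            resampled.extend(resample_segment(a, start, end, block_len, rng))
        if overlap(resampled, b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_samples + 1)
```

Resampling whole blocks rather than single positions preserves local clustering of binding sites, and restricting blocks to their own segment respects the assumption that binding is stationary only within a segment.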


Data received – Re: Your model and input data to the “…integrative analysis of transcription factor binding data” paper

Q:
Many thanks for the excellent ENCODE papers! This is an unprecedented source for life scientists, and we appreciate that accordingly!

Would you be so kind as to give us access to the model and input data for your random forest model that predicts gene expression based on transcription factor binding?

Could you please also name the source of TSS CAGE? At UCSC, our only suspects were the Riken CAGE*TSS files, or CSHL LongRNA and ShortRNA files.
We would like to run and adapt your model to the extremely tight co-regulation of ribosomal protein genes. We believe that the ENCODE TFs may account for a major part of their regulation.

Naturally, we would properly cite your works (incl. Cheng & Gerstein, 2011). Should you prefer, we are open to any reasonable forms of collaboration.

A:

See http://archive.gersteinlab.org/proj/chromodel

The human TSS CAGE data are from Roderic’s Lab.

Here is the human CAGE TSS file:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_June2011.gff.gz

Here is a readme file:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_June2011.txt

And here are some additional explanations of how the file was made:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_june2011.pdf
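
Once the .gff.gz file is downloaded, it can be read with a few lines of Python. The sketch below assumes only the generic GFF layout (tab-separated columns: seqname, source, feature, start, end, score, strand, frame, and an optional attributes column). It does not interpret the attributes column, because the readme above is the authority on its contents. The names `GffRecord`, `parse_gff`, and `read_gff_gz` are illustrative.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: str


def parse_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Yield one record per data line; comments, blanks, and short lines are skipped."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a standard GFF data line
        attributes = fields[8] if len(fields) > 8 else ""
        yield GffRecord(fields[0], fields[1], fields[2], int(fields[3]),
                        int(fields[4]), fields[5], fields[6], fields[7], attributes)


def read_gff_gz(path: str) -> Iterator[GffRecord]:
    """Stream records from a gzip-compressed GFF file."""
    with gzip.open(path, "rt") as handle:
        yield from parse_gff(handle)


# Usage, after downloading the file named above:
# for record in read_gff_gz("Gencodev7_CAGE_TSS_clusters_June2011.gff.gz"):
#     print(record.seqname, record.start, record.end, record.strand)
```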

ENCODE-Networks Source Code for Context-Specific TF Co-Association Analyses

Q:
Hello,
I am interested in your paper published in Nature, 06 September 2012, “Architecture of the human regulatory network derived from ENCODE data”. In particular, we are interested in the framework of context-specific TF co-association analysis described in this paper. We would like to apply this method on our in-house datasets. It’s exciting that the code for these analyses is “Available soon” (the file “enets21.coassoc-code.tgz” on http://encodenets.gersteinlab.org/). Do you know whether the code for co-association analysis in this paper is available now? If so, it might save us a lot of time. Thanks for your help!

A:
The main machine learning method used for the analysis is RuleFit3, which is available here:
http://statweb.stanford.edu/~jhf/r-rulefit/rulefit3/R_RuleFit3.html

Detailed instructions on preparing the input data and computing the various scores are in the supplement of the paper.

I don’t have a polished code package that is ready for use by the general public. The code that I wrote for the analyses in the paper is here: https://code.google.com/p/tf-coassociation/source/browse/#svn%2Ftrunk%2Fscripts . But I have to warn you that it’s not designed to work on general datasets, as it includes scripts that were designed to run on our local cluster. The core functions are in
https://code.google.com/p/tf-coassociation/source/browse/trunk/scripts/assoc.matrix.utils.R . The code is reasonably well commented, so hopefully it will help.

missing citations in encodenets supplement

Q:

With regard to the paper published in Nature, “Architecture of the human regulatory network derived from ENCODE data”, I have been perusing the Supplementary Information and find that reference No. 69 seems, to the best of my belief, to have been mapped incorrectly. I would like to provide a quote which, in my understanding, promises a reference to a RuleFit3 manuscript but instead corresponds to a paper concerning Transcriptional Regulation in Mast Cells:

The number of rules is not set a priori but is rather learned from the data itself. Details are provided in the RuleFit3 manuscript69. -P. 14/271

69 Bockamp, E. O. et al. Transcriptional regulation of the stem cell leukemia gene by PU.1 and Elf-1. J. Biol. Chem. 273, 29032-29042 (1998).

A:

It turns out that references 69-71 in section C2 of the supplementary material were not correctly added to the reference list. References 69-71 in later sections refer to the correct articles. Below are the correct citations for refs 69-71 in section C2 of the supplement.

RuleFit3 (ref 69)
Friedman, J. H. & Popescu, B. E. Predictive Learning Via Rule Ensembles. Annals Applied Stat. 2, 916-954, doi:10.1214/07-Aoas148 (2008).
http://dx.doi.org/10.1214/07-Aoas148

the well-known random forest algorithm (ref 70)
Breiman, L. Random forests. Mach Learn 45, 5-32, doi:10.1023/A:1010933404324 (2001). http://dx.doi.org/10.1023/A:1010933404324

the GREAT Functional Annotation server (ref 71)
McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology 28, 495-U155, doi:10.1038/nbt.1630 (2010). http://dx.doi.org/10.1038/nbt.1630
http://great.stanford.edu/