Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors

Q:
in your research paper "Classification of human genomic regions based
on experimentally determined binding sites of more than 100
transcription-related factors" you identified regions with extremely
high and low degrees of co-binding, termed respectively "HOT" and "LOT"
regions. Since I’m very interested in this classification, I tried to
reproduce this analysis on transcription-related factor (TRF) data of a
particular cell type.

I first downloaded the Genome Structure Correction scripts from

http://www.encodestatistics.org/svn/genome_structural_correction/python_encode_statistics/trunk/

and ran "block_bootstrap.py" on every pair of TRF data, thus obtaining
a matrix with z-scores. I then computed a raw z-score for each TRF,
defined as the average z-score with all other TRFs in the matrix. I
finally sorted these raw z-scores numerically and normalized them
linearly, so as to assign a weight of 1 to the TRF with the lowest raw
z-score and a weight of 1/n to the TRF with the highest raw z-score.
I’m afraid it is not clear to me what I should do next for the
identification of HOT and LOT regions: I would be very grateful if you
could help me with this analysis.

A:
The z-scores that you have computed can be considered the global weights of the TRFs: a TRF that more frequently binds to the same locations as other TRFs receives a lower weight, in order to de-emphasize global co-binding effects.

For each bin j, the weighted binding score (i.e., the degree of region-specific co-occurrence) was computed as d_j = \sum_i w_i s_{ij}, where i iterates over all TRFs, w_i is the weight of TRF i as defined above, and s_{ij} is its discretized binding signal at bin j (values 1 to 5, with 5 for the top 25th percentile and 1 for zeros). The 1% of bins with the highest d_j were defined as the HOT regions, while the 1% of bins with the lowest non-zero d_j were defined as the LOT regions.
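To make the two steps concrete, here is a minimal sketch in Python of the whole computation (weights from the pairwise z-score matrix, then the weighted binding scores and the HOT/LOT cutoffs). This is a hypothetical helper written for illustration, not the original scripts; in particular, the quartile binning of the non-zero signals and the reading of "lowest non-zero d_j" as "lowest d_j among bins with at least one bound TRF" are assumptions:

```python
import numpy as np

def hot_lot_regions(z, signal, frac=0.01):
    """Illustrative sketch of the HOT/LOT computation (not the original code).

    z      : (n, n) matrix of pairwise GSC z-scores between n TRFs
    signal : (n, m) binding signal of each TRF in each of m bins
    """
    n, m = signal.shape

    # Raw z-score of each TRF: average z-score with all other TRFs.
    raw = (z.sum(axis=1) - np.diag(z)) / (n - 1)

    # Sort and normalize linearly: the TRF with the lowest raw z-score
    # gets weight 1, the one with the highest gets weight 1/n.
    rank = np.empty(n, dtype=int)
    rank[np.argsort(raw)] = np.arange(n)
    w = 1.0 - rank / (n - 1) * (1.0 - 1.0 / n)

    # Discretize signals into 1-5: zeros -> 1, non-zero values -> 2..5
    # by quartile (5 = top 25th percentile).  The exact binning of the
    # non-zero values is an assumption here.
    s = np.ones_like(signal, dtype=int)
    for i in range(n):
        nz = signal[i] > 0
        if nz.any():
            q = np.quantile(signal[i][nz], [0.25, 0.5, 0.75])
            s[i, nz] = 2 + np.searchsorted(q, signal[i][nz], side="right")

    # Weighted binding score per bin: d_j = sum_i w_i s_ij.
    d = w @ s

    # Top 1% of bins -> HOT; bottom 1% among bins with at least one
    # bound TRF -> LOT (one reading of "lowest non-zero d_j").
    k = max(1, int(frac * m))
    hot = np.argsort(d)[-k:]
    bound = np.where((signal > 0).any(axis=0))[0]
    lot = bound[np.argsort(d[bound])[:k]]
    return w, d, hot, lot
```

Note that because zeros map to s_{ij} = 1, every bin gets a positive d_j; restricting the LOT candidates to bins with some binding is one way to make "non-zero" meaningful under that encoding.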

Please note that in the original calculations, when the block sampling program was run, the human genome was segmented into three classes, namely DNase hypersensitive peaks, DNase hypersensitive non-peak hotspots, and other regions. The idea was that the prior TRF binding probabilities in these three classes of regions could be quite different, and thus they should be considered separately during the sampling process.

Questions re tool for coevolution analysis

Q:
We are interested in
adding a link to your tool on the website. I have been playing around with
the tool but have been having some difficulty. I loaded the example, ran the
analysis and downloaded the results with no problem. Then, I tried to use
the data from the example to re-run the analysis by uploading the PF01036
fasta file (as an MSA) and listing BACR_HALSA as the reference sequence. I
did not load the tree data or the structure data. I submitted the request
and received the following error:

The coevolution analysis task that you submitted at 2013-09-11 11:35:07.0
could not be completed.

Error message:Not enough sequences.

In order to justify the addition of the link to the website, we need to make
sure that the web interface is simple and easy to use. Can you help me
understand what the problem might be (clearly I have enough sequences since
the example runs without any issues)? Also, can you explain to me the link
between the 1C3W reference structure and the sequence data?

Any help you can give would be greatly appreciated.

A:
The error message is due to an internal check of the system. Since
coevolution analysis requires a good number of sequences to give
reliable statistics, there is a minimum threshold of the number of
sequences after the filtering step. When the example is loaded, I think
the filtering settings are preset so that it passes the minimum
threshold. When an MSA is uploaded manually, the default settings could
be different.

You may try changing the setting for the minimum number of
sequences in the "Advanced options" section (which is by default hidden)
and rerun. Let me know if you still get any error message.

Completion of job on coevolution server

Q:

we are trying to find the co-evolving positions in a protein family of interest. I had submitted a job on the co-evolution server several days back, but I have not received a response yet. Could you please let me know the estimated time of completion of my job?

A:
There was a long queue of pending tasks, and one of them had been stuck in the queue for some time. I have removed it to let the others run. Please see if you can get your results within a day. If not, please let me know and I will check the system again.

Architecture of the human regulatory network derived from ENCODE data

Q:

Re: Architecture of the human regulatory network derived from ENCODE data
10.1038/nature11245

Hi Dr. Gerstein: This is a very nice paper and is very important in my
current study. Do you have tools/software for TF Co-association (figure 1
and supplemental section B and C) mentioned in this paper. Can I get it?

A:
Anshul did the co-association analysis for this Networks paper. I
think he knows that part the best.

As for the co-association analysis in the ENCODE main paper, it can
be repeated using the GSC package available at the ENCODE statistics web
site (http://www.encodestatistics.org/). The first thing you need to do
is to determine (manually or by other means) a segmentation of the
genome, where TF binding is assumed segment-wise stationary. If you have
no specific preference on how the segmentation should be done, you can
use the GSC Python segmentation tool to do that, which will try to
perform an automatic segmentation (the results of which would be better
if you have more data). Then you can run the GSC Python program to
perform segmented block sampling to compute pairwise p-values of your
binding data.
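If what the block sampling reports for a pair is a z-score rather than a p-value, and the z-scores are approximately standard normal under the null (which is what the block bootstrap aims for), a two-sided p-value can be recovered with the complementary error function. This small helper is an illustration of that conversion, not part of the GSC package:

```python
import math

def z_to_pvalue(z):
    """Two-sided p-value for a z-score, assuming a standard normal null."""
    # 2 * Phi(-|z|), expressed via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, a z-score of about 1.96 corresponds to a two-sided p-value of about 0.05.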


DREAM 3 challenge & paper “Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data”

Q:

I am interested in exploring further the work done by you and your team
members in the DREAM 3 challenge, as reported in the paper stated below. Do you
provide the code/program for the public to view? Thanks.

"Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data"

I am OK with the current software, which you said is quite tailored to the competition. Please send it to me. Really appreciate it. Thanks.

A:
The current form of the software is quite tailored for the
competition, and we do not have a general, publicly distributable
version. I can send it to you if you think it would be useful.

Please find the version that we submitted to DREAM attached, together with some data and some script files for running it. If you have Apache Ant installed, simply issue the command "ant runall3" to run the program on the DREAM3 files. The size-10 networks are included, and the size-50 and size-100 networks can be downloaded from the DREAM web site.

PDB data for: Relating Three-Dimensional Structures to Protein Networks Provides Evolutionary Insights

Q:

Regarding your seminal paper "Relating Three-Dimensional Structures to
Protein Networks Provides Evolutionary Insights".
Amongst the supplementary data I could not find the PDB entries that were
used for each interaction in the SIN.
I would much appreciate it if you could send me this data.

A:
The info should be on the site:
http://networks.gersteinlab.org/structint

Yip et al 2012 Genome Biology

Q:
I really enjoyed your paper and am looking forward to using
some of the genomic regions you published at http://metatracks.encodenets.gersteinlab.org/
in my research.

I had a couple of questions about them.

BARs–are those the regions predicted by the random forest, or are they
the training set (bins overlapped by a TF ChIP-seq peak)?

PRMs–I may have missed it, but what is the definition of a "promoter"?
I’m guessing it was -1000 to +200bp around a TSS.
(This is to clarify the sentence "bins at the TSSs of expressed genes"
at the bottom of page 17.)

Since the PRMs don’t all span the same genomic distance, I presume
that only bins predicted by the random forest classifier are included
in the files?

Finally, do you have plans to make (or have already made) available
the software for creating region files of BARs, DRMs and DRM-targets
in other tissues?

A:
The BARs are the output regions of the Random Forest. They do overlap greatly with the input training sets, though.

The positive examples for learning PRMs are the 100bp bins at exactly the TSSs of expressed genes. Random Forest then learned the feature patterns of these bins, and searched for similar bins in the whole genome.

After the predictions, adjacent bins all predicted as PRMs were merged to form regions. The files available on the supplementary web site contain these regions.
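As a side note on the merging step, here is a minimal sketch of how adjacent predicted bins could be merged into regions. It is a hypothetical helper written for illustration (not the original ENCODE code), assuming each predicted bin is identified by its start coordinate and bins are 100 bp wide:

```python
def merge_adjacent_bins(starts, bin_size=100):
    """Merge runs of adjacent predicted bins into (start, end) regions.

    starts: sorted start coordinates of the bins predicted as PRMs,
    assumed to be multiples of bin_size.  Illustrative helper only.
    """
    regions = []
    for start in starts:
        if regions and start == regions[-1][1]:
            regions[-1][1] = start + bin_size  # bin touches the last region
        else:
            regions.append([start, start + bin_size])
    return [tuple(r) for r in regions]
```

For example, bins starting at 0, 100 and 200 merge into the single region (0, 300), while an isolated bin at 500 becomes (500, 600).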

Since the computer programs were written based on the available data from ENCODE, they were not written in a way that can be easily adapted to other situations. We do not currently have a plan to make them available.