Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors

Q:
In your research paper "Classification of human genomic regions based
on experimentally determined binding sites of more than 100
transcription-related factors" you identified regions with extremely
high and low degrees of co-binding, termed respectively "HOT" and "LOT"
regions. Since I’m very interested in this classification, I tried to
reproduce this analysis on transcription-related factor (TRF) data of a
particular cell type.

I first downloaded the Genome Structure Correction scripts from

http://www.encodestatistics.org/svn/genome_structural_correction/python_encode_statistics/trunk/

and ran "block_bootstrap.py" on every pair of TRF data, thus obtaining
a matrix with z-scores. I then computed a raw z-score for each TRF,
defined as the average z-score with all other TRFs in the matrix. I
finally sorted these raw z-scores numerically and normalized them
linearly, so as to assign a weight of 1 to the TRF with the lowest raw
z-score and a weight of 1/n to the TRF with the highest raw z-score.
I’m afraid it is not clear to me what I should do next for the
identification of HOT and LOT regions: I would be very grateful if you
could help me with this analysis.

A:
The weights that you have derived from the z-scores can be considered the global weights of the TRFs: a TRF that more frequently binds to the same locations as other TRFs receives a lower weight, in order to de-emphasize the global co-binding effects.
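As a rough illustration of the weighting described in the question, the sketch below takes a square matrix of pairwise z-scores, averages each TRF's z-scores against all other TRFs, and assigns weights that fall linearly with rank. The function name `trf_weights` and the rank-based reading of "normalized linearly" are assumptions made for this example; they are not taken from the paper or from the Genome Structure Correction scripts.

```python
def trf_weights(z):
    """Return (raw, weights) for a square matrix z of pairwise z-scores.

    z[i][k] is the z-score between TRF i and TRF k; the diagonal is ignored.
    """
    n = len(z)
    # Raw z-score of each TRF: average z-score against all other TRFs.
    raw = [sum(z[i][k] for k in range(n) if k != i) / (n - 1) for i in range(n)]
    # Order TRFs from lowest to highest raw z-score (ties keep input order).
    order = sorted(range(n), key=lambda i: raw[i])
    weights = [0.0] * n
    for rank, i in enumerate(order):
        # Assumption: weights are linear in rank, so the lowest raw z-score
        # gets 1 and the highest gets 1/n.
        weights[i] = (n - rank) / n
    return raw, weights


if __name__ == "__main__":
    # Toy 3-TRF matrix, for illustration only.
    z = [[0.0, 2.0, 4.0],
         [2.0, 0.0, 6.0],
         [4.0, 6.0, 0.0]]
    raw, weights = trf_weights(z)
    print(raw, weights)
```

If the normalization should instead be linear in the raw z-score values rather than in their ranks, only the loop that fills `weights` would need to change.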

For each bin j, the weighted binding score (i.e., the degree of region-specific co-occurrence) was computed as d_j = \sum_i w_i s_ij, where i iterates over all TRFs, w_i is the weight of TRF i as defined above, and s_ij is its discretized binding signal at bin j (on a scale of 1 to 5, with 5 for signals in the top 25% and 1 for zeros). The 1% of bins with the highest d_j were defined as the HOT regions, while the 1% of bins with the lowest non-zero d_j were defined as the LOT regions.
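The scoring and selection steps could be sketched as follows. This is only an illustration: the function names `weighted_scores` and `hot_lot` are hypothetical, the discretized signals s_ij are taken as already-computed input, and the 1% cutoff is assumed to be counted over all bins and rounded down (with at least one bin kept). The "non-zero" filter is applied literally as d_j > 0.

```python
def weighted_scores(weights, s):
    """Compute d_j = sum_i w_i * s_ij for every bin j.

    weights[i] is the weight of TRF i; s[i][j] is the discretized
    binding signal (1-5) of TRF i at bin j.
    """
    n_bins = len(s[0])
    return [sum(w * row[j] for w, row in zip(weights, s)) for j in range(n_bins)]


def hot_lot(d, fraction=0.01):
    """Return (hot, lot): sorted bin indices with the highest scores and
    with the lowest non-zero scores."""
    # Assumption: the cutoff is a fraction of all bins, rounded down,
    # keeping at least one bin.
    k = max(1, int(fraction * len(d)))
    by_score = sorted(range(len(d)), key=lambda j: d[j])
    hot = sorted(by_score[-k:])
    # Literal reading of "lowest non-zero d_j": drop bins with d_j == 0.
    nonzero = [j for j in by_score if d[j] > 0]
    lot = sorted(nonzero[:k])
    return hot, lot


if __name__ == "__main__":
    # Toy data: 2 TRFs, 4 bins, for illustration only.
    weights = [1.0, 0.5]
    s = [[1, 5, 3, 1],
         [1, 5, 1, 2]]
    d = weighted_scores(weights, s)
    print(d, hot_lot(d, fraction=0.25))
```

On real data the bins would number in the millions, so a vectorized array library would be a more practical choice than plain lists; the logic would stay the same.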

Please note that in the original calculations, when the block sampling program was run, the human genome was segmented into three classes: DNase hypersensitive peaks, DNase hypersensitive non-peak hotspots, and other regions. The idea was that the prior TRF binding probabilities could differ considerably among these three classes of regions, and thus they should be considered separately during the sampling process.
