Question regarding data in paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:
As described in your paper entitled "Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors", it is mentioned "we identified 13,539 potential enhancers (full list available in the Additional files), among which 50 were randomly chosen". But in the additional files, only 50 enhancer coordinates are listed. Could you please provide me with either the source or the full list of all 13,539 enhancers?

Many thanks in anticipation of your quick reply,

A:
see http://encodenets.gersteinlab.org/metatracks

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors

Q:
in your research paper "Classification of human genomic regions based
on experimentally determined binding sites of more than 100
transcription-related factors" you identified regions with extremely
high and low degrees of co-binding, termed respectively "HOT" and "LOT"
regions. Since I’m very interested in this classification, I tried to
reproduce this analysis on transcription-related factor (TRF) data of a
particular cell type.

I first downloaded the Genome Structure Correction scripts from

http://www.encodestatistics.org/svn/genome_structural_correction/python_encode_statistics/trunk/

and ran "block_bootstrap.py" on every pair of TRF data, thus obtaining
a matrix with z-scores. I then computed a raw z-score for each TRF,
defined as the average z-score with all other TRFs in the matrix. I
finally sorted these raw z-scores numerically and normalized them
linearly, so as to assign a weight of 1 to the TRF with the lowest raw
z-score and a weight of 1/n to the TRF with the highest raw z-score.
I’m afraid it is not clear to me what I should do next for the
identification of HOT and LOT regions: I would be very grateful if you
could help me with this analysis.

A:
The normalized z-scores that you have computed can be considered the global weights of the TRFs: a TRF that more frequently binds to the same locations as other TRFs receives a lower weight, in order to de-emphasize global co-binding effects.
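
As a rough illustration of the weighting described in the question, here is a minimal sketch. It assumes that "normalized linearly" means a rank-based scheme (lowest raw z-score gets 1, highest gets 1/n, evenly spaced in between); that reading, the function name `trf_weights`, and the nested-dictionary input layout are assumptions made for this example, not details confirmed by the paper.

```python
def trf_weights(z):
    """Turn pairwise z-scores into per-TRF weights.

    z maps each TRF name to a dict of {other TRF name: pairwise z-score}.
    (Hypothetical input layout chosen for this sketch.)
    """
    n = len(z)
    # Raw z-score of a TRF = average z-score with all other TRFs.
    raw = {trf: sum(row.values()) / len(row) for trf, row in z.items()}
    # Sort from lowest to highest raw z-score.
    order = sorted(raw, key=raw.get)
    # Assumed rank-based linear normalization: rank 0 -> 1, rank n-1 -> 1/n.
    return {trf: (n - rank) / n for rank, trf in enumerate(order)}


pairwise = {
    "A": {"B": 2.0, "C": 4.0},
    "B": {"A": 2.0, "C": 6.0},
    "C": {"A": 4.0, "B": 6.0},
}
weights = trf_weights(pairwise)
```

In this toy input, "A" has the lowest average z-score and so receives the largest weight, while "C" receives the smallest.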

For each bin j, the weighted binding score (i.e., the degree of region-specific co-occurrence) was computed as d_j = \sum_i w_i s_{ij}, where i iterates over all TRFs, w_i is the weight of TRF i as defined above, and s_{ij} is its discretized binding signal at bin j (on a scale of 1 to 5, with 5 for the top 25% of signals and 1 for zeros). The 1% of bins with the highest d_j were defined as the HOT regions, while the 1% of bins with the lowest non-zero d_j were defined as the LOT regions.
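
The scoring step above can be sketched as follows. This is only an illustration under stated assumptions: the signals are taken as already discretized to 1-5, "non-zero" is interpreted as "at least one TRF has binding signal in the bin", and the 1% for LOT regions is taken from those bound bins. The function names and the `frac` parameter are hypothetical.

```python
def weighted_scores(weights, signals):
    """Compute d_j = sum_i w_i * s_ij for every bin j.

    weights[i] is w_i; signals[i][j] is the discretized signal s_ij (1..5).
    """
    n_bins = len(signals[0])
    return [
        sum(w * row[j] for w, row in zip(weights, signals))
        for j in range(n_bins)
    ]


def hot_lot(weights, signals, frac=0.01):
    """Return (d, hot_bins, lot_bins) as lists of scores and bin indices."""
    d = weighted_scores(weights, signals)
    n_bins = len(d)
    # Assumption: a bin counts as "non-zero" if any TRF has signal above 1.
    bound = [j for j in range(n_bins) if any(row[j] > 1 for row in signals)]
    k_hot = max(1, int(frac * n_bins))
    # Assumption: the LOT fraction is taken over bound bins only.
    k_lot = max(1, int(frac * len(bound)))
    hot = sorted(range(n_bins), key=lambda j: d[j], reverse=True)[:k_hot]
    lot = sorted(bound, key=lambda j: d[j])[:k_lot]
    return d, hot, lot


# Toy example: 2 TRFs, 4 bins, with a large fraction so each set is non-empty.
d, hot, lot = hot_lot([1.0, 0.5], [[1, 5, 2, 1], [1, 5, 1, 3]], frac=0.25)
```

With real data, `frac` would stay at 0.01 and the bins would be the genomic bins used for the binding signals.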

Please note that in the original calculations, when the block sampling program was run, the human genome was segmented into three classes, namely DNase hypersensitive peaks, DNase hypersensitive non-peak hotspots, and other regions. The idea was that the prior TRF binding probabilities in these three classes of regions could be quite different, and thus they should be considered separately during the sampling process.

Understanding transcriptional regulation by integrative analysis of transcription factor binding data

Q1:

In the article, it is mentioned that recent studies often depended on techniques such as microarrays, and were therefore unable to measure the expression levels of the isoforms of some genes very accurately. It is also said that these problems do not arise in this study because ENCODE data were used. I looked up the ENCODE project, but I am not quite sure why these data should be more accurate.

A1:
As we described in the paper, ENCODE generated CAGE data that measure the expression level of each TSS (transcription start site) of a gene. These data enable us to assess the effect of the TF binding signal near a TSS on the expression level of that TSS.

Q2: Another point I am not sure about is how this model is used. What kind of data do you have to provide to the program? Do you use transcription factor binding data, or do you just choose your transcription factor and the start site sequence, and the program tells you the probability of getting an mRNA transcript? And if the first option is true, why is it easier to get the binding data of a transcription factor than the expression data? If there are interactions with the chromatin structure, the latter should be more accurate, shouldn’t it?

A2: The input to the model is the TF binding signal near each TSS (for all TFs with ChIP-seq data available from ENCODE) AND the expression levels of all TSSes. Since we are using a supervised model, we randomly select 2000 TSSes for training the model and test the performance of the model on the remaining data. I think your confusion is this: since it is easier and more accurate to measure gene expression by RNA-seq or other experiments, why bother using ChIP-seq TF binding data to make predictions? The goal of our model is not to predict gene expression. The goal is to use the model to quantify the relationship between gene expression and TF binding. We want to know: How much of gene expression can be explained by the TF binding signal? Which TFs are more important? TF binding at which positions contributes more? And other questions.
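
To make the train/test idea concrete, here is a minimal sketch on synthetic data. It is not the model used in the paper: a simple one-variable least-squares fit per TF stands in for the actual supervised models, and all names (`fit_line`, `r_squared`, `held_out_r2`) and the synthetic data are assumptions made for illustration. The held-out R² per TF is one crude way to ask how much expression a given TF's binding signal explains.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y, pred):
    """Fraction of the variance in y explained by the predictions."""
    my = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, pred))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def held_out_r2(binding, expression, n_train, seed=0):
    """binding[i][t] is the signal of TF i near TSS t; returns one R^2 per TF."""
    idx = list(range(len(expression)))
    random.Random(seed).shuffle(idx)  # random split into training and test TSSes
    train, test = idx[:n_train], idx[n_train:]
    scores = []
    for tf in binding:
        slope, intercept = fit_line(
            [tf[t] for t in train], [expression[t] for t in train]
        )
        pred = [slope * tf[t] + intercept for t in test]
        scores.append(r_squared([expression[t] for t in test], pred))
    return scores


# Synthetic demo: expression depends on TF "a" only; TF "b" is unrelated noise.
rng = random.Random(1)
tf_a = [rng.random() for _ in range(300)]
tf_b = [rng.random() for _ in range(300)]
expr = [2 * a + rng.gauss(0, 0.1) for a in tf_a]
scores = held_out_r2([tf_a, tf_b], expr, n_train=200)
```

With real data, `n_train` would be 2000 as in the answer above, and the binding matrix would hold one row per TF with ChIP-seq data.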

Q3: I am also curious whether the developed model is already being used for the more predictive transcription factors, or whether it was not intended to be used in that way. If it has been applied, do you know of any groups who did so? I’m quite interested in whether they could produce consistent data with this method.

A3: To my knowledge, many other groups have also tested models to study the relationship between gene expression and TF binding and/or histone modifications. You may look at the paper by Zhengqing Ouyang in PNAS (PMID:19995984), the paper by Xianjun Dong in Genome Biology (PMID:22950368), and many other publications. Again, the goal is to understand the regulation conferred by TF binding and histone modifications, rather than to predict gene expression.

Question regarding paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:
I am reading your paper "Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors" and I have some questions.
In Figure 1, what do the colors mean?
I also couldn’t understand the plots in Figure 4. What are the black dots, the error bars and the black line?
I would be grateful if you could answer my questions.

A:
In Figure 1, different colors are used for different types of regions. For each type of region, one color is used as the background color and another color is used to show the signal level.

Figure 4 shows standard box-and-whisker plots (http://en.wikipedia.org/wiki/Box_plot). The dots are the means of the distributions. The upper and lower lines are the non-outlier maximum and minimum values, respectively. The black lines in the middle are the medians.
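
For readers unfamiliar with these plots, the quantities named above can be computed as in the sketch below. It assumes the common Tukey convention, in which values more than 1.5 times the interquartile range beyond the quartiles count as outliers; the answer does not state which convention the figure used, so the `whis` factor and the function name `box_stats` are assumptions.

```python
import statistics


def box_stats(values, whis=1.5):
    """Summary values drawn in a box-and-whisker plot (Tukey convention assumed)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inliers = [v for v in data if low_fence <= v <= high_fence]
    return {
        "mean": statistics.mean(data),  # the dot
        "median": median,               # the line in the middle of the box
        "q1": q1,                       # lower edge of the box
        "q3": q3,                       # upper edge of the box
        "whisker_low": min(inliers),    # non-outlier minimum
        "whisker_high": max(inliers),   # non-outlier maximum
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this toy data set the value 100 lies beyond the upper fence, so the upper whisker stops at 9 while the mean is still pulled upward by the outlier.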

Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells

Q:

I am very interested in your recent work "Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells". Could you please advise with respect to the experimental datasets used in your study? I am now looking at the mouse ESC TF dataset from Chen et al., 2008, provided in their supplemental Table S3. Could you please advise whether these data refer to the mm8 or the mm9 mouse genome assembly? The raw data deposited in GEO seem to have been updated in 2012, so they were probably re-mapped to mm9. But do you know which genome build was originally reported in this paper, mm8 or mm9? (For consistency, I want to use the "original" peaks reported by Chen et al. and used in your study, rather than calling peaks again from their raw data.)

A:
We are pleased to hear of your interest in the paper. Regarding the genome assembly, we used mm8, as in the original paper (Chen et al.).