Q1: The article mentions that earlier studies often depended on techniques like microarrays, which is why they could not measure the expression levels of isoforms of some genes very accurately. It also says that these problems do not arise in this study because ENCODE data was used. I looked up the ENCODE project, but I am not quite sure why this data should be more accurate.
A1: As we described in the paper, the ENCODE project generated CAGE data that measure the expression level of each TSS (transcription start site) of a gene. These data let us determine the effect of the TF binding signal near a TSS on the expression level of that TSS.
Q2: Another point I am not sure about is how this model is used. What kind of data do you have to feed into the program? Do you use transcription factor binding data, or do you just choose your transcription factor and the start site sequence, and the program tells you the probability of getting an mRNA transcript? And if the first option is true, why is it easier to get the binding data of a transcription factor than the expression data? If chromatin structure interactions are involved, the latter should be more accurate, shouldn't it?
A2: The input to the model is the TF binding signal near each TSS (for all TFs with ChIP-seq data available from ENCODE) and the expression levels of all TSSs. Since we are using a supervised model, we randomly select 2000 TSSs for training and test the performance of the model on the remaining data. I think your confusion is this: since it is easier and more accurate to measure gene expression by RNA-seq or other experiments, why bother using ChIP-seq TF binding data to make predictions? The goal of our model is not to predict gene expression. The goal is to use the model to quantify the relationship between gene expression and TF binding. We want to know: How much of gene expression can be explained by the TF binding signal? Which TFs are more important? TF binding at which positions contributes more? And other questions.
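To make the train/test setup in A2 concrete, here is a minimal sketch in Python. It is not the model from the paper: ordinary least squares is used purely as a stand-in for the unspecified supervised model, the data are synthetic, and all names (`split_fit_evaluate`, `tf_signal`, `expression`) and sizes (5000 TSSs, 10 TFs) are hypothetical. Only the random selection of 2000 TSSs for training comes from the answer above.

```python
import numpy as np


def split_fit_evaluate(tf_signal, expression, n_train=2000, seed=0):
    """Fit on a random subset of TSSs and score on the remaining ones.

    tf_signal:  array of shape (n_tss, n_tfs), binding signal near each TSS
    expression: array of shape (n_tss,), expression level of each TSS
    """
    rng = np.random.default_rng(seed)
    n_tss = tf_signal.shape[0]

    # Randomly pick n_train TSSs for training; the rest form the test set.
    order = rng.permutation(n_tss)
    train_idx, test_idx = order[:n_train], order[n_train:]

    # Assumption: a linear model with an intercept stands in for the real model.
    design = np.column_stack([np.ones(n_tss), tf_signal])
    coef, *_ = np.linalg.lstsq(design[train_idx], expression[train_idx], rcond=None)

    # Fraction of expression variance explained on the held-out TSSs (R^2).
    observed = expression[test_idx]
    predicted = design[test_idx] @ coef
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # Drop the intercept so that one weight remains per TF.
    return coef[1:], r2, train_idx, test_idx


# Synthetic stand-in data (hypothetical, not ENCODE data).
data_rng = np.random.default_rng(42)
tf_signal = data_rng.normal(size=(5000, 10))
true_weights = np.linspace(-1.0, 1.0, 10)
expression = tf_signal @ true_weights + data_rng.normal(scale=0.5, size=5000)

weights, r2, train_idx, test_idx = split_fit_evaluate(tf_signal, expression)
print(f"held-out R^2: {r2:.3f}")
print("TFs ranked by absolute weight:", np.argsort(-np.abs(weights)))
```

In a sketch like this, the held-out R^2 corresponds to the question "how much of gene expression can be explained by TF binding signal", and the size of each fitted weight gives a rough answer to "which TF is more important", provided the TF signals are on comparable scales.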
Q3: I am also curious whether the developed model is already being used for the more predictive transcription factors, or whether it was not intended to be used that way. If it has been applied, do you know of any groups who did so? I am quite interested in whether they could produce consistent data with this method.
A3: To my knowledge, many other groups have also tested models to study the relationship between gene expression and TF binding and/or histone modification. You may look at the paper by Zhengqing Ouyang in PNAS (PMID:19995984), the paper by Xianjun Dong in Genome Biology (PMID:22950368), and many other publications. Again, the goal is to understand the regulation conferred by TF binding and histone modifications rather than to predict gene expression.