Request for help with your paper (PLoS ONE, 2010)

Q:
Currently, I am working on reconstructing gene regulatory networks. It is
a really interesting topic, and I would like to determine which tools are
suitable for our experimental data. I have read your published paper
"Improved Reconstruction of In Silico Gene Regulatory Networks by
Integrating Knockout and Perturbation Data". In this paper, I cannot
understand the section on learning noise from deletion data.
Step 1: Calculate the probability of regulation Pb->a for each pair of genes
(b, a). I want to know how this probability is calculated, and whether its
value can decide whether there is a potential regulation or not.
Could you explain how this step works? I would appreciate any help you can
give me in figuring it out.

A: Basically, we used the expression levels currently believed to be
unaffected by a deletion to form a Gaussian background. Then, if a gene
has an expression level far from the mean of this Gaussian distribution
(as judged by the probability, under the Gaussian, that an expression
level is as extreme as or more extreme than the observed one), we
consider the gene to be affected by the deletion.
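
For illustration, here is a minimal Python sketch of this idea; the function
and variable names (and the 0.01 cutoff) are ours for the example, not taken
from the paper's code:

    # Illustrative sketch only; not the code used in the paper.
    import numpy as np
    from scipy.stats import norm

    def affected_by_deletion(expr_observed, background_exprs, alpha=0.01):
        """Flag a gene as affected if its observed expression is unlikely
        under the Gaussian background formed by the expression levels
        currently believed to be unaffected by the deletion."""
        mu = np.mean(background_exprs)
        sigma = np.std(background_exprs)
        # Two-sided tail probability: P(|X - mu| >= |observed - mu|)
        z = abs(expr_observed - mu) / sigma
        p = 2.0 * norm.sf(z)
        return p < alpha

    # Toy example: a background of "unaffected" expression levels
    background = np.array([0.9, 1.1, 1.0, 0.95, 1.05, 1.02, 0.98])
    print(affected_by_deletion(2.5, background))   # far from the mean -> True
    print(affected_by_deletion(1.03, background))  # close to the mean -> False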

coevolution webserver

Q:

I have read the article "An integrated system for studying residue coevolution in proteins" and have started using the coevolution web server at http://coevolution.gersteinlab.org/coevolution/. The help page says SCA is one of the coevolution scoring functions, but I cannot find it on the web page. Could you please tell me what is wrong?

A:
SCA is only available in the downloadable version, not at the web site anymore, since it sometimes ran for an extremely long time. If you want to use the SCA method, you can use the downloadable version or the latest SCA method from the Ranganathan group, which I think is better than the version we implemented.

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors

Q:
In your research paper "Classification of human genomic regions based
on experimentally determined binding sites of more than 100
transcription-related factors" you identified regions with extremely
high and low degrees of co-binding, termed respectively "HOT" and "LOT"
regions. Since I’m very interested in this classification, I tried to
reproduce this analysis on transcription-related factor (TRF) data of a
particular cell type.

I first downloaded the Genome Structure Correction scripts from

http://www.encodestatistics.org/svn/genome_structural_correction/python_encode_statistics/trunk/

and ran "block_bootstrap.py" on every pair of TRF data, thus obtaining
a matrix with z-scores. I then computed a raw z-score for each TRF,
defined as the average z-score with all other TRFs in the matrix. I
finally sorted these raw z-scores numerically and normalized them
linearly, so as to assign a weight of 1 to the TRF with the lowest raw
z-score and a weight of 1/n to the TRF with the highest raw z-score.
I’m afraid it is not clear to me what I should do next for the
identification of HOT and LOT regions: I would be very grateful if you
could help me with this analysis.

A:
The z-scores that you have computed can be considered the global weights of the TRFs: a TRF that more frequently binds to the same locations as other TRFs receives a lower weight, in order to de-emphasize global co-binding effects.

For each bin j, the weighted binding score (i.e., the degree of region-specific co-occurrence) was computed as d_j = \sum_i w_i s_{ij}, where i iterates over all TRFs, w_i is the weight of TRF i as defined above, and s_{ij} is its discretized binding signal at bin j (1-5, with 5 for the top 25th percentile and 1 for zeros). The top 1% of bins with the highest d_j were defined as the HOT regions, while the 1% of bins with the lowest non-zero d_j were defined as the LOT regions.
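
As a rough illustration of the two steps above, here is a minimal Python
sketch. The rank-based mapping of raw z-scores to weights between 1 and 1/n
and the quartile-based discretization of non-zero signals into 2-5 are our
reading of the description; the function names and exact tie handling are
ours, not the code used in the paper.

    # Illustrative sketch of the TRF weighting and HOT/LOT region calls.
    import numpy as np

    def trf_weights(raw_z):
        """Map raw z-scores (one per TRF) to weights: the TRF with the
        lowest raw z-score gets weight 1, the highest gets 1/n."""
        n = len(raw_z)
        order = np.argsort(raw_z)                 # ascending: lowest z first
        weights = np.empty(n)
        weights[order] = (n - np.arange(n)) / n   # n/n, (n-1)/n, ..., 1/n
        return weights

    def discretize_signals(signal):
        """Discretize one TRF's binding signals into 1-5: 1 for zero signal,
        2-5 for quartiles of the non-zero signal (5 for the top quartile)."""
        s = np.ones_like(signal, dtype=int)
        nz = signal > 0
        if nz.any():
            q = np.quantile(signal[nz], [0.25, 0.5, 0.75])
            s[nz] = 2 + np.searchsorted(q, signal[nz], side="right")
        return s

    def hot_lot_regions(signal_matrix, raw_z, frac=0.01):
        """signal_matrix: TRFs x bins matrix of continuous binding signals.
        Returns boolean masks marking HOT and LOT bins."""
        w = trf_weights(raw_z)
        s = np.vstack([discretize_signals(row) for row in signal_matrix])
        d = w @ s                                 # d_j = sum_i w_i * s_ij
        hot = d >= np.quantile(d, 1 - frac)
        nonzero = d > 0          # with zeros mapped to 1, d_j is always > 0
        lot = nonzero & (d <= np.quantile(d[nonzero], frac))
        return hot, lot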

Please note that in the original calculations, when the block sampling program was run, the human genome was segmented into three classes, namely DNase hypersensitive peaks, DNase hypersensitive non-peak hotspots, and other regions. The idea was that the prior TRF binding probabilities in these three classes of regions could be quite different, and thus they should be considered separately during the sampling process.

Questions re tool for coevolution analysis

Q:
We are interested in
adding a link to your tool on the website. I have been playing around with
the tool but have been having some difficulty. I loaded the example, ran the
analysis and downloaded the results with no problem. Then, I tried to use
the data from the example to re-run the analysis by uploading the PF01036
fasta file (as an MSA) and listing BACR_HALSA as the reference sequence. I
did not load the tree data or the structure data. I submitted the request
and received the following error:

The coevolution analysis task that you submitted at 2013-09-11 11:35:07.0
could not be completed.

Error message:Not enough sequences.

In order to justify the addition of the link to the website, we need to make
sure that the web interface is simple and easy to use. Can you help me
understand what the problem might be (clearly I have enough sequences since
the example runs without any issues)? Also, can you explain to me the link
between the 1C3W reference structure and the sequence data?

Any help you can give would be greatly appreciated.

A:
The error message is due to an internal check of the system. Since
coevolution analysis requires a good number of sequences to give
reliable statistics, there is a minimum threshold on the number of
sequences remaining after the filtering step. When the example is loaded,
I think some of the filtering settings are preset so that it passes the
minimum threshold. When one manually uploads the MSA, the default settings
could be different.

You may try changing the setting for the minimum number of
sequences in the "Advanced options" section (which is hidden by default)
and rerunning the analysis. Let me know if you still get an error message.

Completion of job on coevolution server

Q:

We are trying to find the co-evolving positions in a protein family of interest. I submitted a job on the co-evolution server several days ago, but I have not received a response yet. Could you please let me know the estimated time of completion of my job?

A:
There was a long queue of pending tasks, and one of the tasks had been stuck in the queue for some time. I have removed it to let the others run. Please see if you get your results within a day. If not, please let me know and I will check the system again.

Architecture of the human regulatory network derived from ENCODE data

Q:

Re: Architecture of the human regulatory network derived from ENCODE data
10.1038/nature11245

Hi Dr. Gerstein: This is a very nice paper and is very important to my
current study. Do you have tools/software for the TF co-association
analysis (Figure 1 and supplemental sections B and C) mentioned in this
paper? Can I get it?

A:
Anshul did the co-association analysis for this Networks paper. I
think he knows that part the best.

As for the co-association analysis in the ENCODE main paper, it can
be repeated using the GSC package available at the ENCODE statistics web
site (http://www.encodestatistics.org/). The first thing you need to do
is to determine (manually or by other means) a segmentation of the
genome in which TF binding is assumed to be segment-wise stationary. If
you have no specific preference on how the segmentation should be done,
you can use the GSC Python segmentation tool, which will try to perform
an automatic segmentation (the results of which will be better if you
have more data). You can then run the GSC Python program to perform
segmented block sampling and compute pairwise p-values for your binding
data.
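
If it helps to see the statistical idea in miniature, below is a rough,
self-contained Python sketch of segmented block sampling for one pair of
binary binding tracks. It is only a conceptual illustration; the
segmentation, block length, overlap statistic, and all names here are
arbitrary choices of ours rather than the GSC implementation, so please use
the actual GSC package for real analyses.

    # Conceptual illustration of segmented block sampling for a pair of
    # binary binding tracks; NOT the GSC implementation.
    import numpy as np

    rng = np.random.default_rng(0)

    def block_resample(track, block_len, rng):
        """Rebuild a track of the same length from randomly placed
        contiguous blocks, preserving local correlation structure."""
        n = len(track)
        out = np.empty(n, dtype=track.dtype)
        pos = 0
        while pos < n:
            start = rng.integers(0, n - block_len + 1)
            take = min(block_len, n - pos)
            out[pos:pos + take] = track[start:start + take]
            pos += take
        return out

    def cobinding_zscore(a, b, segments, block_len=50, n_samples=200, rng=rng):
        """z-score of the observed co-binding of tracks a and b against a
        null built by block-resampling a separately within each segment."""
        observed = np.sum(a & b)
        null = np.empty(n_samples)
        for k in range(n_samples):
            a_null = np.empty_like(a)
            for (s, e) in segments:   # segment-wise (stationary) resampling
                a_null[s:e] = block_resample(a[s:e], min(block_len, e - s), rng)
            null[k] = np.sum(a_null & b)
        return (observed - null.mean()) / null.std()

    # Toy example: two correlated tracks over three segments of 1000 bins each
    a = rng.random(3000) < 0.1
    b = np.where(rng.random(3000) < 0.7, a, rng.random(3000) < 0.1)
    segments = [(0, 1000), (1000, 2000), (2000, 3000)]
    print(cobinding_zscore(a, b, segments))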

DREAM 3 challenge & paper “Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data”

Q:

I am interested in exploring further the work done by you and your team
members in the DREAM 3 challenge, as reported in the paper stated below. Do
you make the code/programs available for the public to view? Thanks.

"Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data"

I am fine with the current software, which you said is quite tailored to the competition. Please send it to me. I would really appreciate it. Thanks.

A:
The current form of the software is quite tailored for the
competition, and we do not have a general, publicly distributable
version. I can send it to you if you think it would be useful.

Please find the version that we submitted to DREAM attached, together with some data and some script files for running it. If you have Apache Ant installed, simply issue the command "ant runall3" to run the program on the DREAM3 files. The size-10 networks are included, and the size-50 and size-100 networks can be downloaded from the DREAM web site.