ENCODE-Networks Source Code for Context-Specific TF Co-Association Analyses

Q:
Hello,
I am interested in your paper published in Nature, 06 September 2012, “Architecture of the human regulatory network derived from ENCODE data”. In particular, we are interested in the framework of context-specific TF co-association analysis described in this paper. We would like to apply this method on our in-house datasets. It’s exciting that the code for these analyses is “Available soon” (the file “enets21.coassoc-code.tgz” on http://encodenets.gersteinlab.org/). Do you know whether the code for co-association analysis in this paper is available now? If so, it might save us a lot of time. Thanks for your help!

A:
The main machine learning method used for the analysis is RuleFit3, which is available here:
http://statweb.stanford.edu/~jhf/r-rulefit/rulefit3/R_RuleFit3.html

Detailed instructions on preparing the input data and computing the various scores are in the supplement of the paper.

I don’t have a polished code package ready for general public use. The code I wrote for the analyses in the paper is here: https://code.google.com/p/tf-coassociation/source/browse/#svn%2Ftrunk%2Fscripts . But I have to warn you that it’s not designed to work on general datasets, as it includes scripts that were written to run on our local cluster. The core functions are in
https://code.google.com/p/tf-coassociation/source/browse/trunk/scripts/assoc.matrix.utils.R . The code is reasonably commented, so hopefully it will help.

Mutations in sensitive and ultra-sensitive regions

Q:
I read your paper entitled “Integrative annotation of variants from 1092 humans: application to cancer genomics” in Science from Oct. 4, 2013. Since mutations in the so-called ultra-sensitive regions play an important role in cancer development, I wonder whether it is possible to find out where those mutations are within the ultra-sensitive regions and what they are? They are mentioned in the paper, but I can’t find them listed there.
Is there somewhere I can go to find the mutations?

A:
Thanks for your interest in our paper.
You can find the genomic coordinates of sensitive and ultra-sensitive regions in Data File S3 provided with the supplement of the paper. For the cancer samples we analyzed, you will find the coordinates and detailed information for candidate drivers in Data File S6; this file also lists whether the mutations are in sensitive or ultra-sensitive regions.

Annotation of SNPs as breaking or conserving TF motifs

Q:

Congrats with a very nice paper in Science (Khurana et al., 2013). I am particularly interested in how you are able to score variants in transcription factor binding sites. According to the supplementary methods you say that: "An SNV that breaks a motif is defined as a mutation that decreases the motif-matching score of the TF-binding site to the position weight matrix (PWM) of the motif (relative to the ancestral allele) (8). Conversely, an SNV that conserves a motif is defined as a mutation that increases the motif-matching score of the TF-binding site to the PWM of the motif."

This makes perfect sense to me. But how do you define the TF-binding site in the first place? I would guess that you apply a threshold on the motif-matching score (to reduce the fraction of false positives) and then define disruption/conservation of the variant relative to this score. As far as I can see, the paper gives no details on this aspect.

You refer to Mu et al. (NAR, 2011); however, I cannot find any further details there either.

I would very much appreciate an explanation of how you find the TF binding sites and if you use any PWM-score thresholds in this respect.

A:

The set of motifs used in the two papers is the set of TF motifs officially released by the ENCODE project; it was also used in the main ENCODE publication in 2012. The algorithm used to detect the motifs was developed by Pouya at MIT. Here is more detail about it:
http://compbio.mit.edu/encode-motifs/

In our paper, we took these motif coordinates and categorized SNVs based on the functional effects you described.
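As a rough sketch of that categorization step (the PWM, sequences, and function names below are hypothetical illustrations, not the actual pipeline code), an SNV can be labeled by comparing the PWM match score of the motif instance carrying the ancestral allele against the one carrying the derived allele:

```python
import math

# Hypothetical 4-position PWM: base probabilities at each motif position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # uniform background base frequency

def pwm_score(seq):
    """Log-odds motif-matching score of a sequence against the PWM."""
    return sum(math.log2(PWM[i][base] / BACKGROUND) for i, base in enumerate(seq))

def classify_snv(site_seq, offset, ancestral, derived):
    """Label an SNV inside a motif instance as motif-breaking or motif-conserving."""
    ref = site_seq[:offset] + ancestral + site_seq[offset + 1:]
    alt = site_seq[:offset] + derived + site_seq[offset + 1:]
    delta = pwm_score(alt) - pwm_score(ref)
    if delta < 0:
        return "breaks"      # score decreases relative to ancestral allele
    if delta > 0:
        return "conserves"   # score increases relative to ancestral allele
    return "neutral"

print(classify_snv("ACGT", 0, "A", "C"))  # breaks (A is strongly preferred at position 0)
```

The real analysis uses the ENCODE motif instances and their released PWMs; this only illustrates the score-difference definition quoted from the supplement.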

prefix ‘chr’ in liftOver

Q:

From the documentation @http://info.gersteinlab.org/AlleleSeq:

Chain files

Using the chain file, one can use the liftOver tool to convert annotation coordinates from the reference genome to the personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.

249242013 1

10329 1 0

109 1 0

30199 3 0

My bed file:

chr1 14541 14542

chr1 14652 14653

chr1 14676 14677

chr1 14906 14907

A:

It looks like liftOver failed because the .bed and .chain files use different chromosome naming conventions: in the .bed file chromosomes are named with the ‘chr’ prefix, while in the chain files they have no such prefix.
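One quick fix (a sketch; the helper name is ours, and file handling is left to the reader) is to strip the ‘chr’ prefix from each BED record so the chromosome names match the chain file before running liftOver:

```python
def strip_chr_prefix(bed_line):
    """Remove a leading 'chr' from the chromosome field of one BED line."""
    fields = bed_line.rstrip("\n").split()
    if fields and fields[0].startswith("chr"):
        fields[0] = fields[0][3:]
    return "\t".join(fields)

# Example with the records from the question:
for line in ["chr1 14541 14542", "chr1 14652 14653"]:
    print(strip_chr_prefix(line))  # "1<TAB>14541<TAB>14542", etc.
```

Equivalently, one could add the ‘chr’ prefix to the chain file instead; either way, the two files must agree.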

fosmid indel

Q:

I would like to ask about what kind of indels are incorporated into the diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

The readme says that 829,454 indels were used to construct this genome. What confuses me is that when I perform a BLAST search with one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf (P2_M_061510_21_73), it shows up in both the maternal and paternal haplotypes. Is there any size cutoff used for the indels selected for this assembly?

A:

Unfortunately, in the latest version no fosmid indels/SVs were used; only the variant output of GATK Best Practices v3 was used, even though fosmid data was indeed used to construct earlier versions of the diploid genome. We might include them in the future. Thank you.

rulefit3 in encodenets

Q:
I read your paper about the co-associations among TF binding events (“Architecture of the human regulatory network derived from ENCODE data”) and became interested in your clustering approach. In our laboratory, we are developing a new clustering algorithm for large genomic datasets and have implemented a prototype. However, its accuracy is not yet satisfactory, and we need to evaluate it. We would therefore like to use your algorithm as a baseline. Is the program available, and if so, could you tell us how to use it?

A:
In that paper we used the Rulefit3 package from Prof. Jerome Friedman; there is an R package available at the link below. Our use of the algorithm is extensively documented in Section C of the Supplementary Materials.

Rulefit3
http://dx.doi.org/10.1214/07-Aoas148
http://www-stat.stanford.edu/~jhf/r-rulefit/rulefit3/R_RuleFit3.html

Architecture of the human regulatory network derived from ENCODE data http://dx.doi.org/10.1038/Nature11245

missing citations in encodenets supplement

Q:

With regard to the paper published in Nature, “Architecture of the human regulatory network derived from ENCODE data”: while perusing the Supplementary Information, I found that reference No. 69 appears to have been mapped incorrectly. The following quote promises a reference to a RuleFit3 manuscript, but the listed reference is instead a paper on transcriptional regulation in mast cells:

The number of rules is not set a priori but is rather learned from the data itself. Details are provided in the RuleFit3 manuscript69. -P. 14/271

69 Bockamp, E. O. et al. Transcriptional regulation of the stem cell leukemia gene by PU.1 and Elf-1. J. Biol. Chem. 273, 29032-29042 (1998).

A:

It turns out that references 69-71 in section C2 of the supplementary material were not correctly added to the reference list. References 69-71 in later sections refer to the correct articles. Below are the correct citations for refs 69-71 in section C2 of the supplement.

Rulefit3 (ref 69)
Friedman, J. H. & Popescu, B. E. Predictive Learning Via Rule Ensembles. Annals Applied Stat. 2, 916-954, doi:10.1214/07-Aoas148 (2008).
http://dx.doi.org/10.1214/07-Aoas148

the well-known random forest algorithm (ref 70)
Breiman, L. Random forests. Mach Learn 45, 5-32, doi:10.1023/A:1010933404324 (2001). http://dx.doi.org/10.1023/A:1010933404324

the GREAT Functional Annotation server (ref 71)
McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology 28, 495-U155, doi:10.1038/nbt.1630 (2010). http://dx.doi.org/10.1038/nbt.1630
http://great.stanford.edu/

Multinet (Unified global network) – academic use

Q:
I read your seminal paper “Interpretation of Genomic Variants Using a Unified Biological Network Approach”, recently published in PLoS Computational Biology. I have a few queries:
Is the network available for academic use?
Can we download the relevant Multinet to form hypotheses and do experiments?

A:
Please find the downloadable network at
http://homes.gersteinlab.org/Khurana-PLoSCompBio-2013/

Do you need parents’ genotype data?

Q:

I am looking for a tool to detect allele-specific expression from re-sequencing and RNA-seq data, and AlleleSeq looks quite powerful. I noticed that the software needs the parents’ genotype data as input: it requires a VCF file containing trio genotypes to create the maternal and paternal genomes. But in my case I only have genotype information from a single individual; how can I use AlleleSeq?

A:

You don’t have to genotype the parents. You only need the variants phased in whatever way you can/wish (the vcf2diploid tool only looks at the one column with information for the individual of interest and does not consider other columns). Having the trio sequenced is an easy and probably the best way to do it.

If you have only the mother’s genotype, you can still phase a good fraction of the heterozygous variants. Each remaining unphased variant will be randomly assigned to a particular haplotype, so half of them will also be correct. And, of course, all homozygous variants are trivially phased.
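As a rough illustration of that logic (a hypothetical helper, not part of AlleleSeq or vcf2diploid), a child’s heterozygous site can be phased from the mother’s genotype alone whenever the mother is uninformatively homozygous for exactly one of the child’s alleles; otherwise the assignment falls back to a random choice:

```python
import random

def phase_het(child_alleles, mother_alleles):
    """Return (maternal, paternal) alleles for a heterozygous child site.

    child_alleles: the child's two alleles, e.g. ("A", "G").
    mother_alleles: the mother's two alleles, e.g. ("A", "A").
    """
    a, b = child_alleles
    if a in mother_alleles and b not in mother_alleles:
        return (a, b)  # a must have come from the mother
    if b in mother_alleles and a not in mother_alleles:
        return (b, a)  # b must have come from the mother
    # Ambiguous (mother heterozygous or uninformative): assign at random,
    # so on average half of these sites land on the correct haplotype.
    return (a, b) if random.random() < 0.5 else (b, a)

print(phase_het(("A", "G"), ("A", "A")))  # ('A', 'G'): A is maternal
```

With a full trio, the father’s genotype resolves most of the ambiguous cases as well, which is why trio sequencing phases the largest fraction of sites.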

Mismatches between the paternal and maternal chromosomes

Q:
I believe I have discovered numerous errors in the NA12878 dataset. We are working with the most recent version, NA12878_diploid_genome_may3_2011. They are all single-base-pair mismatches between the paternal and maternal chromosomes in regions that the accompanying .map file marks as contigs.

A:
The .map file shows contiguous equivalent (gap-free) blocks between the haplotypes, but these blocks do include SNPs. Heterozygous SNPs will therefore appear as single-base mismatches within a block; they are expected, not errors.
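To make this concrete (a sketch with made-up sequences, not the actual .map format): within one gap-free block, the per-base differences between the two haplotypes are exactly the heterozygous SNPs, so a nonzero mismatch count inside a block is normal:

```python
def het_snp_positions(paternal, maternal):
    """Positions where two equal-length haplotype blocks differ (het SNPs)."""
    assert len(paternal) == len(maternal), "blocks must be gap-free and aligned"
    return [i for i, (p, m) in enumerate(zip(paternal, maternal)) if p != m]

# One aligned block carrying two heterozygous SNPs:
print(het_snp_positions("ACGTACGT", "ACATACGA"))  # [2, 7]
```

Indels, by contrast, break the alignment and start a new block in the .map file, which is why blocks are gap-free but not SNP-free.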