ENCODE-Networks Source Code for Context-Specific TF Co-Association Analyses

I am interested in your paper published in Nature on 06 September 2012, “Architecture of the human regulatory network derived from ENCODE data”. In particular, we are interested in the framework for context-specific TF co-association analysis described in this paper, and we would like to apply this method to our in-house datasets. The code for these analyses is listed as “Available soon” (the file “enets21.coassoc-code.tgz” on http://encodenets.gersteinlab.org/). Do you know whether the code for the co-association analysis in this paper is available now? If so, it might save us a lot of time. Thanks for your help!

The main machine learning method used for the analysis is RuleFit3 which is available here

Detailed instructions on preparing the input data and computing the various scores are in the supplement of the paper.

I don’t have a polished code package that is ready for general use. The code that I wrote for the analyses in the paper is here: https://code.google.com/p/tf-coassociation/source/browse/#svn%2Ftrunk%2Fscripts . But I have to warn you that it’s not designed to work on general datasets, as it includes scripts that were written to run on our local cluster. The core functions are in https://code.google.com/p/tf-coassociation/source/browse/trunk/scripts/assoc.matrix.utils.R . The code is reasonably well commented, so hopefully it will help.

Running FunSeq

I recently read your paper on FunSeq, and I am very interested in using it to address some of my questions regarding cortex plasticity. However, I’m not very familiar with the Linux/UNIX environment this software runs in, and all I have is a Mac laptop. Could you tell me how I could use this software on a Mac, or where I could find instructions for doing so?

You should be able to download this software on a mac and use it.
You can download it from funseq.gersteinlab.org.

Since you are not familiar with downloading software, have you tried the online version at http://funseq.gersteinlab.org/analysis ?
You can upload your file and see what you get.

Mutations in sensitive and ultra-sensitive regions

I read your paper entitled “Integrative annotation of variants from 1092 humans: application to cancer genomics” in Science (Oct. 4, 2013). Since mutations in the so-called ultra-sensitive regions play an important role in cancer development, I wonder whether it is possible to find out where those mutations are in the ultra-sensitive regions and what mutations they are. I can’t find them in the paper, although they are mentioned.
Is there somewhere I can go to find the mutations?

Thanks for your interest in our paper.
You can find the genomic coordinates of sensitive and ultra-sensitive regions in Data File S3 provided with the supplement of the paper. For the cancer samples we analyzed, you will find the coordinates and detailed information for candidate drivers in Data File S6; this file also lists whether the mutations are in sensitive or ultra-sensitive regions.
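
For readers who want to check their own mutation calls against the region coordinates, here is a minimal Python sketch of the lookup. It assumes the regions have already been parsed into (chromosome, start, end, label) tuples with 0-based, half-open coordinates and no overlaps within a chromosome. The actual layout of Data File S3 is not described here, so the tuple format and the function names `build_index` and `label_position` are hypothetical.

```python
import bisect

def build_index(regions):
    """Group (chrom, start, end, label) tuples by chromosome, sorted by start.

    Assumption: coordinates are 0-based, half-open, and regions on the same
    chromosome do not overlap.
    """
    grouped = {}
    for chrom, start, end, label in regions:
        grouped.setdefault(chrom, []).append((start, end, label))
    index = {}
    for chrom, intervals in grouped.items():
        intervals.sort()
        starts = [start for start, _, _ in intervals]
        index[chrom] = (starts, intervals)
    return index

def label_position(index, chrom, pos):
    """Return the label of the region containing pos, or None if there is none."""
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    # Rightmost interval whose start is <= pos.
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0:
        start, end, label = intervals[i]
        if start <= pos < end:
            return label
    return None

# Made-up coordinates, for illustration only.
example_regions = [
    ("chr1", 100, 200, "sensitive"),
    ("chr1", 500, 600, "ultra-sensitive"),
]
example_index = build_index(example_regions)
print(label_position(example_index, "chr1", 550))
```

For the cancer samples analyzed in the paper this lookup is unnecessary, since Data File S6 already records the region type for each candidate driver.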

Annotation of SNPs as breaking or conserving TF motifs


Congratulations on a very nice paper in Science (Khurana et al., 2013). I am particularly interested in how you score variants in transcription factor binding sites. The supplementary methods state: "An SNV that breaks a motif is defined as a mutation that decreases the motif-matching score of the TF-binding site to the position weight matrix (PWM) of the motif (relative to the ancestral allele) (8). Conversely, an SNV that conserves a motif is defined as a mutation that increases the motif-matching score of the TF-binding site to the PWM of the motif."

This makes perfect sense to me. But how do you define the TF-binding site in the first place? I would guess that you apply a threshold on the motif-matching score (to reduce the fraction of false positives), and that you then define disruption or conservation by the variant relative to this score. As far as I can see, the paper gives no details on this aspect.

You refer to Mu et al. (NAR, 2011), but I cannot see any further details there.

I would very much appreciate an explanation of how you find the TF binding sites and if you use any PWM-score thresholds in this respect.


The motifs we used in the two papers are the set of TF motifs officially released by the ENCODE project, which was also used in the main ENCODE publication in 2012. The algorithm to detect the motifs was developed by Pouya at MIT. Here is more detail about it.

In our paper, we took these motif coordinates and categorized SNVs based on the functional effects you described.
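
The quoted definition of motif-breaking and motif-conserving SNVs can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy PWM, the log-odds scoring against a uniform background, and the names `log_odds_score` and `classify_snv` are hypothetical and are not taken from the paper's pipeline or from the ENCODE motif set. The sketch treats the binding site as already given (as in the answer above, where sites come from the released motif coordinates) and only compares the ancestral and derived scores.

```python
import math

# Toy 4-position PWM (probabilities per base); invented for illustration only.
TOY_PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

def log_odds_score(site, pwm, background=0.25):
    """Sum of log2(p / background) over the site; assumes no zero probabilities."""
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(site))

def classify_snv(ancestral_site, offset, derived_base, pwm):
    """Compare the derived allele's score with the ancestral allele's score."""
    derived_site = ancestral_site[:offset] + derived_base + ancestral_site[offset + 1:]
    delta = log_odds_score(derived_site, pwm) - log_odds_score(ancestral_site, pwm)
    if delta < 0:
        return "motif-breaking"    # score decreases relative to the ancestral allele
    if delta > 0:
        return "motif-conserving"  # score increases, per the quoted definition
    return "no change"

print(classify_snv("AGCT", 0, "C", TOY_PWM))
```

Note that no score threshold appears in this sketch: whether a position counts as a binding site is decided upstream by the motif calls, and the classification depends only on the sign of the score change.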

prefix ‘chr’ in liftOver


From the documentation @http://info.gersteinlab.org/AlleleSeq:

Chain files

Using the chain file, one can use the liftOver tool to convert the annotation coordinates from the reference genome to the personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.

249242013 1
10329 1 0
109 1 0
30199 3 0

My bed file:

chr1 14541 14542
chr1 14652 14653
chr1 14676 14677
chr1 14906 14907


liftOver appears to have failed because the .bed and .chain files use different chromosome naming conventions. In the .bed file, chromosomes are named with the prefix ‘chr’, while in the chain files they have no such prefix.
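
One way to reconcile the two naming conventions is to strip the ‘chr’ prefix from the BED file before running liftOver. The Python sketch below is a minimal illustration under stated assumptions: the function names `strip_chr_prefix` and `convert_bed` are hypothetical, and it assumes that every data line starts with the chromosome name and that header lines (for example `track` or `browser` lines) do not start with ‘chr’.

```python
def strip_chr_prefix(bed_line):
    """Drop a leading 'chr' from one BED line; other lines pass through unchanged."""
    if bed_line.startswith("chr"):
        return bed_line[len("chr"):]
    return bed_line

def convert_bed(lines):
    """Apply strip_chr_prefix to every line of a BED file."""
    return [strip_chr_prefix(line) for line in lines]

# The BED records from the question above.
bed_records = [
    "chr1 14541 14542",
    "chr1 14652 14653",
    "chr1 14676 14677",
    "chr1 14906 14907",
]
for record in convert_bed(bed_records):
    print(record)
```

The reverse approach, adding ‘chr’ to the chromosome names in the chain file, should work equally well. Either way, it is worth checking how the mitochondrial chromosome is named, since some references use ‘MT’ rather than ‘M’.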

fosmid indel


I would like to ask about what kind of indels are incorporated into the diploid genome assembly of the NA12878 individual, available from your lab:

In the readme it says that 829,454 indels were used to construct this genome. What makes me confused is that when I perform a BLAST search with one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf (P2_M_061510_21_73), it shows up in both the maternal and paternal haplotypes. Is there any size cutoff used for the indels that have been selected for this assembly?


Unfortunately, in the latest version, no fosmid indels/SVs were used; only the variant output of GATK Best Practices v3 was used, even though fosmid data was indeed used to construct the earlier versions of the diploid genome. We might include them in the future. Thank you.

rulefit3 in encodenets

I read your paper about the co-associations among TF binding events (“Architecture of the human regulatory network derived from ENCODE data”) and became interested in your clustering algorithm. In our laboratory, we are developing a new clustering algorithm for large amounts of genomic data and have implemented a prototype. However, the accuracy of our algorithm has not yet been established, and we need to evaluate it. We would therefore like to use your algorithm as a baseline. How can we use it? If the program is available to us, could you tell us how to use it?

In that paper we used the RuleFit3 package from Prof. Jerome Friedman; there is an R package available at the link below. Our use of the algorithm is extensively documented in Section C of the Supplementary Materials.


Architecture of the human regulatory network derived from ENCODE data http://dx.doi.org/10.1038/Nature11245