I am currently running HiC-spector on mouse genome datasets with a bin size of 5 kb. I noticed that it requires quite a lot of memory, so I was wondering whether any tests were done on HiC-spector’s space complexity, as I couldn’t find such studies in the Supplementary Data.
We didn’t do such an analysis explicitly. Because the contact maps are stored as sparse matrices, the memory won’t grow quadratically with the number of bins. In general, if the calculation is done chromosome by chromosome, a 5 kb bin size should be fine.
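To illustrate why sparse storage keeps memory proportional to the number of observed contacts rather than to the square of the number of bins, here is a minimal sketch; this is not HiC-spector's actual code, and the bin coordinates and counts are made up:

```python
from collections import defaultdict

def build_sparse_contact_map(contacts):
    """Store only nonzero bin-pair counts; memory grows with the
    number of observed contacts, not with n_bins**2."""
    cmap = defaultdict(int)
    for i, j, count in contacts:
        # symmetrize so (i, j) and (j, i) map to the same entry
        key = (min(i, j), max(i, j))
        cmap[key] += count
    return dict(cmap)

# Toy example: a chromosome binned at 5 kb may have tens of thousands
# of bins, but only the observed pairs are stored.
contacts = [(0, 1, 5), (1, 0, 3), (2, 2, 7)]
cmap = build_sparse_contact_map(contacts)
print(len(cmap))     # 2
print(cmap[(0, 1)])  # 8
```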
I plan to use your program OrthoClust, but I am a little confused about the format of the network input files. The README just says a list of the nodes; if the first column is node A, is the second column a node co-associated with node A?
You are right. A net file is simply what some people call an edge list: two numbers in a row form an edge. You can see examples in the data folder on GitHub.
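In case it helps, a minimal sketch of reading such a file in Python, assuming nodes are labeled by integers as in the example data:

```python
def parse_edgelist(lines):
    """Parse a net file: every non-empty line holds two node ids that
    together form one (undirected) edge."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # only the first two columns matter; extra columns are ignored
            edges.append((int(parts[0]), int(parts[1])))
    return edges

# Example: three edges, one per line
net = ["1 2", "1 3", "2 3"]
print(parse_edgelist(net))  # [(1, 2), (1, 3), (2, 3)]
```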
I am contacting you regarding the OrthoClust program that your group has on GitHub, and I have a couple of questions about how to apply the program to new datasets. First, how was the co-appearance matrix calculated from the OrthoClust output? Second, is it necessary to modify the initial number of spin states (q) or the coupling constant (k) used in the 2014 Genome Biology paper? I am not able to find these options in the current release and am wondering where the values can be changed in the code.
The current implementation on GitHub is based on a heuristic, rather than the simulated annealing method used in the 2014 Genome Biology paper. The initial number of spin states q is no longer a parameter you have to supply; it is set to the total number of nodes in the system. As explained in the README, the coupling constant k is supplied in one of the input files (the coupling information file); it should be the third column of that file. In my example (the ortho_info file in the data folder), the third column is all 1, meaning k = 1.
For the co-appearance matrix, notice that the output file is tab-delimited and consists of three columns. The first and second columns are the species id and the gene id as given by the input files; the third column is a module id. Suppose there are N1 genes in species 1 and N2 genes in species 2; the co-appearance matrix then has dimension (N1+N2) by (N1+N2). One should build a map from the genes of the individual species to indices running from 1 to (N1+N2). If there are n genes in a module, then all pairwise combinations of those n genes should be marked as 1 in the corresponding matrix elements.
One output file can be used to make a binary co-appearance matrix (entries 0 or 1). If you have multiple output files from multiple runs of the algorithm, adding the results together gives the final co-appearance matrix shown in the Genome Biology paper. Of course, to make a plot like the heat map shown in the paper, one has to further cluster the matrix to arrange the rows and columns.
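A minimal sketch of the procedure just described, assuming the output file has already been parsed into (species_id, gene_id, module_id) triples; the indices here are 0-based for convenience:

```python
from itertools import combinations

def coappearance_matrix(assignments):
    """assignments: list of (species_id, gene_id, module_id) triples.
    Returns (index_map, matrix) where matrix[i][j] == 1 iff genes i
    and j were placed in the same module."""
    # map each (species, gene) pair to a matrix index (0-based here)
    genes = [(s, g) for s, g, _ in assignments]
    idx = {gene: k for k, gene in enumerate(genes)}
    n = len(genes)
    mat = [[0] * n for _ in range(n)]
    # group gene indices by module id
    modules = {}
    for s, g, m in assignments:
        modules.setdefault(m, []).append(idx[(s, g)])
    # mark all pairwise combinations within each module
    for members in modules.values():
        for i, j in combinations(members, 2):
            mat[i][j] = mat[j][i] = 1
    return idx, mat

def sum_runs(matrices):
    """Element-wise sum of co-appearance matrices from multiple runs."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)]
            for i in range(n)]

# Toy example: two genes of species 1 land in module 1, one gene of
# species 2 in module 2
idx, mat = coappearance_matrix([(1, "a", 1), (1, "b", 1), (2, "c", 2)])
print(mat[0][1])  # 1
print(mat[0][2])  # 0
```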
If you use Julia, I may be able to send you a little script.
I am reading your paper and have a question about the TF–target gene network data downloaded from http://encodenets.gersteinlab.org/. I would like to know which refGene release and gene symbols you used when identifying TF target genes from the ChIP-seq data. I find that some symbols are not included in the hg19 refGene file I downloaded from UCSC.
The server was down for a while, and I wasn’t sure which names you were talking about. I now think the names are from GENCODE, but I cannot recall the exact release we used; in general the names shouldn’t change between releases. You can see all the releases on the GENCODE site; the names should be in one of the metadata files.
Recently I read your paper titled “Comparative analysis of regulatory information and circuits across distant species”. In it, you wrote that you used simulated annealing to reveal the organization of regulatory factors in three layers: master regulators, intermediate regulators, and low-level regulators. However, I cannot find the program for this method or any references describing it. I would like to use this method to classify the TFs in my own regulatory network. Could you kindly provide the program?
An initial version of the code used for the analysis is available from encodenets.gersteinlab.org.
More recently, our group published an updated method; that code will be released very soon.
Recently I have been trying to use the OrthoClust R package for multiple-species network clustering. I did not find the control parameters for the annealing process described in your paper: "Standard simulated annealing was employed. Spin values were randomly assigned initially, and updated via a heat bath algorithm. The initial temperature was chosen in a way such that the flipping rate (the probability that a node changes its spin state) was higher than 1 – 1/q. The temperature was gradually decreased with a cooling factor 0.9, until the flipping rate was less than 1%." I also did not find the simulated annealing algorithm in the MATLAB file OrthoClustN.m (it appears to be replaced by a greedy algorithm). Please help me resolve this. Thank you for your time.
The annealing procedure is very slow for practical problems. In the revision stage of our manuscript we switched to the greedy (Louvain) algorithm, wrapped it up in the MATLAB code, and implemented it in R too. It is a very well regarded algorithm, and we strongly encourage you to try the MATLAB code for your purpose.
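For reference, the annealing schedule quoted in the question (random initial spins, heat-bath updates, cooling factor 0.9, stop when the flipping rate drops below 1%) can be sketched on a toy Potts model. This is only an illustration with a made-up agreement-based energy, not the code used in the paper:

```python
import math
import random

def anneal(neighbors, q, t0=1.0, cooling=0.9, min_flip_rate=0.01,
           max_sweeps=200):
    """Heat-bath annealing on a toy Potts model: each node favors the
    spin most common among its neighbors. The temperature is multiplied
    by `cooling` after every sweep until fewer than `min_flip_rate` of
    the nodes change spin (or `max_sweeps` is reached)."""
    random.seed(0)  # deterministic for illustration only
    n = len(neighbors)
    spins = [random.randrange(q) for _ in range(n)]
    t = t0
    for _ in range(max_sweeps):
        flips = 0
        for v in range(n):
            # heat-bath weights exp(agreement / t), shifted by the
            # maximum for numerical stability at low temperature
            agree = [sum(1 for u in neighbors[v] if spins[u] == s)
                     for s in range(q)]
            top = max(agree)
            weights = [math.exp((a - top) / t) for a in agree]
            # sample the new spin proportionally to the weights
            r = random.random() * sum(weights)
            new, acc = 0, weights[0]
            while acc < r:
                new += 1
                acc += weights[new]
            if new != spins[v]:
                flips += 1
            spins[v] = new
        if flips / n < min_flip_rate:
            break
        t *= cooling
    return spins

# Toy usage: a ring of 10 nodes, q set to the number of nodes
ring = [[(i - 1) % 10, (i + 1) % 10] for i in range(10)]
spins = anneal(ring, q=10)
print(len(spins))  # 10
```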
I just read your recently published paper on the OrthoClust approach. It is well-grounded work from both practical and mathematical points of view.
I ran your R scripts on my own data and they worked perfectly fine; however, I am wondering how I can use the script for more than two species.
I would appreciate any help you can offer in finding a solution.
Thanks for your interest in OrthoClust. OrthoClust definitely works on more than two species. The R script is a primitive version for illustrating the concept outlined in the paper. We understand the importance of the N-species generalization and have put up new MATLAB code for N species. It makes use of an efficient third-party implementation of the Louvain algorithm for modularity optimization, written by Mucha and Porter. That code, along with our wrapper, is now in the Gerstein Lab GitHub.
Apart from MATLAB, we are planning to provide wrappers for Python or R later.
The N-species code is not exactly what we used for the paper, so if you find any bugs or have questions, please let me know. We are trying to make a more user-friendly package anyway.
I recently read your article “Construction and Analysis of an Integrated Regulatory Network Derived from High-Throughput Sequencing Data”. In the last year, I measured mRNA and miRNA expression in the different types of mouse skeletal muscle fibers to discover the different regulatory circuits activated in fast and slow myofibers. I designed a preliminary network using databases of miRNA–target mRNA and protein–protein interactions, and I have started to include my expression data in order to understand the biological meaning. I was wondering whether it is possible to use your more accurate mouse regulatory network for my data. Is this network free to use? In the article and on your laboratory's website I did not find any file or link with the complete networks that you describe. I am not a computational biologist, but the paper is very interesting, and I think that the network you designed with your method could be very useful for the scientific community.
I attach three files for our three mouse networks: 1) how miRNAs target genes (this is not our own calculation, but was downloaded from TargetScan); 2) how TFs target genes; 3) how TFs target miRNAs, based on ChIP-seq data for 12 TFs.
The files are in plain text format. The first column lists the regulators and the second column lists the targets. The label in brackets next to a gene name gives the class of the gene: TF for transcription factors, MIR for miRNAs, and X for non-TF protein-coding genes.
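A small sketch of parsing files in this format, assuming the bracketed label is appended directly to each name, e.g. something like "Gata1(TF)" or "mmu-mir-1(MIR)" (the gene names here are made up for illustration):

```python
import re

# matches NAME(CLASS) where CLASS is one of the three classes described
NODE_RE = re.compile(r"^(?P<name>.+)\((?P<cls>TF|MIR|X)\)$")

def parse_node(token):
    """Split 'NAME(CLASS)' into (name, class)."""
    m = NODE_RE.match(token.strip())
    if not m:
        raise ValueError(f"unrecognized node token: {token!r}")
    return m.group("name"), m.group("cls")

def parse_network(lines):
    """Each line: regulator<TAB>target, both annotated with a class."""
    edges = []
    for line in lines:
        if not line.strip():
            continue
        reg, tgt = line.rstrip("\n").split("\t")
        edges.append((parse_node(reg), parse_node(tgt)))
    return edges

# Hypothetical example line
print(parse_network(["Gata1(TF)\tMyod1(X)"]))
# [(('Gata1', 'TF'), ('Myod1', 'X'))]
```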
Thank you for your interest in our paper. I hope this information will be useful for your work.
I’ve been incorporating the ENCODE data from your webpage (http://encodenets.gersteinlab.org/) in my analyses. The data are fantastic, but I have questions regarding the enets*.GM_proximal_*filtered_network.txt files. The filtered dataset actually contains more regulators than the unfiltered set, making me suspect that the unfiltered data file is not complete:
[bb447@compute-8-2 TF]$ cut -f1 enets6.GM_proximal_unfiltered_network.txt | sort -u | wc -l
[bb447@compute-8-2 TF]$ cut -f1 enets8.GM_proximal_filtered_network.txt | sort -u | wc -l
Could it be possible that the file is incomplete?
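The consistency check being run above can also be scripted; a minimal Python sketch, where the subset assertion encodes the expectation that a complete unfiltered file contains every regulator from the filtered one:

```python
def regulators(lines):
    """Distinct regulator names: the first tab-separated column of
    each non-empty line."""
    return {line.split("\t")[0] for line in lines if line.strip()}

# For a complete release, the filtered network's regulators should be
# a subset of the unfiltered network's, e.g.:
#   with open("enets6.GM_proximal_unfiltered_network.txt") as fh:
#       unfiltered = regulators(fh)
#   with open("enets8.GM_proximal_filtered_network.txt") as fh:
#       filtered = regulators(fh)
#   assert filtered <= unfiltered, "filtered file has extra regulators"

# Toy illustration with made-up lines
print(sorted(regulators(["TF1\tgeneA", "TF2\tgeneB", "TF1\tgeneC"])))
# ['TF1', 'TF2']
```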
The updated files have been uploaded to the site. Thanks again for pointing this out.
I am very familiar with the ENCODE TF datasets, as I have been applying them to various problems in my PhD. I was interested in the expression analysis across human tissues for the ((miR –> TF) –> targets) FFL. The Supplementary file (section H) cites the protein-coding expression atlas of Su et al. 2004 for the TF and protein-coding targets in this loop, but there doesn’t seem to be a reference for the corresponding miRNA expression data. I assume it would be Landgraf et al. 2007, “A mammalian microRNA expression atlas based on small RNA library sequencing”, since this allows matched tissues and samples with Su et al., but it might be some other dataset. It would be helpful to be able to replicate/extend the FFL analysis using the correct data. Would you be able to forward this email to the relevant person(s) to confirm whether the microRNA expression was taken from the Landgraf atlas? Many thanks for your help.
Slight correction: the FFL studied for the expression pattern of its components is the other way round: ((TF –> miR) –> targets).
The miRNA expression is actually from Lu et al., Nature 2005. If you go to the site, it is under the heading "MicroRNA Expression Profiles Classify Human Cancers".